See the OSdev wiki for details on sysenter, including a note about how to avoid a security/safety problem. The Intel and AMD manuals also cover this, and go into a lot of the detail that OS developers need. See the x86 tag wiki for links.
Overview of the various system-call instructions:
- int: available since forever (8086).
- Trap by executing an invalid instruction: apparently this was the fastest way to enter the kernel on 80386, but that's not the case anymore.
- call gate (i.e. a far call). See the OSdev link for details on that and traps.
- sysenter (http://wiki.osdev.org/Sysenter): Introduced by Intel before x86-64 existed, adopted by AMD not long after (many years ago). Available on all modern x86 CPUs. Very minimalist design: it doesn't save EIP, ESP, or EFLAGS anywhere, so the kernel can only return with user-space's cooperation.
Linux supports it in both 32-bit and 64-bit kernels, but only for system calls from 32-bit processes. IDK if you could design a kernel that used it for 64-bit system calls as well / instead. (I know that wasn't the question, but it's related.)
Concretely, user-space has to provide the return address and save its own ESP and EFLAGS before executing sysenter. In Linux, the kernel exports a page of code (the VDSO) that implements the user-space side of this dance. User-space is expected to call this code instead of using sysenter directly, but feel free to design your OS however you want. Looking at Linux's code for both sides of this dance will probably be useful if you don't find an example somewhere else.
- syscall from 64-bit user-space: available on every x86-64 CPU because Intel implemented it along with the rest of AMD64. Well-designed interface that masks RFLAGS (with a configurable mask) before entering the kernel, so the kernel entry point can start with interrupts already disabled, instead of leaving a race window before it could run cli itself. Used with swapgs for the kernel to get access to its stack and so on.
On mainstream x86 OSes (like Linux), syscall is the only way to make 64-bit system calls.
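As a minimal sketch of what a native 64-bit system call looks like from user-space, assuming Linux on x86-64 and GCC/Clang inline asm: the helper name raw_getpid is made up for illustration, and 39 is Linux's x86-64 call number for getpid. On any other target the sketch just falls back to libc.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper for illustration: getpid via a raw `syscall`
 * instruction on x86-64 Linux, or via libc anywhere else. */
static inline long raw_getpid(void)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* Linux x86-64 convention: call number in RAX, result in RAX.
     * The syscall instruction itself overwrites RCX (with the return
     * RIP) and R11 (with RFLAGS), so both are declared as clobbers. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(39L)            /* getpid on x86-64 Linux */
                      : "rcx", "r11", "memory");
    assert(ret > 0);                        /* getpid cannot fail */
    return ret;
#else
    return (long)getpid();                  /* no raw syscall on this target */
#endif
}
```

Calling raw_getpid() should give the same value as libc's getpid(). Real wrappers also pass arguments in RDI, RSI, RDX, R10, R8 and R9, which this sketch leaves out.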
- syscall from 32-bit user-space: A totally different instruction from long mode syscall, only available on AMD CPUs. The kernel-side interface is different for 32-bit kernels (legacy mode) vs. 64-bit kernels running 32-bit user-space (compat mode).
The Linux kernel has some useful comments on it:
entry_64_compat.S, 32-bit SYSCALL entry (32-bit syscall entry point into a 64-bit kernel):
/* ...
* - Most programmers do not directly target AMD CPUs, and the 32-bit
* SYSCALL instruction does not exist on Intel CPUs. Even on AMD
* CPUs, Linux disables the SYSCALL instruction on 32-bit kernels
* because the SYSCALL instruction in legacy/native 32-bit mode (as
* opposed to compat mode) is sufficiently poorly designed as to be
* essentially unusable.
Maybe a toy OS could use it without worrying about whatever problems make it unsuitable for Linux, IDK. But unless you're just plain curious, don't waste your time with it. OTOH, if you're interested in OS & CPU design, finding out what's wrong with the ISA design might be interesting.
BTW, when AMD was designing AMD64, they got some feedback from Linux kernel devs on the amd64 mailing list that improved the design of 64-bit syscall (making the RFLAGS mask configurable), because their initial design would have been problematic for Linux. Links to those archived mailing list posts are in this answer.
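To make the configurable RFLAGS masking concrete, here is a small model in C of what 64-bit syscall does to the flags. It's a sketch, not kernel code: the struct and function names are invented for illustration, the mask choice in the usage note is just an example, and a real kernel would program the mask into the IA32_FMASK MSR with wrmsr rather than pass it as an argument.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural RFLAGS bit positions. */
#define RFLAGS_TF (UINT64_C(1) << 8)    /* trap flag (single-step) */
#define RFLAGS_IF (UINT64_C(1) << 9)    /* interrupt enable */
#define RFLAGS_DF (UINT64_C(1) << 10)   /* direction flag */

/* Hypothetical result type: the two places the flags end up. */
struct syscall_flags {
    uint64_t r11;     /* unmasked user RFLAGS, kept so it can be restored */
    uint64_t rflags;  /* what the kernel entry point actually runs with */
};

/* Model of 64-bit syscall's flag handling: R11 gets the user value,
 * then every RFLAGS bit that is set in the mask is cleared. */
static inline struct syscall_flags
model_syscall_rflags(uint64_t user_rflags, uint64_t fmask)
{
    struct syscall_flags out;
    assert((fmask >> 32) == 0);   /* only the low 32 mask bits are defined */
    out.r11 = user_rflags;
    out.rflags = user_rflags & ~fmask;
    return out;
}
```

With a mask of RFLAGS_IF | RFLAGS_DF | RFLAGS_TF, user flags of 0x602 arrive in the kernel as 0x2, with the original 0x602 preserved in R11. Because IF is cleared as part of the instruction, there is no window where the kernel runs with interrupts enabled before it has set up its stack.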
Recommendation: Use sysenter for your 32-bit kernel. It should be usable everywhere; AMD CPUs have supported it for years now. Ancient CPUs that don't support it can use the int 0x80 ABI (or whatever interrupt number you picked for your OS), if you want to add a second compatibility ABI.
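If you do want that fallback, the choice can be driven by CPUID. The sketch below assumes GCC or Clang (it uses their cpuid.h helper __get_cpuid); the function names are made up for illustration. It checks CPUID.01H:EDX bit 11 (SEP) for sysenter and CPUID.80000001H:EDX bit 11 for syscall.

```c
#include <assert.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header */
#endif

/* Extract one feature bit from a CPUID output register. */
static inline int edx_feature(unsigned int edx, unsigned int bit)
{
    assert(bit < 32);
    return (int)((edx >> bit) & 1u);
}

/* CPUID.01H:EDX[11] (SEP): sysenter / sysexit. */
static inline int cpu_has_sysenter(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* leaf not supported */
    return edx_feature(edx, 11);
#else
    return 0;                     /* not an x86 CPU */
#endif
}

/* CPUID.80000001H:EDX[11]: syscall / sysret. Intel CPUs are expected
 * to report this bit only while running in 64-bit mode. */
static inline int cpu_has_syscall(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return 0;
    return edx_feature(edx, 11);
#else
    return 0;
#endif
}
```

A kernel would run the same CPUID queries at boot (without the compiler header if it prefers) and install only the entry points the CPU supports, keeping int 0x80 as the lowest common denominator.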
The Linux kernel entry points are well commented and written fairly readably. While writing "What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code?", I had an easy time figuring out what was going on in the entry points into a 64-bit kernel using syscall (native 64-bit system calls), or int 0x80 or sysenter (32-bit system calls, normally from compat mode; int 0x80 is also supported for 64-bit processes, but it still invokes the 32-bit ABI!). There's a bunch of complicated stuff going on in case various kinds of tracing / debugging are enabled, but the other parts are fairly easy to follow. See that answer for a walk-through of some of Linux's system-call handling internals.
In arch/x86/entry, these are the main files of interest:
- entry_32.S: 32-bit kernel code for entry from user-space (legacy mode).
- entry_64_compat.S: 64-bit kernel code for entry from 32-bit user-space (compat mode -> long mode).
- entry_64.S: 64-bit kernel code for entry from 64-bit user-space (long mode -> long mode).
You should be able to find Linux's VDSO code for the user-space side of the sysenter dance that passes the kernel the values it needs to return to user-space. Related: What is better "int 0x80" or "syscall"?, and The Definitive Guide to Linux System Calls, which will give some useful info on the design choices Linux made.
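To see that exported page from a running process, here is a small sketch assuming Linux with glibc or musl (getauxval from sys/auxv.h). The helper names are invented for illustration. It only locates the VDSO image and checks that it looks like an ELF file; it doesn't parse symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__linux__)
#include <elf.h>
#include <sys/auxv.h>
#endif

/* Returns 1 if `p` points at something starting with the ELF magic. */
static inline int looks_like_elf(const void *p)
{
    return p != NULL && memcmp(p, "\177ELF", 4) == 0;
}

/* Address of the VDSO's ELF header in this process, or NULL if the
 * kernel didn't provide one (or this isn't Linux). */
static inline const void *find_vdso(void)
{
#if defined(__linux__) && defined(AT_SYSINFO_EHDR)
    return (const void *)(uintptr_t)getauxval(AT_SYSINFO_EHDR);
#else
    return NULL;
#endif
}
```

The VDSO is an ordinary ELF shared object mapped into every process, so standard ELF tools and parsers work on it. 32-bit processes are also handed an AT_SYSINFO auxiliary-vector entry, which is how libc finds the system-call entry stub without parsing the image.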
Is true that sysret instruction isn't safe?
Intel and AMD both have separate bugs with non-canonical RIP when returning to 64-bit user-space. E.g. for Intel, Linux's entry_64.S describes it this way:
/*
* On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
* in kernel space. This essentially lets the user take over
* the kernel, since userspace controls RSP.
That can happen if a ptrace system call (e.g. made by a debugger) changed the saved value of the process's RIP to a non-canonical address.
Linux checks whether it can use sysret, and if not, falls back to its iret return path. (The sysret path is fast enough that it's worth doing extra work to check that it's safe.)
Note that if a system call blocks / sleeps, the "master copy" of user-space's integer register state is on the task's kernel stack, where the system call entry point pushed it. (In Linux. Other designs are possible!) This is why it's possible to end up with weird saved state that user-space couldn't have had when it ran syscall (because it would have faulted on a jmp to a non-canonical address), or with saved_rcx != saved_RIP. (64-bit syscall sets RCX=RIP and R11=RFLAGS (before masking), so it clobbers RCX and R11 but allows the kernel to restore RIP and RFLAGS.)
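The kind of check described above can be sketched in C as follows. This is an illustration, not Linux's actual code: struct saved_regs and both function names are hypothetical, only the conditions mentioned in this answer are tested, and a real kernel checks more than this. The number of virtual-address bits is passed in as a parameter (48 with 4-level paging).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical saved user-space state, as pushed by a syscall entry point. */
struct saved_regs {
    uint64_t rip, rflags, rcx, r11;
};

/* An address is canonical if bits 63 down to (vaddr_bits - 1) are
 * all zeros or all ones, i.e. a sign extension of the top valid bit. */
static inline bool is_canonical(uint64_t addr, unsigned vaddr_bits)
{
    assert(vaddr_bits >= 2 && vaddr_bits <= 64);
    uint64_t upper = addr >> (vaddr_bits - 1);
    return upper == 0 || upper == (UINT64_MAX >> (vaddr_bits - 1));
}

/* sysret reloads RIP from RCX and RFLAGS from R11, so it can only
 * reproduce the saved state if those pairs match, and it is only
 * safe if the target RIP is canonical. Otherwise, use iret. */
static inline bool can_return_with_sysret(const struct saved_regs *r,
                                          unsigned vaddr_bits)
{
    return r->rcx == r->rip
        && r->r11 == r->rflags
        && is_canonical(r->rip, vaddr_bits);
}
```

For example, a saved RIP of 0x0000800000000000 is non-canonical with 48-bit addresses, so this check would send the return through the iret path even if RCX and R11 still match.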
I don't know how 32-bit syscall works; sorry, I got off topic here. But I suspect that what you may have read about sysret being unsafe was talking about 64-bit kernels.
IDK if there are any similar bugs in 32-bit-kernel sysret, or 64-bit-kernel sysret-to-compat-mode.