TL;DR: int 0x80 works when used correctly, as long as any pointers fit in 32 bits (stack pointers don't fit). But beware that strace decodes it wrong unless you have a very recent strace + kernel.
int 0x80
zeros r8-r11, and preserves everything else. Use it exactly like you would in 32-bit code, with the 32-bit call numbers. (Or better, don't use it!)
Not all systems even support int 0x80: version 1 of the Windows Subsystem for Linux (WSL1) is strictly 64-bit only, so int 0x80 doesn't work at all. It's also possible to build Linux kernels without IA-32 emulation. (No support for 32-bit executables, no support for 32-bit system calls.)
The details: what's saved/restored, which parts of which regs the kernel uses
int 0x80
uses eax
(not the full rax
) as the system-call number, dispatching to the same table of function-pointers that 32-bit user-space int 0x80
uses. (These pointers are to sys_whatever
implementations or wrappers for the native 64-bit implementation inside the kernel. System calls are really function calls across the user/kernel boundary.)
Only the low 32 bits of arg registers are passed. The upper halves of rbx
-rbp
are preserved, but ignored by int 0x80
system calls. Note that passing a bad pointer to a system call doesn't result in SIGSEGV; instead the system call returns -EFAULT
. If you don't check error return values (with a debugger or tracing tool), it will appear to silently fail.
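To make the "check error return values" advice concrete, here is a minimal C sketch of how a raw return value can be interpreted. It assumes the usual Linux raw system-call convention that a value in the range -4095..-1 is a negated errno code; the helper names (raw_is_error, raw_errno, report_raw_return) are made up for illustration and are not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Raw Linux system calls report failure as a value in [-4095, -1]. */
static int raw_is_error(long ret)
{
    return ret < 0 && ret >= -4095;
}

/* Extract the errno code. Only meaningful for a failed call. */
static int raw_errno(long ret)
{
    assert(raw_is_error(ret));
    return (int)-ret;
}

/* Hypothetical helper: print what a raw return value means. */
static void report_raw_return(const char *what, long ret)
{
    if (!raw_is_error(ret)) {
        printf("%s returned %ld\n", what, ret);
        return;
    }
    printf("%s failed with errno %d%s\n", what, raw_errno(ret),
           raw_errno(ret) == EFAULT ? " (EFAULT: bad pointer)" : "");
}
```

Calling report_raw_return("write", ret) right after the system call would surface a truncated pointer as an EFAULT message instead of a silent failure.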
All registers (except rax, of course, which holds the return value) are saved/restored (including RFLAGS, and the upper 32 bits of integer regs), except that r8-r11 are zeroed. r12-r15
are call-preserved in the x86-64 SysV ABI's function calling convention, so the registers that get zeroed by int 0x80
in 64-bit are the call-clobbered subset of the "new" registers that AMD64 added.
This behaviour has been preserved over some internal changes to how register-saving was implemented inside the kernel, and comments in the kernel mention that it's usable from 64-bit, so this ABI is probably stable. (I.e. you can count on r8-r11 being zeroed, and everything else being preserved.)
The return value is sign-extended to fill 64-bit rax
. (Linux declares 32-bit sys_ functions as returning signed long
.) This means that pointer return values (like from void *mmap()
) need to be zero-extended before use in 64-bit addressing modes.
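The sign-extension issue can be modelled in plain C without executing int 0x80 at all. The sketch below is only an illustration: rax_after_int80 and pointer_from_int80 are hypothetical names, and the example address 0xf7ff0000 is an assumed value chosen because it has bit 31 set.

```c
#include <assert.h>
#include <stdint.h>

/* Model of what ends up in rax: the 32-bit result, sign-extended to 64 bits.
 * (The unsigned-to-signed conversion wraps on GCC and Clang.) */
static uint64_t rax_after_int80(uint32_t ret32)
{
    return (uint64_t)(int64_t)(int32_t)ret32;
}

/* Recover a usable pointer from rax by zero-extending the low 32 bits.
 * The caller is expected to have ruled out an error return first. */
static void *pointer_from_int80(uint64_t rax)
{
    uint32_t low = (uint32_t)rax;
    assert(low < 0xfffff001u);  /* -4095..-1 would be an errno code */
    return (void *)(uintptr_t)low;
}
```

In asm the same fix is a single mov eax, eax after the system call, since writing a 32-bit register zero-extends into the full 64-bit register.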
Unlike sysenter
, it preserves the original value of cs
, so it returns to user-space in the same mode that it was called in. (Using sysenter
results in the kernel setting cs
to $__USER32_CS
, which selects a descriptor for a 32-bit code segment.)
Older strace
decodes int 0x80
incorrectly for 64-bit processes. It decodes as if the process had used syscall
instead of int 0x80
. This can be very confusing. e.g. strace
prints write(0, NULL, 12 <unfinished ... exit status 1>
for eax=1
/ int $0x80
, which is actually _exit(ebx)
, not write(rdi, rsi, rdx)
.
I don't know the exact version where the PTRACE_GET_SYSCALL_INFO
feature was added, but Linux kernel 5.5 / strace 5.5 handle it. It misleadingly says the process "runs in 32-bit mode" but does decode correctly. (Example).
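The mis-decoding comes down to the same number meaning different calls in the two tables. As a rough illustration, here is a C sketch with a small hand-copied excerpt of each ABI's call numbers; the struct and function names are invented for this example, and the tables are deliberately incomplete.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct abi_entry { int nr; const char *name; };

/* Excerpt of the 32-bit (int 0x80) call numbers. */
static const struct abi_entry int80_calls[] = {
    {1, "exit"}, {3, "read"}, {4, "write"}, {5, "open"}, {6, "close"},
};

/* Excerpt of the native 64-bit (syscall) call numbers. */
static const struct abi_entry syscall_calls[] = {
    {0, "read"}, {1, "write"}, {2, "open"}, {3, "close"}, {60, "exit"},
};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Returns the call name, or NULL if nr is not in the excerpt. */
static const char *lookup(const struct abi_entry *table, size_t n, int nr)
{
    assert(table != NULL);
    for (size_t i = 0; i < n; i++)
        if (table[i].nr == nr)
            return table[i].name;
    return NULL;
}

static const char *name_via_int80(int eax)
{
    return lookup(int80_calls, COUNT(int80_calls), eax);
}

static const char *name_via_syscall(int rax)
{
    return lookup(syscall_calls, COUNT(syscall_calls), rax);
}

/* True only if both excerpts list nr and agree on its meaning. */
static int decodes_the_same(int nr)
{
    const char *a = name_via_int80(nr);
    const char *b = name_via_syscall(nr);
    return a && b && strcmp(a, b) == 0;
}
```

With eax=1, name_via_int80 gives exit while name_via_syscall gives write, which is the confusion an older strace shows. The argument registers differ as well (ebx for the 32-bit ABI vs. rdi, rsi, rdx for the native one), which is why the printed arguments look unrelated too.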
int 0x80
works as long as all arguments (including pointers) fit in the low 32 of a register. This is the case for static code and data in the default code model ("small") in the x86-64 SysV ABI. (Section 3.5.1
: all symbols are known to be located in the virtual addresses in the range 0x00000000
to 0x7effffff
, so you can do stuff like mov edi, hello
(AT&T mov $hello, %edi
) to get a pointer into a register with a 5 byte instruction).
But this is not the case for position-independent executables, which many Linux distros now configure gcc
to make by default (and they enable ASLR for executables). For example, I compiled a hello.c
on Arch Linux, and set a breakpoint at the start of main. The string constant passed to puts
was at 0x555555554724
, so a 32-bit ABI write
system call would not work. (GDB disables ASLR by default, so you always see the same address from run to run, if you run from within GDB.)
Linux puts the stack near the "gap" between the upper and lower ranges of canonical addresses, i.e. with the top of the stack at 2^47-1. (Or somewhere random, with ASLR enabled.) So rsp
on entry to _start
in a typical statically-linked executable is something like 0x7fffffffe550
, depending on size of env vars and args. Truncating this pointer to esp
does not point to any valid memory, so system calls with pointer inputs will typically return -EFAULT
if you try to pass a truncated stack pointer. (And your program will crash if you truncate rsp
to esp
and then do anything with the stack, e.g. if you built 32-bit asm source as a 64-bit executable.)
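The truncation described above is easy to check numerically. This is a small C sketch using the two addresses quoted in the text (the PIE string address and the typical stack pointer); the function names are hypothetical, and nothing here actually issues a system call.

```c
#include <assert.h>
#include <stdint.h>

/* What a 32-bit ABI system call sees: only the low half of the register. */
static uint32_t seen_by_int80(uint64_t reg)
{
    return (uint32_t)reg;
}

/* True if the value survives truncation unchanged. */
static int survives_int80(uint64_t reg)
{
    return seen_by_int80(reg) == reg;
}

/* Hypothetical guard: refuse to pass a value that would be truncated. */
static uint32_t checked_int80_arg(uint64_t reg)
{
    assert(survives_int80(reg));
    return (uint32_t)reg;
}
```

For the stack pointer 0x7fffffffe550, seen_by_int80 yields 0xffffe550, and for the PIE address 0x555555554724 it yields 0x55554724. Neither matches the original, whereas any symbol address in the small code model's 0x00000000 to 0x7effffff range passes the check.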
How it works in the kernel:
In the Linux source code, arch/x86/entry/entry_64_compat.S
defines
ENTRY(entry_INT80_compat)
. Both 32 and 64-bit processes use the same entry point when they execute int 0x80
.
entry_64.S
defines native entry points for a 64-bit kernel, which include interrupt / fault handlers and syscall
native system calls from long mode (aka 64-bit mode) processes.
entry_64_compat.S
defines system-call entry-points from compat mode into a 64-bit kernel, plus the special case of int 0x80
in a 64-bit process. (sysenter
in a 64-bit process may go to that entry point as well, but it pushes $__USER32_CS
, so it will always return in 32-bit mode.) There's a 32-bit version of the syscall
instruction, supported on AMD CPUs, and Linux supports it too for fast 32-bit system calls from 32-bit processes.
I guess a possible use-case for int 0x80
in 64-bit mode is if you wanted to use a custom code-segment descriptor that you installed with modify_ldt
. int 0x80
pushes segment registers itself for use with iret
, and Linux always returns from int 0x80
system calls via iret
. The 64-bit syscall
entry point sets pt_regs->cs
and ->ss
to constants, __USER_CS
and __USER_DS
. (It's normal that SS and DS use the same segment descriptors. Permission differences are done with paging, not segmentation.)
entry_32.S
defines entry points into a 32-bit kernel, and is not involved at all.
The int 0x80
entry point in Linux 4.12's entry_64_compat.S
:
/*
* 32-bit legacy system call entry.
*
* 32-bit x86 Linux system calls traditionally used the INT $0x80
* instruction. INT $0x80 lands here.
*
* This entry point can be used by 32-bit and 64-bit programs to perform
* 32-bit system calls. Instances of INT $0x80 can be found inline in
* various programs and libraries. It is also used by the vDSO's
* __kernel_vsyscall fallback for hardware that doesn't support a faster
* entry method. Restarted 32-bit system calls also fall back to INT
* $0x80 regardless of what instruction was originally used to do the
* system call.
*
* This is considered a slow path. It is not used by most libc
* implementations on modern hardware except during process startup.
...
*/
ENTRY(entry_INT80_compat)
... (see the github URL for the full source)
The code zero-extends eax into rax, then pushes all the registers onto the kernel stack to form a struct pt_regs
. This is where it will restore from when the system call returns. It's in a standard layout for saved user-space registers (for any entry point), so ptrace
from another process (like gdb or strace
) will read and/or write that memory if they use ptrace
while this process is inside a system call. (ptrace
modification of registers is one thing that makes return paths complicated for the other entry points. See comments.)
But it pushes $0
instead of r8/r9/r10/r11. (sysenter
and AMD syscall32
entry points store zeros for r8-r15.)
I think this zeroing of r8-r11 is to match historical behaviour. Before the "Set up full pt_regs for all compat syscalls" commit, the entry point only saved the C call-clobbered registers. It dispatched directly from asm with call *ia32_sys_call_table(, %rax, 8)
, and those functions follow the calling convention, so they preserve rbx
, rbp
, rsp
, and r12-r15
. Zeroing r8-r11
instead of leaving them undefined was probably a way to avoid info-leaks from the kernel. IDK how it handled ptrace
if the only copy of user-space's call-preserved registers was on the kernel stack where a C function saved them. I doubt it used stack-unwinding metadata to find them there.
The current implementation (Linux 4.12) dispatches 32-bit-ABI system calls from C, reloading the saved ebx
, ecx
, etc. from pt_regs
. (64-bit native system calls dispatch directly from asm, <a href="https