🚀 Update: This patch has been applied to the RISC-V Linux kernel, which is my first one! Check out the details in the commit page.
Welcome to this demo of a kernel bug affecting ptrace in the Linux kernel <= 6.6 on RISC-V, which results in behavior that differs from x86 and arm64. This README walks through the reproduction and analysis of the issue, in which a blocking syscall such as `read()` doesn't restart properly after being interrupted by ptrace.
Follow the steps below to replicate this issue:
- Initialize pipes and use `vfork()` and `execl()` to start a victim process. This process runs a few simple system calls: `read()`, `write()`, and `getsid()`.
- In the parent process, execute `PTRACE_SEIZE` followed by `PTRACE_INTERRUPT` to pause the victim process (you can find this in the `interrupt_task()` subfunction).
- Inject `ebreak` (`brk` on arm64 and `int3` on x86_64) and restart the victim process. This includes several steps:
  - Back up the original registers and the instruction at `new_pc` using `PTRACE_GETREGSET` and `PTRACE_PEEKDATA`.
  - Set `pc` to `new_pc` and modify some other registers to disrupt the system call restart condition. Inject an `ebreak` instruction at `new_pc` using `PTRACE_SETREGSET` and `PTRACE_POKEDATA`.
  - Execute `ebreak` in the victim with `PTRACE_CONT`, which should then give control back to the tracer.
  - Wait for the victim to stop again after executing `ebreak`.
  - Restore `pc`, the other registers, and the instruction at `new_pc` using `PTRACE_SETREGSET` and `PTRACE_POKEDATA`.
- Resume the victim process using `PTRACE_DETACH` (you can find this in `resume_task()`).
- Kick the victim process again. You should now see a difference in behavior: on RISC-V, the `read()` syscall fails with errno 512 (`ERESTARTSYS`), which should never appear in user space. In contrast, on x86_64 and arm64, the `read()` system call restarts and finishes successfully.
The crux of this issue lies in `arch/${arch}/kernel/signal.c`, in the functions `arch_do_signal_or_restart()` (RISC-V and x86) and `do_signal()` (arm64). These functions serve the same purpose, but their names differ across architectures.
In the RISC-V implementation:
- The tracee is initially blocked in `syscall_handler()`. When `PTRACE_INTERRUPT` is issued, the syscall returns with `a0 == -512` and the tracee passes through `syscall_exit_to_user_mode()`, `__syscall_exit_to_user_mode_work()`, `exit_to_user_mode_prepare()`, `exit_to_user_mode_loop()`, and finally `arch_do_signal_or_restart()`.
- It first enters `get_signal()`. Because of the state set up by `PTRACE_INTERRUPT` (primarily the `JOBCTL_TRAP_MASK` flag), it goes through `do_jobctl_trap()`, `ptrace_do_notify()`, and finally halts in `ptrace_stop()`, allowing the tracer to inspect and manipulate it.
- Here, the tracer performs step 3 of the reproduction process and checkpoints all user-space registers, mainly `a0`, `a7`, and `pc`. However, the `cause` register is supervisor-level state and cannot be accessed via ptrace. As a result of the `ebreak`, `cause` changes from `EXC_SYSCALL` to `EXC_BREAKPOINT`. The tracee halts again after executing `ebreak`, at which point further examination or actions may take place. In this case, we simply restore everything we saved and resume the tracee using `PTRACE_DETACH`.
- The change to `cause` breaks the system call restart condition. When the tracee resumes, the restart logic in `arch_do_signal_or_restart()` is bypassed, so `a0` keeps the value `-ERESTARTSYS` and user space sees a failed `read()` with errno set to `ERESTARTSYS`. According to `include/linux/errno.h`, this value should never be seen by user programs.
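The failure described above can be reproduced in a small user-space model. The sketch below is not kernel code: `struct toy_regs`, `old_no_signal_path()`, and `tracer_injects_ebreak()` are hypothetical names, and the function only mimics the pre-patch ordering for the case where no signal handler runs (stop for the tracer first, then check `cause`). The numeric constants are copied from the kernel headers.

```c
#include <stddef.h>

/* Values copied from kernel headers; this is a user-space model only. */
enum {
	EXC_BREAKPOINT = 3,		/* arch/riscv/include/asm/csr.h */
	EXC_SYSCALL = 8,
	ERESTARTSYS = 512,		/* include/linux/errno.h */
	ERESTARTNOINTR = 513,
	ERESTARTNOHAND = 514,
	ERESTART_RESTARTBLOCK = 516,
	NR_RESTART_SYSCALL = 128,	/* __NR_restart_syscall (asm-generic) */
};

/* Hypothetical stand-in for the few pt_regs fields that matter here. */
struct toy_regs {
	unsigned long cause;
	long a0, orig_a0, a7;
	unsigned long epc;
};

/*
 * Models the tracer's ebreak round trip: a0, a7 and pc are saved and
 * restored through ptrace, so only the unreachable cause field changes.
 */
static void tracer_injects_ebreak(struct toy_regs *regs)
{
	regs->cause = EXC_BREAKPOINT;
}

/* Pre-patch ordering: the ptrace stop happens before the restart check. */
static void old_no_signal_path(struct toy_regs *regs,
			       void (*tracer)(struct toy_regs *))
{
	if (tracer)
		tracer(regs);	/* stands in for get_signal() -> ptrace_stop() */

	if (regs->cause == EXC_SYSCALL) {
		regs->cause = -1UL;
		switch (regs->a0) {
		case -ERESTARTNOHAND:
		case -ERESTARTSYS:
		case -ERESTARTNOINTR:
			regs->a0 = regs->orig_a0;
			regs->epc -= 4;
			break;
		case -ERESTART_RESTARTBLOCK:
			regs->a0 = regs->orig_a0;
			regs->a7 = NR_RESTART_SYSCALL;
			regs->epc -= 4;
			break;
		}
	}
}
```

With no tracer, a register set holding `a0 == -ERESTARTSYS` has `a0` reset to `orig_a0` and `epc` rewound by 4. With `tracer_injects_ebreak` passed in, the `cause` check fails and `a0` stays at `-ERESTARTSYS`, which is the value that leaks to user space.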
Let's contrast this with the x86 and arm64 architectures, where the system call can be restarted correctly:
- On x86, the syscall restart condition is evaluated using `regs->orig_ax != -1`, where `orig_ax` is exposed to user space and is checkpointed and restored via ptrace. Therefore, the syscall restart condition remains intact.
- Arm64 operates differently. The syscall restart condition is evaluated using `regs->syscallno != NO_SYSCALL`, which is also kernel-only state. However, arm64 structures `do_signal()` differently: it prepares the syscall restart before `get_signal()`, then reverts that decision after `get_signal()` if restarting turns out to be inappropriate. This design lets the restart take effect before the tracee is trapped and modified in `ptrace_stop()`.
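The arm64-style ordering can be illustrated with a user-space model. This is a sketch under assumptions, not kernel code: `struct toy_regs`, `enum toy_signal`, `restart_first()`, and `tracer_injects_ebreak()` are hypothetical names, and the signal outcome is passed in as a parameter instead of coming from `get_signal()`. The numeric constants are copied from the kernel headers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Values copied from kernel headers; this is a user-space model only. */
enum {
	EXC_BREAKPOINT = 3, EXC_SYSCALL = 8,
	TOY_EINTR = 4,
	ERESTARTSYS = 512, ERESTARTNOINTR = 513,
	ERESTARTNOHAND = 514, ERESTART_RESTARTBLOCK = 516,
	NR_RESTART_SYSCALL = 128,	/* __NR_restart_syscall (asm-generic) */
};

/* Hypothetical stand-in for what get_signal() would report. */
enum toy_signal { NO_SIGNAL, HANDLER_SA_RESTART, HANDLER_NO_SA_RESTART };

struct toy_regs {
	unsigned long cause;
	long a0, orig_a0, a7;
	unsigned long epc;
};

/* The tracer restores a0, a7 and pc, but cannot restore cause. */
static void tracer_injects_ebreak(struct toy_regs *regs)
{
	regs->cause = EXC_BREAKPOINT;
}

/* Restart is prepared first, then the tracer may stop the task. */
static void restart_first(struct toy_regs *regs,
			  void (*tracer)(struct toy_regs *),
			  enum toy_signal sig)
{
	unsigned long continue_addr = 0, restart_addr = 0;
	long retval = 0;
	bool syscall = (regs->cause == EXC_SYSCALL);

	if (syscall) {
		continue_addr = regs->epc;
		restart_addr = continue_addr - 4;
		retval = regs->a0;
		regs->cause = -1UL;
		switch (retval) {
		case -ERESTARTNOHAND:
		case -ERESTARTSYS:
		case -ERESTARTNOINTR:
		case -ERESTART_RESTARTBLOCK:
			regs->a0 = regs->orig_a0;
			regs->epc = restart_addr;
			break;
		}
	}

	if (tracer)
		tracer(regs);	/* stands in for get_signal() -> ptrace_stop() */

	if (sig != NO_SIGNAL) {
		/* Revert the restart if the handler's settings forbid it. */
		if (regs->epc == restart_addr &&
		    (retval == -ERESTARTNOHAND ||
		     retval == -ERESTART_RESTARTBLOCK ||
		     (retval == -ERESTARTSYS && sig == HANDLER_NO_SA_RESTART))) {
			regs->a0 = -TOY_EINTR;
			regs->epc = continue_addr;
		}
		return;
	}

	if (syscall && regs->epc == restart_addr &&
	    retval == -ERESTART_RESTARTBLOCK)
		regs->a7 = NR_RESTART_SYSCALL;
}
```

Because the restart decision is recorded in `a0` and `epc` before the tracer runs, the tracer checkpoints and restores the already-restarted state, and a later change to `cause` no longer matters.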
Link: https://patchwork.kernel.org/project/linux-riscv/patch/[email protected]/
From 3ba4e3f69597d38b83b60943c3d35f892364d878 Mon Sep 17 00:00:00 2001
From: Haorong Lu <[email protected]>
Date: Thu, 3 Aug 2023 14:51:00 -0700
Subject: [PATCH] riscv: signal: handle syscall restart before get_signal
In the current riscv implementation, blocking syscalls like read() may
not correctly restart after being interrupted by ptrace. This problem
arises when the syscall restart process in arch_do_signal_or_restart()
is bypassed due to changes to the regs->cause register, such as an
ebreak instruction.
Steps to reproduce:
1. Interrupt the tracee process with PTRACE_SEIZE & PTRACE_INTERRUPT.
2. Backup original registers and instruction at new_pc.
3. Change pc to new_pc, and inject an instruction (like ebreak) to this
address.
4. Resume with PTRACE_CONT and wait for the process to stop again after
executing ebreak.
5. Restore original registers and instructions, and detach from the
tracee process.
6. Now the read() syscall in tracee will return -1 with errno set to
ERESTARTSYS.
Specifically, during an interrupt, the regs->cause changes from
EXC_SYSCALL to EXC_BREAKPOINT due to the injected ebreak, which is
inaccessible via ptrace so we cannot restore it. This alteration breaks
the syscall restart condition and ends the read() syscall with an
ERESTARTSYS error. According to include/linux/errno.h, it should never
be seen by user programs. X86 can avoid this issue as it checks the
syscall condition using a register (orig_ax) exposed to user space.
Arm64 handles syscall restart before calling get_signal, where it could
be paused and inspected by ptrace/debugger.
This patch adjusts the riscv implementation to arm64 style, which also
checks syscall using a kernel register (syscallno). It ensures the
syscall restart process is not bypassed when changes to the cause
register occur, providing more consistent behavior across various
architectures.
For a simplified reproduction program, feel free to visit:
https://github.com/ancientmodern/riscv-ptrace-bug-demo.
Signed-off-by: Haorong Lu <[email protected]>
---
arch/riscv/kernel/signal.c | 85 +++++++++++++++++++++-----------------
1 file changed, 46 insertions(+), 39 deletions(-)
diff --git a/arch/riscv/kernel/signal.c b/arch/riscv/kernel/signal.c
index 180d951d3624..d2d7169048ea 100644
--- a/arch/riscv/kernel/signal.c
+++ b/arch/riscv/kernel/signal.c
@@ -391,30 +391,6 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 	sigset_t *oldset = sigmask_to_save();
 	int ret;
-	/* Are we from a system call? */
-	if (regs->cause == EXC_SYSCALL) {
-		/* Avoid additional syscall restarting via ret_from_exception */
-		regs->cause = -1UL;
-		/* If so, check system call restarting.. */
-		switch (regs->a0) {
-		case -ERESTART_RESTARTBLOCK:
-		case -ERESTARTNOHAND:
-			regs->a0 = -EINTR;
-			break;
-
-		case -ERESTARTSYS:
-			if (!(ksig->ka.sa.sa_flags & SA_RESTART)) {
-				regs->a0 = -EINTR;
-				break;
-			}
-			fallthrough;
-		case -ERESTARTNOINTR:
-			regs->a0 = regs->orig_a0;
-			regs->epc -= 0x4;
-			break;
-		}
-	}
-
 	rseq_signal_deliver(ksig, regs);
 	/* Set up the stack frame */
@@ -428,35 +404,66 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 void arch_do_signal_or_restart(struct pt_regs *regs)
 {
+	unsigned long continue_addr = 0, restart_addr = 0;
+	int retval = 0;
 	struct ksignal ksig;
+	bool syscall = (regs->cause == EXC_SYSCALL);
-	if (get_signal(&ksig)) {
-		/* Actually deliver the signal */
-		handle_signal(&ksig, regs);
-		return;
-	}
+	/* If we were from a system call, check for system call restarting */
+	if (syscall) {
+		continue_addr = regs->epc;
+		restart_addr = continue_addr - 4;
+		retval = regs->a0;
-	/* Did we come from a system call? */
-	if (regs->cause == EXC_SYSCALL) {
 		/* Avoid additional syscall restarting via ret_from_exception */
 		regs->cause = -1UL;
-		/* Restart the system call - no handlers present */
-		switch (regs->a0) {
+		/*
+		 * Prepare for system call restart. We do this here so that a
+		 * debugger will see the already changed PC.
+		 */
+		switch (retval) {
 		case -ERESTARTNOHAND:
 		case -ERESTARTSYS:
 		case -ERESTARTNOINTR:
-			regs->a0 = regs->orig_a0;
-			regs->epc -= 0x4;
-			break;
 		case -ERESTART_RESTARTBLOCK:
-			regs->a0 = regs->orig_a0;
-			regs->a7 = __NR_restart_syscall;
-			regs->epc -= 0x4;
+			regs->a0 = regs->orig_a0;
+			regs->epc = restart_addr;
 			break;
 		}
 	}
+	/*
+	 * Get the signal to deliver. When running under ptrace, at this point
+	 * the debugger may change all of our registers.
+	 */
+	if (get_signal(&ksig)) {
+		/*
+		 * Depending on the signal settings, we may need to revert the
+		 * decision to restart the system call, but skip this if a
+		 * debugger has chosen to restart at a different PC.
+		 */
+		if (regs->epc == restart_addr &&
+		    (retval == -ERESTARTNOHAND ||
+		     retval == -ERESTART_RESTARTBLOCK ||
+		     (retval == -ERESTARTSYS &&
+		      !(ksig.ka.sa.sa_flags & SA_RESTART)))) {
+			regs->a0 = -EINTR;
+			regs->epc = continue_addr;
+		}
+
+		/* Actually deliver the signal */
+		handle_signal(&ksig, regs);
+		return;
+	}
+
+	/*
+	 * Handle restarting a different system call. As above, if a debugger
+	 * has chosen to restart at a different PC, ignore the restart.
+	 */
+	if (syscall && regs->epc == restart_addr && retval == -ERESTART_RESTARTBLOCK)
+		regs->a7 = __NR_restart_syscall;
+
 	/*
 	 * If there is no signal to deliver, we just put the saved
 	 * sigmask back.
--
2.41.0