Thread takeover on attach fails b/c SIGUSR2 is blocked: switch to another signal? #5458

Closed
derekbruening opened this issue Apr 11, 2022 · 2 comments · Fixed by #5461

@derekbruening

Attaching (via ptrace #38, or for statically-linked DR) to a process that has masked most non-fatal signals fails to take over the app's remaining threads. We could try to use ptrace to take them over, but that is difficult for the static-link case. Alternatively, we could switch from SIGUSR2 to a signal less likely to be masked, such as SIGFPE. We would distinguish our signal from a synchronous one by looking at si->code (which is set as far back as the 2.2 kernel; some other siginfo fields were unreliable back then, but not this one) and other fields.
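A minimal sketch of the si_code check, assuming Linux conventions: signals sent with kill, tgkill or sigqueue carry si_code <= 0 (SI_USER, SI_TKILL, SI_QUEUE, ...), while hardware faults carry a positive, signal-specific code such as FPE_INTDIV. The helpers `is_user_sent`, `is_self_sent`, `handler` and `install_handler` are hypothetical names, not DR's code, and the sign convention does not hold on macOS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Linux-only convention: user-sent signals have si_code <= 0, while
 * kernel-generated faults have positive codes like FPE_INTDIV. */
static bool
is_user_sent(const siginfo_t *si)
{
    return si->si_code <= 0;
}

/* One of the "other fields": for kill/tgkill the kernel records the
 * sender's pid, so a request from our own process can be recognized. */
static bool
is_self_sent(const siginfo_t *si)
{
    return (si->si_code == SI_USER || si->si_code == SI_TKILL) &&
        si->si_pid == getpid();
}

static volatile sig_atomic_t saw_self_sent = 0;

static void
handler(int sig, siginfo_t *si, void *ucxt)
{
    (void)sig;
    (void)ucxt;
    /* An asynchronous suspend request can be handled and returned from;
     * a genuine fault would instead be passed on to the app's handler. */
    if (is_user_sent(si) && is_self_sent(si))
        saw_self_sent = 1;
}

static void
install_handler(int sig)
{
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    act.sa_sigaction = handler;
    act.sa_flags = SA_SIGINFO; /* needed to receive the siginfo_t */
    sigemptyset(&act.sa_mask);
    sigaction(sig, &act, NULL);
}
```

Checking the sender's pid as well as the code is one way to avoid treating an unrelated kill from outside the process as a suspend request.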

For example, this is hit when attaching to mysqld, which blocks all non-fatal signals: the ptrace attach succeeds, but DR's takeover then times out and fails.
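Why the takeover times out can be seen in a small single-threaded demo, assuming POSIX semantics: a signal sent to a thread that has it blocked is not delivered but stays pending, so the handler never runs while the mask is in place. `blocked_signal_stays_pending` and `on_signal` are hypothetical names used only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>

static volatile sig_atomic_t handled = 0;

static void
on_signal(int sig)
{
    (void)sig;
    handled = 1;
}

/* Blocks `sig` (as mysqld does for its non-fatal signals), sends it, and
 * reports whether it sat pending instead of running the handler. */
static bool
blocked_signal_stays_pending(int sig)
{
    struct sigaction act;
    sigset_t block, old, pending;
    bool stuck;

    memset(&act, 0, sizeof(act));
    act.sa_handler = on_signal;
    sigemptyset(&act.sa_mask);
    sigaction(sig, &act, NULL);

    /* Single-threaded demo, so sigprocmask stands in for pthread_sigmask. */
    sigemptyset(&block);
    sigaddset(&block, sig);
    sigprocmask(SIG_BLOCK, &block, &old);

    handled = 0;
    raise(sig); /* stands in for the suspend request sent at attach */
    sigpending(&pending);
    stuck = !handled && sigismember(&pending, sig) == 1;

    /* Restoring the old mask delivers the pending signal before the call
     * returns, so the handler runs only now. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return stuck;
}
```

An app that never unblocks the signal leaves the request pending indefinitely, which from the attaching side looks like a thread that never responds.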

@derekbruening derekbruening self-assigned this Apr 11, 2022
derekbruening added a commit that referenced this issue Apr 12, 2022
Changes the signal that DR uses to suspend a thread from SIGUSR2,
which is sometimes blocked by the app at attach time, to SIGFPE, which
as a fatal normally-synchronous signal is less likely to be blocked.

Manually tested on an attach to mysqld which failed with SIGUSR2 but
succeeds now.

Fixes #5458
@derekbruening

Unfortunately, it looks like QEMU does not handle DR sending SIGFPE via SYS_kill: it just crashes right up front.

  • SIGSEGV and SIGBUS are harder because DR watches those signals to detect crashes.
  • SIGABRT is already used asynchronously, so it is harder to distinguish from
    the app's own uses; but that's true of SIGUSR2 too, so it's a candidate.
    I see some suspect code looking for it in core/unix/os.c.
  • SIGILL is already used for nudges and at init time.
  • SIGTRAP??

Could try:

  • SIGSTKFLT ("Stack fault on coprocessor (unused)"): not there on Mac.

Maybe the best thing is to change it from a compile-time constant to a
runtime variable controlled by an option so it can be adjusted for
different circumstances.

@derekbruening

I think it is a little too complex to have a runtime option-controlled signal with all the constraints: going to go with hardcoded SIGSTKFLT on Linux and SIGFPE on Mac for now.
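A rough sketch of what a hardcoded per-platform choice could look like. `SUSPEND_SIGNAL`, `suspend_handler`, `install_suspend_handler` and `send_suspend_request` are hypothetical names, not DR's actual identifiers, and sending via tgkill is an assumption for the Linux side. The `defined(SIGSTKFLT)` guard is there because not every Linux architecture defines that signal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical name: the signal used to ask an app thread to suspend. */
#if defined(__linux__) && defined(SIGSTKFLT)
#    define SUSPEND_SIGNAL SIGSTKFLT /* described as unused in signal(7) */
#elif defined(__APPLE__)
#    define SUSPEND_SIGNAL SIGFPE /* macOS has no SIGSTKFLT */
#else
#    error "no suspend signal chosen for this platform"
#endif

static volatile sig_atomic_t suspend_requested = 0;

static void
suspend_handler(int sig)
{
    (void)sig;
    /* A real takeover would record the thread's state and wait here. */
    suspend_requested = 1;
}

/* Both candidate signals terminate the process by default, so the handler
 * must be installed before any request is sent. */
static int
install_suspend_handler(void)
{
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    act.sa_handler = suspend_handler;
    sigemptyset(&act.sa_mask);
    return sigaction(SUSPEND_SIGNAL, &act, NULL);
}

#if defined(__linux__)
/* Directs the request at one specific thread of this process. */
static int
send_suspend_request(pid_t tid)
{
    return (int)syscall(SYS_tgkill, getpid(), tid, SUSPEND_SIGNAL);
}
#endif
```

Keeping the choice in one compile-time constant means a later switch to a runtime option would only need to replace the macro.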

derekbruening added a commit that referenced this issue Apr 13, 2022
Changes the signal that DR uses to suspend a thread from SIGUSR2,
which is sometimes blocked by the app at attach time, to SIGSTKFLT on
Linux and SIGFPE on Mac.  (SIGFPE was the first choice on Linux but
QEMU crashes when we use it.)  These are fatal normally-synchronous
signals and so are less likely to be blocked.

Manually tested on an attach to mysqld which failed with SIGUSR2 but
succeeds now.

Fixes #5458