Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translation problems in clean call mangling causing post-detach crashes #4219

Open
derekbruening opened this issue Mar 25, 2020 · 1 comment
Assignees

Comments

@derekbruening
Copy link
Contributor

While testing my new drwrap post-call method for #4197 I wrote a drwrap detach test and ended up hitting a bunch of failures running it in a loop locally.

The visible symptoms are that right after detach we get the "Cannot correctly handle a received signal.>" fatal error, because a SIGSEGV arrived in a now-native thread for which we have no setup to deliver a signal as we're nearly all the way shut down.

The translate code for clean call mangling is only thinking about synchronous faults, which would only happen on argument setup. At that point, there is indeed a clean app state stored. However, asynchronous translation, such as for detach, could happen anywhere. Plus, there are clean call optimizations that do not store all the state. There is also clean call inlining. It does not look like the translation code is properly considering all of that.

It will take some effort to go through and support all that complexity. As a first step I'm planning to just refuse to translate at all when in clean call mangling. If I put in a change to just refuse to translate in any clean call mangling, it fixes the failures in the drwrap detach test. Although since the args could have a synchronous fault we'll have to translate there, so it will take a little more effort.

I also want to put in some stats on how many asynch translations are refused, to help inform whether we could speed up detach by supporting more translation points.

@derekbruening derekbruening self-assigned this Mar 25, 2020
derekbruening added a commit that referenced this issue Mar 25, 2020
Adds the signal number and thread id to this fatal message for better diagnostics:
  <Cannot correctly handle received signal 11 in thread 204511.>

Tested in the #4219 failures.
derekbruening added a commit that referenced this issue Mar 25, 2020
Adds the signal number and thread id to this fatal message for better diagnostics:
  <Cannot correctly handle received signal 11 in thread 204511.>

Tested in the #4219 failures.
derekbruening added a commit that referenced this issue Mar 25, 2020
Adds an alternative scheme for achieving a post-call control point
that does not require flushing or shared data structure examination
per-call: replacing the return address with a sentinel.

When the new flag DRWRAP_REPLACE_RETADDR is set, the return address is
replaced with the address of a single return instruction in the client
library, with the real address saved.  When a block is seen consisting
of that sentinel instruction, post-call callbacks are called, and then
control is sent to the saved real address using
dr_redirect_native_target().

Adds wrapping tests to drwrap-test.

This new scheme requires restoring return addresses on the stack on
detach or other state translation.  Adds functionality to do so, along
with a new test client.drwrap-test-detach.

This requires the client's state restoration event be called for
addresses not in the code cache.  Adds such a call.

Adds comments about translation problems with clean call mangling
which is filed as i#4219.  The issues seen here are all limited to
traces, so the test works around the problems with -disable_traces.

Tested the core drwrap behavior on ARM and AArch64 but missing general
detach support there (#1578) prevents enabling the detach test there.

Issue: #4219
Fixes #4197
derekbruening added a commit that referenced this issue Mar 26, 2020
Adds an alternative scheme for achieving a post-call control point
that does not require flushing or shared data structure examination
per-call: replacing the return address with a sentinel.

When the new flag DRWRAP_REPLACE_RETADDR is set, the return address is
replaced with the address of a single return instruction in the client
library, with the real address saved.  When a block is seen consisting
of that sentinel instruction, post-call callbacks are called, and then
control is sent to the saved real address using
dr_redirect_native_target().

Adds wrapping tests to drwrap-test.

This new scheme requires restoring return addresses on the stack on
detach or other state translation.  Adds functionality to do so, along
with a new test client.drwrap-test-detach.

This requires the client's state restoration event be called for
addresses not in the code cache.  Adds such a call.

Adds comments about translation problems with clean call mangling
which is filed as i#4219.  The issues seen here are all limited to
traces, so the test works around the problems with -disable_traces.

Tested the core drwrap behavior on ARM and AArch64 but missing general
detach support there (#1578) prevents enabling the detach test there.

Issue: #4219
Fixes #4197
derekbruening added a commit that referenced this issue Mar 26, 2020
Adds an rstat "synchs_not_at_safe_spot" measuring the count of
synch_with_thread attempts that had to be re-tried due to the target
being suspended at an un-translatable (i.e., unsafe) spot.

Adds a corresponding dr_stats_t field "synchs_not_at_safe_spot" for
extraction.

These will be useful for measuring the impact of clean calls and other
un-translatable places on detach and other synch operations.

Issue: #4219
derekbruening added a commit that referenced this issue Mar 27, 2020
Adds an rstat "synchs_not_at_safe_spot" measuring the count of
synch_with_thread attempts that had to be re-tried due to the target
being suspended at an un-translatable (i.e., unsafe) spot.

Adds a corresponding dr_stats_t field "synchs_not_at_safe_spot" for
extraction.

These will be useful for measuring the impact of clean calls and other
un-translatable places on detach and other synch operations.

Issue: #4219
@derekbruening
Copy link
Contributor Author

I split off #4232 to cover performance. This issue will focus on correctness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant