-
Notifications
You must be signed in to change notification settings - Fork 569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Translation problems in clean call mangling causing post-detach crashes #4219
Labels
Comments
derekbruening
added a commit
that referenced
this issue
Mar 25, 2020
Adds the signal number and thread id to this fatal message for better diagnostics: <Cannot correctly handle received signal 11 in thread 204511.> Tested in the #4219 failures.
derekbruening
added a commit
that referenced
this issue
Mar 25, 2020
Adds the signal number and thread id to this fatal message for better diagnostics: <Cannot correctly handle received signal 11 in thread 204511.> Tested in the #4219 failures.
derekbruening
added a commit
that referenced
this issue
Mar 25, 2020
Adds an alternative scheme for achieving a post-call control point that does not require flushing or shared data structure examination per-call: replacing the return address with a sentinel. When the new flag DRWRAP_REPLACE_RETADDR is set, the return address is replaced with the address of a single return instruction in the client library, with the real address saved. When a block is seen consisting of that sentinel instruction, post-call callbacks are called, and then control is sent to the saved real address using dr_redirect_native_target(). Adds wrapping tests to drwrap-test. This new scheme requires restoring return addresses on the stack on detach or other state translation. Adds functionality to do so, along with a new test client.drwrap-test-detach. This requires the client's state restoration event be called for addresses not in the code cache. Adds such a call. Adds comments about translation problems with clean call mangling which is filed as i#4219. The issues seen here are all limited to traces, so the test works around the problems with -disable_traces. Tested the core drwrap behavior on ARM and AArch64 but missing general detach support there (#1578) prevents enabling the detach test there. Issue: #4219 Fixes #4197
derekbruening
added a commit
that referenced
this issue
Mar 26, 2020
Adds an alternative scheme for achieving a post-call control point that does not require flushing or shared data structure examination per-call: replacing the return address with a sentinel. When the new flag DRWRAP_REPLACE_RETADDR is set, the return address is replaced with the address of a single return instruction in the client library, with the real address saved. When a block is seen consisting of that sentinel instruction, post-call callbacks are called, and then control is sent to the saved real address using dr_redirect_native_target(). Adds wrapping tests to drwrap-test. This new scheme requires restoring return addresses on the stack on detach or other state translation. Adds functionality to do so, along with a new test client.drwrap-test-detach. This requires the client's state restoration event be called for addresses not in the code cache. Adds such a call. Adds comments about translation problems with clean call mangling which is filed as i#4219. The issues seen here are all limited to traces, so the test works around the problems with -disable_traces. Tested the core drwrap behavior on ARM and AArch64 but missing general detach support there (#1578) prevents enabling the detach test there. Issue: #4219 Fixes #4197
derekbruening
added a commit
that referenced
this issue
Mar 26, 2020
Adds an rstat "synchs_not_at_safe_spot" measuring the count of synch_with_thread attempts that had to be re-tried due to the target being suspended at an un-translatable (i.e., unsafe) spot. Adds a corresponding dr_stats_t field "synchs_not_at_safe_spot" for extraction. These will be useful for measuring the impact of clean calls and other un-translatable places on detach and other synch operations. Issue: #4219
derekbruening
added a commit
that referenced
this issue
Mar 27, 2020
Adds an rstat "synchs_not_at_safe_spot" measuring the count of synch_with_thread attempts that had to be re-tried due to the target being suspended at an un-translatable (i.e., unsafe) spot. Adds a corresponding dr_stats_t field "synchs_not_at_safe_spot" for extraction. These will be useful for measuring the impact of clean calls and other un-translatable places on detach and other synch operations. Issue: #4219
I split off #4232 to cover performance. This issue will focus on correctness. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
While testing my new drwrap post-call method for #4197 I wrote a drwrap detach test and ended up hitting a bunch of failures running it in a loop locally.
The visible symptoms are that right after detach we get the "Cannot correctly handle a received signal.>" fatal error, because a SIGSEGV arrived in a now-native thread for which we have no setup to deliver a signal as we're nearly all the way shut down.
The translate code for clean call mangling is only thinking about synchronous faults, which would only happen on argument setup. At that point, there is indeed a clean app state stored. However, asynchronous translation, such as for detach, could happen anywhere. Plus, there are clean call optimizations that do not store all the state. There is also clean call inlining. It does not look like the translation code is properly considering all of that.
It will take some effort to go through and support all that complexity. As a first step I'm planning to just refuse to translate at all when in clean call mangling. If I put in a change to just refuse to translate in any clean call mangling, it fixes the failures in the drwrap detach test. Although since the args could have a synchronous fault we'll have to translate there, so it will take a little more effort.
I also want to put in some stats on how many asynch translations are refused, to help inform whether we could speed up detach by supporting more translation points.
The text was updated successfully, but these errors were encountered: