Translation problems in clean call mangling causing post-detach crashes #4219

derekbruening · 2020-03-25T15:33:56Z

While testing my new drwrap post-call method for #4197 I wrote a drwrap detach test and ended up hitting a bunch of failures running it in a loop locally.

The visible symptoms are that right after detach we get the "Cannot correctly handle a received signal.>" fatal error, because a SIGSEGV arrived in a now-native thread for which we have no setup to deliver a signal as we're nearly all the way shut down.

The translate code for clean call mangling is only thinking about synchronous faults, which would only happen on argument setup. At that point, there is indeed a clean app state stored. However, asynchronous translation, such as for detach, could happen anywhere. Plus, there are clean call optimizations that do not store all the state. There is also clean call inlining. It does not look like the translation code is properly considering all of that.

It will take some effort to go through and support all that complexity. As a first step I'm planning to just refuse to translate at all when in clean call mangling. If I put in a change to just refuse to translate in any clean call mangling, it fixes the failures in the drwrap detach test. Although since the args could have a synchronous fault we'll have to translate there, so it will take a little more effort.

I also want to put in some stats on how many asynch translations are refused, to help inform whether we could speed up detach by supporting more translation points.

Adds the signal number and thread id to this fatal message for better diagnostics: <Cannot correctly handle received signal 11 in thread 204511.> Tested in the #4219 failures.

Adds an alternative scheme for achieving a post-call control point that does not require flushing or shared data structure examination per-call: replacing the return address with a sentinel. When the new flag DRWRAP_REPLACE_RETADDR is set, the return address is replaced with the address of a single return instruction in the client library, with the real address saved. When a block is seen consisting of that sentinel instruction, post-call callbacks are called, and then control is sent to the saved real address using dr_redirect_native_target(). Adds wrapping tests to drwrap-test. This new scheme requires restoring return addresses on the stack on detach or other state translation. Adds functionality to do so, along with a new test client.drwrap-test-detach. This requires the client's state restoration event be called for addresses not in the code cache. Adds such a call. Adds comments about translation problems with clean call mangling which is filed as i#4219. The issues seen here are all limited to traces, so the test works around the problems with -disable_traces. Tested the core drwrap behavior on ARM and AArch64 but missing general detach support there (#1578) prevents enabling the detach test there. Issue: #4219 Fixes #4197

Adds an rstat "synchs_not_at_safe_spot" measuring the count of synch_with_thread attempts that had to be re-tried due to the target being suspended at an un-translatable (i.e., unsafe) spot. Adds a corresponding dr_stats_t field "synchs_not_at_safe_spot" for extraction. These will be useful for measuring the impact of clean calls and other un-translatable places on detach and other synch operations. Issue: #4219

derekbruening · 2020-03-27T18:53:44Z

I split off #4232 to cover performance. This issue will focus on correctness.

derekbruening added the Bug-AppCrash label Mar 25, 2020

derekbruening self-assigned this Mar 25, 2020

derekbruening mentioned this issue Mar 25, 2020

Add signal and thread numbers to fatal signal message #4220

Merged

derekbruening mentioned this issue Mar 25, 2020

i#4197: Add new drwrap post-call scheme: replace retaddr #4221

Merged

derekbruening mentioned this issue Mar 26, 2020

i#4219 xl8: Add an rstat on synch xl8 failures #4228

Merged

derekbruening mentioned this issue Mar 27, 2020

Speed up detach by reducing translation retries #4232

Open

derekbruening mentioned this issue Feb 5, 2021

client.drwrap-test-detach fails 28/100 times on aarch64: hang #4718

Closed

derekbruening mentioned this issue Dec 15, 2022

i#5786: Add precise clean call mangling identification #5791

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Translation problems in clean call mangling causing post-detach crashes #4219

Translation problems in clean call mangling causing post-detach crashes #4219

derekbruening commented Mar 25, 2020

derekbruening commented Mar 27, 2020

Translation problems in clean call mangling causing post-detach crashes #4219

Translation problems in clean call mangling causing post-detach crashes #4219

Comments

derekbruening commented Mar 25, 2020

derekbruening commented Mar 27, 2020