
support re-attach after full detach with the same DR library instance #2157

Open · derekbruening opened this issue Jan 27, 2017 · 3 comments

derekbruening (Contributor) commented:
Split from #95

For something like a ptrace-based external attach with an injected DR library, the solution here would be to remove the library completely on detach, leaving no extra work for a re-attach. This issue covers instead a re-attach for a DR library that we cannot remove, as it is either statically linked with the app or was not loaded by us as part of the attach but rather by the system loader up front.

Xref discussion on needing to re-attach after a full detach for start/stop when stop always does a full detach: #95 (comment)

It's worth repeating the main paragraph there:

Supporting re-takeover when stopping is tied to full cleanup is problematic
as it requires that DR fully zero all static and global variables. There
are many cases of static variables scattered around, such as inside
DO_ONCE, in the initializers for Extensions (drmgr, etc.), in memoized
functions, etc. We'd have to make all those non-global-scope static vars
exposed to get access to them, or try to zero out the whole .data and .bss
(which by itself is not enough as there's a lot of non-zero-init stuff in
.data). This has performance implications for chains of short-lived
processes. We also have to deal with subtle things like #1271, where we
threw out the .1config file under the assumption that we wouldn't re-read
the options later. Plus, even if we make DR work in this model,
third-party Extensions are unlikely to follow this: we would have to
noisily demand a different programming model than is usually assumed.

Despite all of those problems, we have gotten such a re-attach to work for simple cases in the past. Even if the solution is fragile, "hacky", and does not cover all corner cases, best-effort support may still be worthwhile, as it removes a severe limitation on useful usage scenarios such as bursty traces.
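To make the static-variable problem concrete, here is a minimal, purely illustrative C sketch (not DR's actual DO_ONCE macro) of why a re-attach with the same library instance silently skips one-time initialization unless every such flag is found and reset:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch: a DO_ONCE-style static flag keeps its value across
 * a detach, so a later re-attach skips one-time initialization unless
 * someone explicitly resets it.  Names here are hypothetical. */
static bool initialized_once;
static int init_count;

static void
do_once_init(void)
{
    if (!initialized_once) { /* survives detach: still true on re-attach */
        initialized_once = true;
        init_count++;
    }
}

/* A full re-init would have to locate and reset every such flag,
 * including ones hidden at function scope in third-party Extensions. */
static void
reset_for_reattach(void)
{
    initialized_once = false;
}
```

The difficulty described above is that flags like this are scattered across many translation units, often at non-global scope, so there is no single place to perform the reset.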

derekbruening (Contributor, Author) commented:
I put in initial best-effort support in 2dd9659

However, it ends up failing on Travis in tests that pass locally:

https://travis-ci.org/DynamoRIO/dynamorio/builds/201376528
debug-internal-32: 259 tests passed, **** 3 tests failed: ****
	code_api|tool.drcacheoff.burst_static =>    (16821).  Internal Error: DynamoRIO debug check failure: 
	code_api|tool.drcacheoff.burst_client =>    (16840).  Internal Error: DynamoRIO debug check failure: 
	code_api|api.static_detach =>  Application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-32/suite/tests/bin/api.static_detach (16921).  Internal Error: DynamoRIO debug check failure: /home/travis/build/DynamoRIO/dynamorio/core/unix/os.c:8907 vsyscall_page_start == NULL 

I added a diagnostic via a pull request and got:

https://travis-ci.org/DynamoRIO/dynamorio/jobs/201391836
254: Test command: /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-32/bin32/runstats "-s" "90" "-killpg" "-silent" "-env" "LD_LIBRARY_PATH" "/home/travis/build/DynamoRIO/dynamorio/build_debug-internal-32/lib32/debug:/home/travis/build/DynamoRIO/dynamorio/build_debug-internal-32/ext/lib32/debug:" "-env" "DYNAMORIO_OPTIONS" "-stderr_mask 0xC -dumpcore_mask 0 -code_api" "/home/travis/build/DynamoRIO/dynamorio/build_debug-internal-32/suite/tests/bin/api.static_detach"
254: Test timeout computed to be: 600
252: pre-DR stop
252: all done
254: pre-DR init
254: vsyscall_page_start is 0x00000000
254: in dr_client_main
254: pre-DR start
254: pre-DR detach
254: Saw some bb events
254: clearing vsyscall_page_start
254: re-attach attempt
254: vsyscall_page_start is 0x00000000
254: vsyscall_page_start is 0xf77bb000
254: <Application /home/travis/build/DynamoRIO/dynamorio/build_debug-internal-32/suite/tests/bin/api.static_detach (17053).  Internal Error: DynamoRIO debug check failure: /home/travis/build/DynamoRIO/dynamorio/core/unix/os.c:8909 vsyscall_page_start == NULL
254: (Error occurred @457 frags)
254: version 6.2.17211, custom build
254: -stderr_mask 12 -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct 
254: 0xffae96ec 0x08136695
254: 0xffae991c 0x082ea30c
254: 0xffae9a20 0x081d9d31
254: 0xffae9aac 0x080b2f1c
254: 0xffaea2e8 0x080b65ee
254: 0xffaea300 0x080b6875
254: 0xffaea310 0x08051c86
254: 0xffaea328 0xf75c7ad3>
254/262 Test #254: code_api|api.static_detach .......................................***Failed  Required regular expression not found.Regex=[^pre-DR init

So either vdso is in the maps file twice, or find_executable_vm_areas is
called twice. Both are odd. I'm disabling the assert temporarily while I try to reproduce this or investigate further using pull requests.

derekbruening (Contributor, Author) commented:
I can repro in a 14.04.5 VM (but not in 15.04 or on Fedora). The vdso pages are split into two entries,
presumably by something DR did to them (the vsyscall hook, I suppose):

f7740000-f7741000 r-xp 00000000 00:00 0                                  [vdso]
f7741000-f7742000 r-xp 00000000 00:00 0                                  [vdso]
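The two [vdso] lines above are contiguous and have identical permissions, so one way to tolerate the split is to coalesce adjacent maps entries before asserting on them. A minimal sketch, where `map_entry_t` and `coalesce_regions` are hypothetical names and not DR's actual maps-iteration API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical maps-file entry; a real reader would parse /proc/pid/maps. */
typedef struct {
    size_t start, end;
    const char *perms;
    const char *name;
} map_entry_t;

/* Merge adjacent entries that touch and share name + permissions;
 * returns the new entry count. */
static size_t
coalesce_regions(map_entry_t *e, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (out > 0 && e[out - 1].end == e[i].start &&
            strcmp(e[out - 1].perms, e[i].perms) == 0 &&
            strcmp(e[out - 1].name, e[i].name) == 0) {
            e[out - 1].end = e[i].end; /* extend the previous region */
        } else {
            e[out++] = e[i];
        }
    }
    return out;
}
```

With this, the split vdso above collapses back into a single region and a "one vdso mapping" assert would hold.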

Carrotman42 added a commit that referenced this issue Sep 11, 2017
When doing_detach is false, the current stack frame is actually in the
heap, so unmapping causes a segfault.

Issue #2157
derekbruening pushed a commit that referenced this issue Sep 25, 2017
On UNIX we're on a permanent non-vmm stack at detach, so we can free
the full vmm region.

I also included a fix to vmm_heap_unit_init which accidentally left
vmh->alloc_start uninitialized in the branch related to reserving OS
memory at a preferred location.

Issue: #2157
Carrotman42 added a commit that referenced this issue Nov 15, 2017
Take care to set the registered_fault bool back to false after
event unregister so that it can be re-registered later.

Issue #2157
Carrotman42 added a commit that referenced this issue Nov 16, 2017
Take care to set the registered_fault bool back to false after
event unregister so that it can be re-registered later.

Issue #2157
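The registered-flag fix in the commits above follows a simple pattern: if the "already registered" bool is not cleared on unregister, a second attach's registration call becomes a silent no-op. A minimal sketch with hypothetical names (not drmgr's actual API):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: a guard flag for a one-time event registration. */
static bool registered_fault;
static int handler_count;

static bool
register_fault_event(void)
{
    if (registered_fault)
        return false; /* double registration refused */
    registered_fault = true;
    handler_count++;
    return true;
}

static void
unregister_fault_event(void)
{
    handler_count--;
    registered_fault = false; /* the fix: allow later re-registration */
}
```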
fhahn pushed a commit that referenced this issue Dec 4, 2017
On UNIX we're on a permanent non-vmm stack at detach, so we can free
the full vmm region.

I also included a fix to vmm_heap_unit_init which accidentally left
vmh->alloc_start uninitialized in the branch related to reserving OS
memory at a preferred location.

Issue: #2157
fhahn pushed a commit that referenced this issue Dec 4, 2017
Take care to set the registered_fault bool back to false after
event unregister so that it can be re-registered later.

Issue #2157
Carrotman42 added a commit that referenced this issue Dec 8, 2017
Fixes a reattach-based crash where this TLS leak caused drmgr to run out
of TLS slots.

Issue #2157
Carrotman42 added a commit that referenced this issue Dec 9, 2017
Fixes a reattach-based crash where this TLS leak caused drmgr to run out
of TLS slots.

Issue #2157
Carrotman42 added a commit that referenced this issue Dec 13, 2017
After 5 seconds of waiting for a thread to acknowledge a received
signal, os_thread_suspend now returns false so that the caller can
retry.

Issue #2157
Carrotman42 added a commit that referenced this issue Dec 15, 2017
On Unix, after 5 seconds of waiting for a thread to acknowledge a received
signal, os_thread_suspend now returns false so that the caller can
retry.

This fixes a hang related to creating a new application thread close to the time when
detach happens.

Issue: #2157
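The retry scheme above (bound the wait for a suspend acknowledgment and report failure so the caller can try again) can be sketched as follows. This is a simulation with hypothetical names: the "thread acknowledges after N polls" model stands in for the real signal-acknowledgment wait in DR's `os_thread_suspend`:

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated target thread: how many poll intervals until it acks. */
typedef struct {
    int polls_until_ack;
} fake_thread_t;

/* Returns true if the thread acknowledged within max_polls intervals,
 * else false so the caller can retry (instead of blocking forever). */
static bool
suspend_with_timeout(fake_thread_t *t, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (t->polls_until_ack <= 0)
            return true;
        t->polls_until_ack--; /* stand-in for sleeping one interval */
    }
    return t->polls_until_ack <= 0;
}
```

The design point is that a bounded wait converts a rare detach-time hang into a recoverable failure the caller can loop on.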
derekbruening (Contributor, Author) commented:
Xref #3065

derekbruening pushed a commit that referenced this issue Jun 22, 2018
This commit supplements PR #3050.
We also need to clear postcall_cache in drwrap_exit; otherwise
post_callback will not be invoked on re-attach. This is because the registration
of post_callback relies on pre_callback, and pre_callback checks postcall_cache
before registering post_callback. The stale data in postcall_cache prevents
post_callback from being registered in the hash table.

Issue: #3065, #2157 
Fixes #3049
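The stale-cache mechanism in that commit message can be sketched in miniature. Everything here is illustrative (a flat array stands in for drwrap's hash table; the function names mirror but are not drwrap's real internals): the pre-call hook only registers a post-call hook for addresses not already cached, so a cache surviving from the previous attach suppresses registration on re-attach unless exit clears it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SIZE 16
static size_t postcall_cache[CACHE_SIZE]; /* stand-in for the hash table */
static size_t cache_count;
static int registered_hooks;

static bool
cache_contains(size_t pc)
{
    for (size_t i = 0; i < cache_count; i++)
        if (postcall_cache[i] == pc)
            return true;
    return false;
}

static void
pre_callback(size_t pc)
{
    if (!cache_contains(pc)) { /* a stale entry skips this branch */
        if (cache_count < CACHE_SIZE)
            postcall_cache[cache_count++] = pc;
        registered_hooks++;
    }
}

static void
drwrap_exit_clear_cache(void)
{
    cache_count = 0; /* the fix: forget cached post-call sites on exit */
}
```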