ASSERT dynamo_exited: on detach, thread suddenly dies #2021

Closed
derekbruening opened this issue Oct 13, 2016 · 1 comment

The new test for #2016, #2018, #1921, and #95, api.static_signal, is
failing once every hundred or so runs. The ctest output is a rank order
violation:

  pre-DR stop
  <rank order violation report_buf_lock(mutex)@/work/dr/git/src/core/utils.c:2053 acquired after memory_info_buf_lock(mutex)@/work/dr/git/src/core/unix/memquery_linux.c:71 in tid:7cbe>
  <end of output>
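
To make the rank checking concrete: under deadlock avoidance every DR lock
carries a rank, and acquiring a lock out of rank order is flagged because
some pair of threads could then take the two locks in opposite orders and
deadlock. A standalone sketch of the idea (the rank 86 matches
memory_info_buf_lock's rank in the dump further down; report_buf_lock's
rank and the increasing-rank convention here are assumptions, the real
checks live in core/utils.c):

  #include <pthread.h>
  #include <stdio.h>

  typedef struct {
      pthread_mutex_t lock;
      int rank;
      const char *name;
  } ranked_mutex_t;

  /* Highest rank currently held by this thread (simplified: DR tracks
   * the full list of owned locks per thread). */
  static __thread int highest_held_rank = -1;

  static void
  ranked_lock(ranked_mutex_t *m)
  {
      if (m->rank <= highest_held_rank)
          fprintf(stderr, "<rank order violation %s>\n", m->name);
      pthread_mutex_lock(&m->lock);
      highest_held_rank = m->rank;
  }

  int
  main(void)
  {
      ranked_mutex_t memory_info_buf =
          { PTHREAD_MUTEX_INITIALIZER, 86, "memory_info_buf_lock" };
      ranked_mutex_t report_buf =
          { PTHREAD_MUTEX_INITIALIZER, 50, "report_buf_lock" };
      ranked_lock(&memory_info_buf);
      ranked_lock(&report_buf); /* lower rank acquired second: flagged */
      return 0;
  }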

Trying to pause there does not work: that thread is killed immediately for
unknown reasons.

Running with -no_deadlock_avoidance hits an ASSERT(dynamo_exited) in
build_bb_ilist, so the dcontext is NULL. Adding some detach output
indicates that this happens after entering detach_on_permanent_stack() but
before its synchall finishes.

  pre-DR stop
  <Detaching from application /work/dr/git/build_x64_dbg_tests/suite/tests/bin/api.static_signal (24005)>
  <Application /work/dr/git/build_x64_dbg_tests/suite/tests/bin/api.static_signal (24005).  Internal Error: DynamoRIO debug check failure: /work/dr/git/src/core/arch/interp.c:3271 dynamo_exited
  (Error occurred @530 frags)
  version 6.2.17086, custom build
  -no_deadlock_avoidance -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -native_exec_default_list '' -no_native_exec_managed_code -no_indcall2direct 
  0x000000004f2ba970 0x00000000004f8041
  0x000000004f2baad0 0x0000000000666061
  0x000000004f2bad30 0x000000000066ebf1
  0x000000004f2baf10 0x00000000004e1f59
  0x000000004f2baff0 0x000000004f20eec5
  0x00007f68e25008e0 0x00007f68e2ad7430
  0x00007f68e2500f00 0x000000000040bd6e
  0x00007f68e2500f50 0x00007f68e2ace555>

The 2nd thread in this app shouldn't be making new bbs at this point.
Could it be bb creation for xl8 during the synch? But why is it missing a
dcontext? And why does it die?
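
For context, the detach flow at issue, reduced to stubs (the helper names
are placeholders, not DR's real signatures; the actual sequence is
detach_on_permanent_stack() in core/synch.c):

  #include <stdbool.h>
  #include <stdio.h>

  static bool doing_detach; /* the "master" flag discussed below */

  static void synch_with_all_threads_stub(void) { puts("synchall: suspend every thread"); }
  static void resume_into_sig_detach_stub(void) { puts("each thread detaches itself on resume"); }
  static void cleanup_stub(void)                { puts("free DR state"); }

  static void
  detach_on_permanent_stack_sketch(void)
  {
      doing_detach = true;           /* set before the synchall starts */
      synch_with_all_threads_stub(); /* the crash happens in this window */
      resume_into_sig_detach_stub();
      cleanup_stub();
  }

  int
  main(void)
  {
      detach_on_permanent_stack_sketch();
      return 0;
  }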

Finally got some data by unlimiting core dumps (ulimit -c unlimited):

in dr_client_main
Sending SIGUSR1 pre-DR-start
Got SIGUSR1
pre-DR start
Sending SIGUSR1 under DR
Got SIGUSR1
pre-DR stop
<Detaching from application /work/dr/git/build_x64_dbg_tests/suite/tests/bin/api.static_signal (30151)>
<rank order violation report_buf_lock(mutex)@/work/dr/git/src/core/utils.c:2053 acquired after memory_info_buf_lock(mutex)@/work/dr/git/src/core/unix/memquery_linux.c:71 in tid:75c8>
Segmentation fault (core dumped)
Core was generated by `suite/tests/bin/api.static_signal'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  syscall_ready () at /work/dr/git/src/core/arch/x86/x86_shared.asm:180
180             pop      REG_XBX
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.21-13.fc22.x86_64

(gdb) p 0x75c8
$1 = 30152
(gdb) thread apply all bt

Thread 2 (Thread 0x7f03ddead700 (LWP 30151)):
#0  syscall_ready () at /work/dr/git/src/core/arch/x86/x86_shared.asm:180
#1  0x000000000080c6d0 in ?? ()
#2  0x00000000006ebacf in ksynch_wait (futex=0x4dfb90dc, mustbe=0) at /work/dr/git/src/core/unix/ksynch_linux.c:120
#3  0x00000000006b8753 in os_thread_suspend (tr=0x4def0fa0) at /work/dr/git/src/core/unix/os.c:3173
#4  0x00000000005e4a79 in synch_with_thread (id=30152, block=false, hold_initexit_lock=true, caller_state=THREAD_SYNCH_NONE, 
    desired_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT, flags=2) at /work/dr/git/src/core/synch.c:926
#5  0x00000000005e636a in synch_with_all_threads (desired_synch_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT, 
    threads_out=0x7ffe20220f60, num_threads_out=0x7ffe20220f5c, cur_state=THREAD_SYNCH_NO_LOCKS_NO_XFER, flags=2)
    at /work/dr/git/src/core/synch.c:1326
#6  0x00000000005e84e2 in detach_on_permanent_stack (internal=true, do_cleanup=true) at /work/dr/git/src/core/synch.c:1912
#7  0x000000000047401e in dr_app_stop_and_cleanup () at /work/dr/git/src/core/dynamo.c:2647
#8  0x000000000040c0a2 in main (argc=1, argv=0x7ffe202210c8) at /work/dr/git/src/suite/tests/api/static_signal.c:166

Thread 1 (Thread 0x7f03dd1cf700 (LWP 30152)):
#0  syscall_ready () at /work/dr/git/src/core/arch/x86/x86_shared.asm:180
#1  0x0000000000000006 in ?? ()
#2  0x00000000006ebacf in ksynch_wait (futex=0x7b74a0 <memory_info_buf_lock>, mustbe=1)
    at /work/dr/git/src/core/unix/ksynch_linux.c:120
#3  0x00000000006c7c4b in mutex_wait_contended_lock (lock=0x7b74a0 <memory_info_buf_lock>) at /work/dr/git/src/core/unix/os.c:8861
#4  0x00000000004fa79d in mutex_lock (lock=0x7b74a0 <memory_info_buf_lock>) at /work/dr/git/src/core/utils.c:888
#5  0x00000000006e2ed3 in memquery_iterator_start (iter=0x4dfe2430, start=0x4dfe2000 '\253' <repeats 200 times>..., may_alloc=false)
    at /work/dr/git/src/core/unix/memquery_linux.c:139
#6  0x00000000006e37df in memquery_from_os (pc=0x4dfe2000 '\253' <repeats 200 times>..., info=0x4dfe2550, have_type=0x4dfe251e)
    at /work/dr/git/src/core/unix/memquery_linux.c:328
#7  0x00000000006c7a99 in query_memory_ex_from_os (pc=0x4dfe2000 '\253' <repeats 200 times>..., info=0x4dfe2550)
    at /work/dr/git/src/core/unix/os.c:8775
#8  0x00000000006c79f6 in get_memory_info (pc=0x4dfe2000 '\253' <repeats 200 times>..., base_pc=0x0, size=0x0, prot=0x4dfe25a4)
    at /work/dr/git/src/core/unix/os.c:8752
#9  0x00000000006cf8ad in copy_frame_to_stack (dcontext=0x4df93980, sig=12, frame=0x4dfe2738, sp=0x4dfe2738 "\357\260i", 
    from_pending=false) at /work/dr/git/src/core/unix/signal.c:2682
#10 0x00000000006dc213 in sig_detach (dcontext=0x4df93980, frame=0x4dfe2738, detached=0x4dfb90f8)
    at /work/dr/git/src/core/unix/signal.c:6217
#11 0x00000000006dc848 in handle_suspend_signal (dcontext=0x4df93980, ucxt=0x4dfe2740, frame=0x4dfe2738)
    at /work/dr/git/src/core/unix/signal.c:6352
#12 0x00000000006d5747 in master_signal_handler_C (sig=12, siginfo=0x4dfe2870, ucxt=0x4dfe2740, xsp=0x4dfe2738 "\357\260i")
    at /work/dr/git/src/core/unix/signal.c:4444
(gdb) info thread
  Id   Target Id         Frame 
  2    Thread 0x7f03ddead700 (LWP 30151) syscall_ready () at /work/dr/git/src/core/arch/x86/x86_shared.asm:180
+ 1    Thread 0x7f03dd1cf700 (LWP 30152) syscall_ready () at /work/dr/git/src/core/arch/x86/x86_shared.asm:180
(gdb) info reg
rax            0xfffffffffffffffc       -4
rbp            0x4dfe2340       0x4dfe2340
rsp            0x4dfe2300       0x4dfe2300
(gdb) x/10i $rip-5
   0x6f3312 <syscall_ready>:    mov    %rcx,%r10
   0x6f3315 <syscall_ready+3>:  syscall 
=> 0x6f3317 <syscall_ready+5>:  pop    %rbx
   0x6f3318 <syscall_ready+6>:  retq   

So SYS_futex gets interrupted (rax == -4, i.e., -EINTR) and then crashes
because its stack is messed up?!?
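
The interruption by itself is routine: the raw syscall returns -EINTR (the
-4 in rax) and the wait loop simply retries. A standalone miniature of such
a retry loop (not DR's ksynch_wait; through the libc syscall() wrapper the
-EINTR shows up as -1 with errno set):

  #include <errno.h>
  #include <linux/futex.h>
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static long
  futex_wait(int *futex, int mustbe)
  {
      long res;
      do {
          res = syscall(SYS_futex, futex, FUTEX_WAIT, mustbe, NULL, NULL, 0);
      } while (res == -1 && errno == EINTR); /* a signal arrived: wait again */
      return res;
  }

  int
  main(void)
  {
      int f = 0;
      /* Value mismatch (0 != 1), so the kernel returns EAGAIN at once
       * rather than blocking. */
      futex_wait(&f, 1);
      printf("futex_wait returned, errno=%d\n", errno);
      return 0;
  }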

Why is memory_info_buf_lock going down a contended path?

(gdb) p *lock
$2 = {
  lock_requests = 1, 
  contended_event = -1, 
  name = 0x77d248 "memory_info_buf_lock(mutex)@/work/dr/git/src/core/unix/memquery_linux.c:71", 
  rank = 86, 
  owner = 30152, 
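
What the contended path implies, as a sketch of a DR-style mutex fast/slow
split (field names match the dump above; the -1-free/0-held/N-waiters
convention is an assumption, the real code being mutex_lock in
core/utils.c and mutex_wait_contended_lock in core/unix/os.c):

  #include <stdatomic.h>
  #include <stdio.h>

  #define LOCK_FREE_STATE -1

  typedef struct {
      atomic_int lock_requests; /* -1 free; 0 held; >0 held plus waiters */
      int contended_event;      /* futex for waiters, created lazily */
  } sketch_mutex_t;

  static void
  mutex_wait_contended_stub(sketch_mutex_t *m)
  {
      (void)m;
      puts("slow path: futex wait, interruptible by a signal");
  }

  static void
  sketch_mutex_lock(sketch_mutex_t *m)
  {
      if (atomic_fetch_add(&m->lock_requests, 1) == LOCK_FREE_STATE)
          return;                   /* fast path: the lock was free */
      mutex_wait_contended_stub(m); /* slow path, as in Thread 1's backtrace */
  }

  int
  main(void)
  {
      sketch_mutex_t m = { LOCK_FREE_STATE, -1 };
      sketch_mutex_lock(&m); /* uncontended acquire */
      sketch_mutex_lock(&m); /* second acquire with no release: slow path,
                              * leaving lock_requests == 1 as in the dump */
      return 0;
  }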

Why is Thread 1 running sig_detach() already, when Thread 2 is just trying
to suspend it??

The state doesn't make sense: if Thread 1 was somehow already in the
suspend loop in handle_suspend_signal() when Thread 2 asked to suspend it
here, and it made its way to sig_detach, ostd->resumed should now be 1,
not 0. Plus, suspend_count should prevent that from happening.

(gdb) p /x *ostd
$5 = {
  stack_base = 0x0, 
  stack_top = 0x0, 
  suspend_lock = {
    lock_requests = 0xffffffff, 
    contended_event = 0xffffffff, 
    name = 0x774be8, 
    rank = 0x3d, 
    owner = 0x0, 
    owning_dcontext = 0x0, 
    prev_owned_lock = 0x0, 
    count_times_acquired = 0x29, 
    count_times_contended = 0x0, 
    count_times_spin_pause = 0x0, 
    max_contended_requests = 0x0, 
    count_times_spin_only = 0x0, 
    prev_process_lock = 0x7bb180, 
    next_process_lock = 0x7baa40, 
    callstack = {0x0, 0x0, 0x0, 0x0}, 
    app_lock = 0x0, 
    deleted = 0x0
  }, 
  suspend_count = 0x1, 
  fork_threads = 0x0, 
  fork_num_threads = 0x0, 
  suspended = 0x0, 
  wakeup = 0x0, 
  resumed = 0x0, 
  suspended_sigcxt = 0x0, 
  terminate = 0x0, 
  terminated = 0x0, 
  detached = 0x0, 
  retakeover = 0x0, 
  processing_signal = 0x0, 
  priv_lib_tls_base = 0x4dfa6700, 
  priv_alt_tls_base = 0x4dfa1000, 
  dr_tls_base = 0x4dfa1000, 
  app_thread_areas = 0x4dfb97a8
}
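
For reference, the handshake those suspended/wakeup/resumed futexes
implement, as a runnable miniature with C11 atomics standing in for DR's
ksynch futexes (a sketch of the protocol, not DR's code):

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_int suspended, wakeup, resumed;

  static void
  spin_until(atomic_int *flag, int val)
  {
      while (atomic_load(flag) != val)
          ; /* DR blocks on a futex via ksynch_wait() instead of spinning */
  }

  /* Stand-in for the target thread inside handle_suspend_signal(). */
  static void *
  target(void *arg)
  {
      (void)arg;
      atomic_store(&suspended, 1); /* tell the suspender we're parked */
      spin_until(&wakeup, 1);      /* wait to be released */
      atomic_store(&resumed, 1);   /* ack the resume; on a detach,
                                    * sig_detach() runs in this handler,
                                    * by which point resumed should be 1 */
      return NULL;
  }

  int
  main(void)
  {
      pthread_t t;
      pthread_create(&t, NULL, target, NULL);
      spin_until(&suspended, 1); /* os_thread_suspend(): wait for the ack */
      /* ...the suspender examines/translates the target's context... */
      atomic_store(&wakeup, 1);  /* release the target */
      spin_until(&resumed, 1);
      pthread_join(t, NULL);
      puts("handshake complete");
      return 0;
  }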

The suspend signal interrupted another syscall that had just returned EINTR:

(gdb) p /x *ucxt
$3 = {
  uc_flags = 0x1, 
  uc_link = 0x0, 
  uc_stack = {
    ss_sp = 0x4dfe1000, 
    ss_flags = 0x1, 
    ss_size = 0xe000
  }, 
  uc_mcontext = {
    r8 = 0x0, 
    r9 = 0x0, 
    r10 = 0x0, 
    r11 = 0x202, 
    r12 = 0x7ffe20220f3f, 
    r13 = 0x7f03dd1cf700, 
    r14 = 0x800000, 
    r15 = 0x0, 
    rdi = 0x7b74a0, 
    rsi = 0x0, 
    rbp = 0x4dfe2d40, 
    rbx = 0x6, 
    rdx = 0x1, 
    rax = 0xfffffffffffffffc, 
    rcx = 0x6f3317, 
    rsp = 0x4dfe2d00, 
    rip = 0x6f3317, 
    eflags = 0x202, 
    cs = 0x33, 
    gs = 0x0, 
    fs = 0x0, 
    __pad0 = 0xabab, 
    err = 0x0, 
    trapno = 0x0, 
    oldmask = 0xfffffffffffbf2ff, 
    cr2 = 0x0, 
    fpstate = 0x4dfe2900, 
    reserved1 = {0xabababababababab, 0xabababababababab, 0xabababababababab, 0xabababababababab, 0xabababababababab, 
      0xabababababababab, 0xabababababababab, 0xabababababababab}
  }, 
  uc_sigmask = {
    sig = {0xfffffffe7ffbfeff}
  }
}
(gdb) x/4i ucxt->uc_mcontext.rip
   0x6f3317 <syscall_ready+5>:  pop    %rbx
   0x6f3318 <syscall_ready+6>:  retq   
   0x6f3319 <dr_fpu_exception_init>:    fninit 
   0x6f331b <dr_fpu_exception_init+2>:  pushq  $0x1f80
(gdb) x/200wx 0x4dfe2d00
0x4dfe2d00:     0x00000006      0x00000000      0x006ebacf      0x00000000
0x4dfe2d10:     0x00000000      0x00000000      0x00000000      0x00000000
0x4dfe2d20:     0x4dfe2d40      0x00000001      0x007b74a0      0x00000000
0x4dfe2d30:     0x00000000      0xffffffff      0x00000006      0x00000002
0x4dfe2d40:     0x4dfe2d80      0x00000000      0x006c7c4b      0x00000000
0x4dfe2d50:     0x4dfe2d01      0x00000000      0x007b74a0      0x00000000
0x4dfe2d60:     0xabababab      0xabababab      0x007b74a0      0x00000000
0x4dfe2d70:     0x00000006      0x00000000      0x4df93980      0x00000000
0x4dfe2d80:     0x4dfe2db0      0x00000000      0x004fa79d      0x00000000
0x4dfe2d90:     0xabababab      0xabababab      0x007b74a0      0x00000000
0x4dfe2da0:     0xabababab      0xabababab      0x0100abab      0x00001b58
0x4dfe2db0:     0x4dfe2e00      0x00000000      0x006e2ed3      0x00000000
0x4dfe2dc0:     0xabababab      0xabababab      0xabababab      0xababab00
0x4dfe2dd0:     0x4dfe3000      0x00000000      0x4dfe2e30      0x00000000
0x4dfe2de0:     0xabababab      0xabababab      0xabababab      0xabababab
0x4dfe2df0:     0xabababab      0xabababab      0x4dfe2e61      0x00000000
0x4dfe2e00:     0x4dfe2ef0      0x00000000      0x006e37df      0x00000000
0x4dfe2e10:     0xabababab      0xabababab      0x4dfe2f1e      0x00000000
0x4dfe2e20:     0x4dfe2f50      0x00000000      0x4dfe3000      0x00000000
0x4dfe2e30:     0xabababab      0xabababab      0xabababab      0xabababab
0x4dfe2e40:     0xabababab      0xabababab      0xabababab      0xabababab
0x4dfe2e50:     0xabababab      0xabababab      0xabababab      0xabababab
0x4dfe2e60:     0xabababab      0xabababab      0xabababab      0xabababab
0x4dfe2e70:     0xabababab      0xabababab      0xabababab      0xabababab
0x4dfe2e80:     0xabababab      0xabababab      0xabababab      0xabababab
0x4dfe2e90:     0xabababab      0xabababab      0xabababab      0xabababab
0x4dfe2ea0:     0xabababab      0xabababab      0xabababab      0xabababab
0x4dfe2eb0:     0xabababab      0xabababab      0xabababab      0xabababab
0x4dfe2ec0:     0xabababab      0xabababab      0xabababab      0x00ababab
0x4dfe2ed0:     0xffffffff      0xffffffff      0x00000000      0x00000000
0x4dfe2ee0:     0xabababab      0xabababab      0x00000006      0x00000000
0x4dfe2ef0:     0x4dfe2f20      0x00000000      0x006c7a99      0x00000000
0x4dfe2f00:     0x4dfe2f50      0x00000000      0x4dfe3000      0x00000000
0x4dfe2f10:     0x00000001      0x00000000      0x4dfe3000      0x00000000
0x4dfe2f20:     0x4dfe2f70      0x00000000      0x006c79f6      0x00000000
0x4dfe2f30:     0x4dfe2fa4      0x00000000      0x00000000      0x00000000
0x4dfe2f40:     0x00000000      0x00000000      0x4dfe3000      0x00000000
0x4dfe2f50:     0x4dfe3138      0x00000000      0x4dfe2f70      0x00000000
0x4dfe2f60:     0x006ceb52      0x00000000      0x4dfe3138      0x00000000
0x4dfe2f70:     0x4dfe2fe0      0x00000000      0x006cf8ad      0x00000000
0x4dfe2f80:     0x4dfe3138      0x00000000      0x4dfe3138      0x00000000
0x4dfe2f90:     0xababab00      0x0000000c      0x4df93980      0x00000000
0x4dfe2fa0:     0xabababab      0x0000000c      0x4df93980      0x00000000
0x4dfe2fb0:     0xabababab      0xabababab      0x4dfe3168      0x00000000
0x4dfe2fc0:     0x4dfb9140      0x00000538      0x000001b8      0x01000000
0x4dfe2fd0:     0x4dfb9140      0x00000000      0x4dfe3000      0x00000000
0x4dfe2fe0:     0x4dfe3040      0x00000000      0x006dc213      0x00000000
0x4dfe2ff0:     0xabababab      0xabababab      0x4dfb90f8      0x00000000
0x4dfe3000:     0x4dfe3138      0x00000000      0x4df93980      0x00000000
0x4dfe3010:     0x00000000      0x00000000      0x00000000      0x00000000
(gdb) x/1i 0x006ebacf
   0x6ebacf <ksynch_wait+127>:  add    $0x10,%rsp
(gdb) x/1i 0x006c7c4b
   0x6c7c4b <mutex_wait_contended_lock+175>:    mov    %rax,-0x18(%rbp)
(gdb) x/1i 0x006e2ed3
   0x6e2ed3 <memquery_iterator_start+95>:       mov    -0x8(%rbp),%rax
(gdb) x/1i 0x006c7a99
   0x6c7a99 <query_memory_ex_from_os+43>:       mov    %al,-0x1(%rbp)
(gdb) x/1i 0x006c79f6
   0x6c79f6 <get_memory_info+64>:       xor    $0x1,%eax
(gdb) x/1i 0x006c79f6-5
   0x6c79f1 <get_memory_info+59>:       callq  0x6c7a6e <query_memory_ex_from_os>
(gdb) x/1i get_memory_info
   0x6c79b6 <get_memory_info>:  push   %rbp
(gdb) x/1i 0x006cf8ad
   0x6cf8ad <copy_frame_to_stack+352>:  mov    %al,-0x29(%rbp)
(gdb) x/1i 0x006dc213
   0x6dc213 <sig_detach+254>:   mov    -0x10(%rbp),%rax

It cannot be two detaches: it would have asserted.

Theory: if the synch fails due to a bad xl8, it will try again, and the
resume in between will then check doing_detach and call sig_detach. The
bug is using the master flag that is set up front: it has to be a local
flag set only after the synch suspends are finished.
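
The gist of that fix, as a standalone sketch (doing_detach is real;
synch_suspends_finished and the stub are made-up names):

  #include <stdbool.h>
  #include <stdio.h>

  static bool doing_detach;            /* master flag, set at detach entry */
  static bool synch_suspends_finished; /* hypothetical per-detach flag */

  static void sig_detach_stub(void) { puts("sig_detach"); }

  /* Resume path in the suspend-signal handler, sketched. */
  static void
  on_thread_resume(void)
  {
      /* Buggy version: checking doing_detach alone fires even for the
       * resume that merely retries a failed synch attempt (bad xl8),
       * detaching the thread while the detacher is still synching. */
      if (doing_detach && synch_suspends_finished)
          sig_detach_stub(); /* fixed: only after the synchall completes */
  }

  int
  main(void)
  {
      doing_detach = true; /* detach has begun */
      on_thread_resume();  /* retry-resume during synch: must not detach */
      synch_suspends_finished = true;
      on_thread_resume();  /* resume after a successful synchall: detaches */
      return 0;
  }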

@derekbruening

After the fix:

# for ((i=0; i<10000; i++)); do suite/tests/bin/api.static_signal; done > /tmp/x 2>&1; grep -c 'all done' /tmp/x
10000
