Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for data races when threads are dynamically created #49

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

jklinkenberg
Copy link

@jklinkenberg jklinkenberg commented Jan 7, 2021

Problem Description

I discovered data race issues during one of my audits of the EU project POP2 using Extrae.
The code I used when I ran into those issues was using Intel Thread Building Block (TBB) where threads are sometimes dynamically created.

Looking a bit deeper into Extrae, I noticed that data structures for tracing events are created for each thread which is good. However, the global management data structures that point to those data structures are reallocated when the number of threads changes. This process is not thread-safe.

Reproducer

Attached I also created a small reproducer for the problem.
The reproducer creates a pthread that is then dynamically creating more pthreads whereas the master thread is constantly spaming MPI_IProbes. Thus, there is constant access to tracing data structures conflicting with resizing data structures when the number of threads changes.

extrae-datarace-reproducer.zip

With the current Extrae version this results in the following seg fault:

Caught signal 11 (Segmentation fault: address not mapped to object at address 0x9)
==== backtrace ====
    0  /usr/lib64/libucs.so.0(+0x17970) [0x2abcf957f970]
    1  /usr/lib64/libucs.so.0(+0x17b22) [0x2abcf957fb22]
    2  /lib64/libc.so.6(+0x80419) [0x2abc388c1419]
    3  /lib64/libc.so.6(+0x820a5) [0x2abc388c30a5]
    4  /lib64/libc.so.6(+0x83ce2) [0x2abc388c4ce2]
    5  /lib64/libc.so.6(realloc+0x1d2) [0x2abc388c6d82]
    6  /home/jk869269/install/c_extrae/3.8.3/lib/libptmpitrace-3.8.3.so(realloc+0x3d) [0x2abc3507abbd]
    7  /home/jk869269/install/c_extrae/3.8.3/lib/libptmpitrace-3.8.3.so(Extrae_reallocate_thread_info+0x1e) [0x2abc35073f0e]
    8  /home/jk869269/install/c_extrae/3.8.3/lib/libptmpitrace-3.8.3.so(Backend_ChangeNumberOfThreads+0x3e3) [0x2abc3506be93]
    9  /home/jk869269/install/c_extrae/3.8.3/lib/libptmpitrace-3.8.3.so(pthread_create+0x103) [0x2abc350865f3]
   10  reproducer.exe() [0x4024d5]
   11  /home/jk869269/install/c_extrae/3.8.3/lib/libptmpitrace-3.8.3.so(+0x974be) [0x2abc350864be]
   12  /lib64/libpthread.so.0(+0x7ea5) [0x2abc36036ea5]
   13  /lib64/libc.so.6(clone+0x6d) [0x2abc3893f8dd]
===================

Solution

With some debugging and analysis, I created a fix for the problem by using read write locks whereas possible to keep performance up but get rid of the data races. Not sure whether I found all places.
At least the proposed changes solved the issues I had with the TBB code and the reproducer.

Further I also implemented some changes to use lock/unlock Extrae internal pthread locks directly. Otherwise these would also be instrumented which creates additional overhead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant