
OpenCL hot loop (100% one thread) when using two command queues with profiling #186

Open
preda opened this issue Mar 16, 2024 · 9 comments

preda (Contributor) commented Mar 16, 2024

On Ubuntu 22.04, kernel 6.7.9, ROCm 6.1.0 (RC), Radeon Pro VII.

In brief: when creating a second command queue (one that is never used at all), one thread starts eating 100% CPU, i.e. busy-waiting. The performance of the other command queue is impacted as well.

Below, I compare the "normal" situation observed when using a single command queue against what is observed when a second command queue is created (the hot loop).
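
A minimal sketch of the two configurations being compared (not the verbatim gpuowl code; error handling is elided, and context/device are assumed to be already created):

#include <CL/cl.h>

// Sketch only: compares the "normal" one-queue setup with the hot-loop
// two-queue setup. Both queues are in-order with profiling enabled.
static void makeQueues(cl_context context, cl_device_id device, bool secondQueue) {
  cl_queue_properties props[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
  cl_int err = CL_SUCCESS;
  // "Normal" case: a single in-order queue with profiling enabled.
  cl_command_queue q1 = clCreateCommandQueueWithProperties(context, device, props, &err);
  (void) q1;
  if (secondQueue) {
    // Hot-loop case: one more queue with the same properties, never used.
    cl_command_queue q2 = clCreateCommandQueueWithProperties(context, device, props, &err);
    (void) q2;
  }
}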

Using a plain single-threaded OpenCL app with a single host command queue (in-order, with profiling enabled), the queue accessed only from the main thread, this is the thread layout that I see (the "normal" case):

(gdb) i th
  Id   Target Id                                      Frame 
* 1    Thread 0x7ffff7c5a080 (LWP 142088) "prpll-amd" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7fffffffd450, op=393, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:57
  2    Thread 0x7fffee200640 (LWP 142091) "prpll-amd" __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
  5    Thread 0x7ffff797d640 (LWP 142094) "prpll-amd" __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7fffffffd450, op=393, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x7fffffffd450, clockid=0, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555555691068, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7fffffffd450, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007ffff769c9b0 in do_futex_wait (sem=sem@entry=0x555555691068, abstime=abstime@entry=0x7fffffffd450, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007ffff769ca53 in __new_sem_wait_slow64 (sem=0x555555691068, abstime=0x7fffffffd450, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x00007ffff7ecadd1 in ?? () from /opt/rocm/lib/libamdocl64.so
#6  0x00007ffff7ecab2e in ?? () from /opt/rocm/lib/libamdocl64.so
#7  0x00007ffff7eb2298 in ?? () from /opt/rocm/lib/libamdocl64.so
#8  0x00007ffff7eb64c2 in ?? () from /opt/rocm/lib/libamdocl64.so
#9  0x00007ffff7e6c1f9 in clFinish () from /opt/rocm/lib/libamdocl64.so
#10 0x00005555555b4cb3 in finish (q=0x555555940260) at src/clwrap.cpp:296
#11 0x00005555555669ef in Queue::finish (this=0x5555556a2660) at src/Queue.cpp:76
#12 0x000055555559fec6 in Gpu::isPrimePRP (this=0x555555940750, args=..., task=...) at src/Gpu.cpp:1145
#13 0x00005555555bbde3 in Task::execute (this=0x7fffffffd9b0, context=..., args=...) at src/Task.cpp:163
#14 0x00005555555935d2 in threadWorker (args=..., context=..., instance=1) at src/main.cpp:25
#15 0x0000555555593b70 in main (argc=7, argv=0x7fffffffdec8) at src/main.cpp:78
(gdb) t 2
[Switching to thread 2 (Thread 0x7fffee200640 (LWP 142091))]
#0  __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
36	../sysdeps/unix/sysv/linux/ioctl.c: No such file or directory.
(gdb) bt
#0  __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x00007fffee91a560 in kmtIoctl (fd=4, request=request@entry=3222817548, arg=arg@entry=0x7fffee1f8220) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/libhsakmt.c:13
#2  0x00007fffee913330 in hsaKmtWaitOnMultipleEvents_Ext (event_age=0x7fffee1f82d0, Milliseconds=4294967294, WaitOnAll=false, NumEvents=<optimized out>, Events=0x7fffee1f8350) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:409
#3  hsaKmtWaitOnMultipleEvents_Ext (Events=0x7fffee1f8350, NumEvents=4, WaitOnAll=<optimized out>, Milliseconds=4294967294, event_age=0x7fffee1f82d0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:380
#4  0x00007fffee89c70c in rocr::core::Signal::WaitAny (signal_count=signal_count@entry=9, hsa_signals=hsa_signals@entry=0x7fffe8001150, conds=conds@entry=0x7fffe8000cd0, values=values@entry=0x7fffe80011e0, timeout=timeout@entry=18446744073709551615, wait_hint=<optimized out>, wait_hint@entry=HSA_WAIT_STATE_BLOCKED, satisfying_value=<optimized out>)
    at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/signal.cpp:324
#5  0x00007fffee8613f6 in rocr::AMD::hsa_amd_signal_wait_any (signal_count=9, hsa_signals=0x7fffe8001150, conds=0x7fffe8000cd0, values=0x7fffe80011e0, timeout_hint=timeout_hint@entry=18446744073709551615, wait_hint=wait_hint@entry=HSA_WAIT_STATE_BLOCKED, satisfying_value=0x7fffee1f8488)
    at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa_ext_amd.cpp:587
#6  0x00007fffee879aca in rocr::core::Runtime::AsyncEventsLoop () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:1473
#7  0x00007fffee82a85b in rocr::os::ThreadTrampoline (arg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/lnx/os_linux.cpp:81
#8  0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#9  0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) t 5
[Switching to thread 5 (Thread 0x7ffff797d640 (LWP 142094))]
#0  __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
36	in ../sysdeps/unix/sysv/linux/ioctl.c
(gdb) bt
#0  __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x00007fffee91a560 in kmtIoctl (fd=4, request=request@entry=3222817548, arg=arg@entry=0x7ffff79751c0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/libhsakmt.c:13
#2  0x00007fffee913330 in hsaKmtWaitOnMultipleEvents_Ext (event_age=0x7ffff79752b8, Milliseconds=4294967294, WaitOnAll=true, NumEvents=<optimized out>, Events=0x7ffff7975278) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:409
#3  hsaKmtWaitOnMultipleEvents_Ext (Events=0x7ffff7975278, NumEvents=1, WaitOnAll=<optimized out>, Milliseconds=4294967294, event_age=0x7ffff79752b8) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:380
#4  0x00007fffee913adb in hsaKmtWaitOnEvent_Ext (Event=<optimized out>, Milliseconds=<optimized out>, event_age=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:226
#5  0x00007fffee8655d0 in rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9e027bd40, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:243
#6  0x00007fffee8652ae in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>, compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:251
#7  0x00007fffee859ef9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa.cpp:1220
#8  0x00007ffff7eea454 in ?? () from /opt/rocm/lib/libamdocl64.so
#9  0x00007ffff7eeb9fe in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eefef8 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#12 0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#13 0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#14 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#15 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#16 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

When adding a second command queue (but without using it at all), the thread layout becomes:

(gdb) i th
  Id   Target Id                                      Frame 
* 1    Thread 0x7ffff7c5a080 (LWP 141830) "prpll-amd" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7fffffffd440, op=393, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:57
  2    Thread 0x7fffee200640 (LWP 141833) "prpll-amd" __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
  5    Thread 0x7ffff797d640 (LWP 141836) "prpll-amd" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7ffff7975330, op=393, expected=0, futex_word=0x5555559406a8) at ./nptl/futex-internal.c:57
  6    Thread 0x7ffff790f640 (LWP 141837) "prpll-amd" 0x00007fffee865451 in rocr::__rdtsc () at /usr/lib/gcc/x86_64-linux-gnu/11/include/ia32intrin.h:114
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7fffffffd440, op=393, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x7fffffffd440, clockid=0, expected=0, futex_word=0x555555691068) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555555691068, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7fffffffd440, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007ffff769c9b0 in do_futex_wait (sem=sem@entry=0x555555691068, abstime=abstime@entry=0x7fffffffd440, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007ffff769ca53 in __new_sem_wait_slow64 (sem=0x555555691068, abstime=0x7fffffffd440, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x00007ffff7ecadd1 in ?? () from /opt/rocm/lib/libamdocl64.so
#6  0x00007ffff7ecab2e in ?? () from /opt/rocm/lib/libamdocl64.so
#7  0x00007ffff7eb2298 in ?? () from /opt/rocm/lib/libamdocl64.so
#8  0x00007ffff7eb64c2 in ?? () from /opt/rocm/lib/libamdocl64.so
#9  0x00007ffff7e6c1f9 in clFinish () from /opt/rocm/lib/libamdocl64.so
#10 0x00005555555b4cff in finish (q=0x5555556a26e0) at src/clwrap.cpp:296
#11 0x00005555555669ef in Queue::finish (this=0x555555944ed0) at src/Queue.cpp:76
#12 0x000055555559ff12 in Gpu::isPrimePRP (this=0x555555940b10, args=..., task=...) at src/Gpu.cpp:1145
#13 0x00005555555bbe2f in Task::execute (this=0x7fffffffd9a0, context=..., args=...) at src/Task.cpp:163
#14 0x00005555555935d2 in threadWorker (args=..., context=..., instance=1) at src/main.cpp:25
#15 0x0000555555593b92 in main (argc=7, argv=0x7fffffffdec8) at src/main.cpp:78
(gdb) t 2
[Switching to thread 2 (Thread 0x7fffee200640 (LWP 141833))]
#0  __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
36	../sysdeps/unix/sysv/linux/ioctl.c: No such file or directory.
(gdb) bt
#0  __GI___ioctl (fd=fd@entry=4, request=request@entry=3222817548) at ../sysdeps/unix/sysv/linux/ioctl.c:36
#1  0x00007fffee91a560 in kmtIoctl (fd=4, request=request@entry=3222817548, arg=arg@entry=0x7fffee1f8220) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/libhsakmt.c:13
#2  0x00007fffee913330 in hsaKmtWaitOnMultipleEvents_Ext (event_age=0x7fffee1f82d0, Milliseconds=4294967294, WaitOnAll=false, NumEvents=<optimized out>, Events=0x7fffe80015d0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:409
#3  hsaKmtWaitOnMultipleEvents_Ext (Events=0x7fffe80015d0, NumEvents=4, WaitOnAll=<optimized out>, Milliseconds=4294967294, event_age=0x7fffee1f82d0) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/libhsakmt/src/events.c:380
#4  0x00007fffee89c70c in rocr::core::Signal::WaitAny (signal_count=signal_count@entry=11, hsa_signals=hsa_signals@entry=0x7fffe80011f0, conds=conds@entry=0x7fffe8000d70, values=values@entry=0x7fffe8001280, timeout=timeout@entry=18446744073709551615, wait_hint=<optimized out>, wait_hint@entry=HSA_WAIT_STATE_BLOCKED, satisfying_value=<optimized out>)
    at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/signal.cpp:324
#5  0x00007fffee8613f6 in rocr::AMD::hsa_amd_signal_wait_any (signal_count=11, hsa_signals=0x7fffe80011f0, conds=0x7fffe8000d70, values=0x7fffe8001280, timeout_hint=timeout_hint@entry=18446744073709551615, wait_hint=wait_hint@entry=HSA_WAIT_STATE_BLOCKED, satisfying_value=0x7fffee1f8488)
    at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa_ext_amd.cpp:587
#6  0x00007fffee879aca in rocr::core::Runtime::AsyncEventsLoop () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/runtime.cpp:1473
#7  0x00007fffee82a85b in rocr::os::ThreadTrampoline (arg=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/lnx/os_linux.cpp:81
#8  0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#9  0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) t 3
Unknown thread 3.
(gdb) t 5
[Switching to thread 5 (Thread 0x7ffff797d640 (LWP 141836))]
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7ffff7975330, op=393, expected=0, futex_word=0x5555559406a8) at ./nptl/futex-internal.c:57
57	./nptl/futex-internal.c: No such file or directory.
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x7ffff7975330, op=393, expected=0, futex_word=0x5555559406a8) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x7ffff7975330, clockid=0, expected=0, futex_word=0x5555559406a8) at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5555559406a8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7ffff7975330, private=<optimized out>) at ./nptl/futex-internal.c:139
#3  0x00007ffff769c9b0 in do_futex_wait (sem=sem@entry=0x5555559406a8, abstime=abstime@entry=0x7ffff7975330, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007ffff769ca53 in __new_sem_wait_slow64 (sem=0x5555559406a8, abstime=0x7ffff7975330, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x00007ffff7ecadd1 in ?? () from /opt/rocm/lib/libamdocl64.so
#6  0x00007ffff7ecab2e in ?? () from /opt/rocm/lib/libamdocl64.so
#7  0x00007ffff7eb5d58 in ?? () from /opt/rocm/lib/libamdocl64.so
#8  0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#9  0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) t 6
[Switching to thread 6 (Thread 0x7ffff790f640 (LWP 141837))]
#0  0x00007fffee865451 in rocr::__rdtsc () at /usr/lib/gcc/x86_64-linux-gnu/11/include/ia32intrin.h:114
114	  return __builtin_ia32_rdtsc ();
(gdb) bt
#0  0x00007fffee865451 in rocr::__rdtsc () at /usr/lib/gcc/x86_64-linux-gnu/11/include/ia32intrin.h:114
#1  rocr::timer::fast_clock::raw_now () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/timer.h:149
#2  rocr::timer::fast_clock::now () at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/timer.h:140
#3  rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9d803bbe0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:211
#4  0x00007fffee8652ae in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>, compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:251
#5  0x00007fffee859ef9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa.cpp:1220
#6  0x00007ffff7eea454 in ?? () from /opt/rocm/lib/libamdocl64.so
#7  0x00007ffff7eeb9fe in ?? () from /opt/rocm/lib/libamdocl64.so
#8  0x00007ffff7eefef8 in ?? () from /opt/rocm/lib/libamdocl64.so
#9  0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#12 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#13 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#14 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

The problem is created by the last thread above (6), which is eating 100% CPU, caught in a hot loop. Here are a few other points sampled in this loop:

#0  0x00007fffee8655bf in rocr::timer::duration_cast<std::chrono::duration<long, std::ratio<1, 1000> >, double, std::ratio<1, 1000000000000> > (d=...) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/util/timer.h:78
#1  rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9d80ac890, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:240
#2  0x00007fffee8652ae in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>, compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:251
#3  0x00007fffee859ef9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa.cpp:1220
#4  0x00007ffff7eea454 in ?? () from /opt/rocm/lib/libamdocl64.so
#5  0x00007ffff7eeb9fe in ?? () from /opt/rocm/lib/libamdocl64.so
#6  0x00007ffff7eefef8 in ?? () from /opt/rocm/lib/libamdocl64.so
#7  0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#8  0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#9  0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

#0  rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9d801adf0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=<optimized out>, wait_hint=HSA_WAIT_STATE_BLOCKED) at /usr/include/c++/11/chrono:521
#1  0x00007fffee8652ae in rocr::core::InterruptSignal::WaitAcquire (this=<optimized out>, condition=<optimized out>, compare_value=<optimized out>, timeout=<optimized out>, wait_hint=<optimized out>) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/interrupt_signal.cpp:251
#2  0x00007fffee859ef9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /long_pathname_so_that_rpms_can_package_the_debug_info/src/hsa/runtime/opensrc/hsa-runtime/core/runtime/hsa.cpp:1220
#3  0x00007ffff7eea454 in ?? () from /opt/rocm/lib/libamdocl64.so
#4  0x00007ffff7eeb9fe in ?? () from /opt/rocm/lib/libamdocl64.so
#5  0x00007ffff7eefef8 in ?? () from /opt/rocm/lib/libamdocl64.so
#6  0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#7  0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#8  0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#9  0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#11 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

All the command queues are in-order, with profiling enabled.
The situation is pretty severe: it basically precludes using more than one command queue.

preda (Contributor, Author) commented Mar 16, 2024

To reproduce, check out:
https://github.com/preda/gpuowl/tree/a20d77aa5d06832942d0fec59c690c739a4e7098

(e.g. by cloning the gpuowl project and checking out the hash above)

Build with "make" in the source dir,
run with

./build-debug/prpll -d 0 -prp 118063003 -verbose

And observe the CPU usage of the process. (change "-d 0" to select a different GPU).

Now rebuild after enabling the second queue: in main.cpp, look for #if ENABLE_SECOND_QUEUE and either define that symbol or change the condition to true (the sketch below shows the shape of the guarded code).
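
For clarity, the guarded code has roughly this shape (a hypothetical sketch, not the verbatim main.cpp; context, device, props, and err stand for the project's own setup):

#if ENABLE_SECOND_QUEUE
  // Same in-order + profiling properties as the main queue; the handle is
  // created and then never used.
  cl_command_queue secondQueue = clCreateCommandQueueWithProperties(context, device, props, &err);
  (void) secondQueue;
#endif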

Run the new build (with the second queue enabled) and observe the difference in CPU usage and in performance (in my case, OpenCL is about 30% slower with the second queue enabled).

Note that the second queue is not actually used at all.

preda (Contributor, Author) commented Mar 20, 2024

For the repro above, you also need to create a file "work-1.txt" in the work dir, like this:
echo PRP=118845473 > work-1.txt

preda (Contributor, Author) commented Mar 23, 2024

I may have found a clue. In the "hot" thread (the one that is continuously busy at 100%), at this location:

hsaKmtWaitOnEvent_Ext(event_, wait_ms, &event_age);

the member event_ is always null, which means that the call to

hsaKmtWaitOnEvent_Ext(event_, wait_ms, &event_age);

returns immediately, which makes the loop inside WaitRelaxed() hot.

In contrast, in the other, similar thread, event_ isn't null, and the loop blocks inside hsaKmtWaitOnEvent_Ext(), so that thread isn't hot.
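
For concreteness, the wait loop has roughly this shape (paraphrased from ROCR-Runtime's interrupt_signal.cpp; the helper names are illustrative, not the exact source):

// Paraphrase of rocr::core::InterruptSignal::WaitRelaxed(), simplified.
while (true) {
  hsa_signal_value_t value = LoadRelaxed();
  if (ConditionMet(condition, value, compare_value)) return value;
  if (Elapsed() > timeout) return value;  // give up on timeout
  // With a valid event_ this call blocks in the kernel until the interrupt
  // fires. With event_ == nullptr (the hot thread's state) it returns
  // immediately, so the loop spins and pins one core at 100%.
  hsaKmtWaitOnEvent_Ext(event_, wait_ms, &event_age);
}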

So: why does creating a second queue produce this adverse situation?

If confirmed, this seems like a rather serious bug, as it precludes (in practice) the use of a second queue. This, combined with queues not supporting out-of-order execution (as per ROCm/clr#67), does not leave any alternatives.

preda (Contributor, Author) commented Mar 23, 2024

In the good thread:

(gdb) bt
#0  rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9e0020cc0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:243
#1  0x00007fffee8af824 in rocr::core::InterruptSignal::WaitAcquire (this=0x7ff9e0020cc0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:251
#2  0x00007fffee8976e9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/hsa.cpp:1220
#3  0x00007fffee907c2d in hsa_signal_wait_scacquire (signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_expectancy_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/common/hsa_table_interface.cpp:341
#4  0x00007ffff7eea4b4 in ?? () from /opt/rocm/lib/libamdocl64.so
#5  0x00007ffff7eeba5e in ?? () from /opt/rocm/lib/libamdocl64.so
#6  0x00007ffff7eeff58 in ?? () from /opt/rocm/lib/libamdocl64.so
#7  0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#8  0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#9  0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

(gdb) p this->event_
$27 = (HsaEvent *) 0x7ff9e0020d10
(gdb) p *this->event_
$28 = {EventId = 929, EventData = {EventType = HSA_EVENTTYPE_SIGNAL, EventData = {SyncVar = {SyncVar = {UserData = 0x0, UserDataPtrValue = 0}, SyncVarSize = 8}, NodeChangeState = {Flags = HSA_EVENTTYPE_NODECHANGE_ADD}, DeviceState = {NodeId = 0, Device = HSA_DEVICE_CPU, Flags = 8}, MemoryAccessFault = {NodeId = 0, VirtualAddress = 34359738368, Failure = {NotPresent = 0, 
          ReadOnly = 0, NoExecute = 0, GpuAccess = 0, ECC = 0, Imprecise = 0, ErrorType = 0, Reserved = 0}, Flags = HSA_EVENTID_MEMORY_RECOVERABLE}, HwException = {NodeId = 0, ResetType = 0, MemoryLost = 8, ResetCause = HSA_EVENTID_HW_EXCEPTION_GPU_HANG}}, HWData1 = 929, HWData2 = 140737353751816, HWData3 = 929}}

And in the hot thread:

(gdb) bt
#0  rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9d00175d0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:243
#1  0x00007fffee8af824 in rocr::core::InterruptSignal::WaitAcquire (this=0x7ff9d00175d0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:251
#2  0x00007fffee8976e9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/hsa.cpp:1220
#3  0x00007fffee907c2d in hsa_signal_wait_scacquire (signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_expectancy_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/common/hsa_table_interface.cpp:341
#4  0x00007ffff7eea4b4 in ?? () from /opt/rocm/lib/libamdocl64.so
#5  0x00007ffff7eeba5e in ?? () from /opt/rocm/lib/libamdocl64.so
#6  0x00007ffff7eeff58 in ?? () from /opt/rocm/lib/libamdocl64.so
#7  0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#8  0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#9  0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) p this->event_
$29 = (HsaEvent *) 0x0

shwetagkhatri self-assigned this Mar 24, 2024
shwetagkhatri (Member) commented:

Thank you, preda, for reporting this hot-thread issue with the handling of a second command queue in ROCm, and for your detailed analysis. We've assigned this issue for immediate triage.

preda added a commit to preda/clr that referenced this issue Mar 25, 2024
This reverts commit 94c7004.

The reason for the revert is: the original change introduces a bug in OpenCL
when more than one command queue is created:
ROCm/ROCR-Runtime#186

Also, the original commit provides no reason for the change.

In addition, 4096 is a particularly bad value (too large), as by default
the kernel can only provide 4094 events for DGPUs.
preda added a commit to preda/clr that referenced this issue Mar 25, 2024
The change is two-fold:

- add a flag to configure the pool size used when profiling;
this allows configuring the two values (profiling vs. non-profiling) separately.

- lower the size of the profiling pool from 4096, which was too large:
the kernel only provides 4094 events for DGPUs, *and* using two command queues
in OpenCL results in the bug described here:
ROCm/ROCR-Runtime#186

The rationale for the new default of 1000: it allows up to 4 queues to fit
within the 4094 events provided by the kernel (with a little margin).
shwetagkhatri (Member) commented:

@preda, as you have pointed out in preda/clr@0385836, it's true that the kernel driver can only support a maximum of 4096 interrupt signals per device. The failure to allocate any further interrupt events leads to this hot loop.
This issue is going to be addressed by ROCm/clr#71 (comment).

preda (Contributor, Author) commented Mar 29, 2024

@shwetagkhatri while indeed what exposed the problem was CLR allocating too many events to its per-queue pool (and the fix there is welcome), the question remains what the right behavior is when the kernel's interrupt signals are exhausted. That situation may still happen even after the fix.

As far as this issue is concerned, feel free to close it as soon as the fix makes it into a public release.

preda (Contributor, Author) commented Apr 1, 2024

In fact I think the hot-loop problem is not fully fixed by CLR reducing its pool size; I found a different scenario that reproduces it even after the CLR fix. Here it is:

  • create the first queue (Q1)
  • enqueue a large number of kernels on Q1, say 10'000, each launched with a completion event
  • don't consume the completion events early (enqueue all 10'000 kernels first)
  • finish() on the first queue, and release all the completion events
  • create the second queue (Q2)

At this point we have the hot thread again, on the second queue. Let me explain why: when many kernels are launched with completion events, InterruptSignal::EventPool::alloc() allocates all available HW events. What's more, when these events are released by the client, they are not released back to the kernel but are kept in the EventPool's cache ("events_").
So even after the client releases them, by the time Q2 is created there are no HW events available, and the thread becomes hot as in the earlier scenario.
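
In OpenCL calls, the scenario looks roughly like this (a sketch; q1, kernel, globalSize, context, device, props, and err are assumed from earlier setup, and error handling is elided):

#include <vector>

std::vector<cl_event> events(10000);
for (size_t i = 0; i < events.size(); ++i) {
  // Each launch requests a completion event; under profiling, each one pins
  // a kernel interrupt event via InterruptSignal::EventPool::alloc().
  clEnqueueNDRangeKernel(q1, kernel, 1, nullptr, &globalSize, nullptr,
                         0, nullptr, &events[i]);
}
clFinish(q1);
// Released back to the EventPool cache ("events_"), NOT back to the kernel:
for (cl_event e : events) clReleaseEvent(e);
// The kernel's ~4094 interrupt events are all held by the process-side cache,
// so signals created for Q2 get event_ == nullptr and their waits go hot.
cl_command_queue q2 = clCreateCommandQueueWithProperties(context, device, props, &err);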

So the problem is not just the CLR pool: there are other ways to exhaust the kernel's HW events, and the queue can't function acceptably once that happens.

I think there are two things to do:

  1. The queue should error out loudly if it can't allocate at least one event, instead of silently trying to function without an event (see the sketch after this list).
  2. Secondarily, InterruptSignal::EventPool should also release events back to the kernel in some situations, to prevent kernel event exhaustion even when the client does release all the events.
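
To illustrate point 1, in the spirit of InterruptSignal::EventPool::alloc() (pseudocode; the helper names and failure path are illustrative, not the actual ROCR-Runtime code):

#include <stdexcept>

// Pseudocode for proposal 1: fail loudly when no HW interrupt event can be
// obtained, instead of returning null and letting every wait on the new
// signal degrade into a silent busy-wait.
HsaEvent* EventPool::alloc() {
  if (!events_.empty()) return takeFromCache();   // reuse a cached event
  HsaEvent* ev = createKernelInterruptEvent();    // fails once ~4094 are gone
  if (ev == nullptr) {
    throw std::runtime_error("out of kernel interrupt events");
  }
  return ev;
}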

saleelk (Contributor) commented Apr 3, 2024

Yes, with enough signal usage, or when profiling, we may exhaust the number of interrupt signals, because the runtime only creates interrupt signals. I have a change in mind to mitigate this, but it's a bit more intricate. The idea is to use interrupt signals only in the cases where we need to wait, which is usually the last kernel/barrier in the batch,
for example k0, k1, b1, k2, copy1, k3, ... kN, bN. Only the last item we queue (bN) needs to have an interrupt signal, because that's what the async handler thread will wait on. Copies and the rest of the kernels in the batch don't need interrupt signals.
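
A sketch of that idea (illustrative pseudocode, not an actual patch):

// Only the packet the async handler will actually wait on gets one of the
// scarce interrupt signals; everything else gets a plain default signal
// that is polled in memory and never blocked on.
for (size_t i = 0; i < batch.size(); ++i) {
  bool last = (i + 1 == batch.size());
  batch[i].completion = last ? allocInterruptSignal()  // e.g. bN: waited on
                             : allocDefaultSignal();   // k0..kN, copies
  submit(batch[i]);
}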
