OpenCL hot loop (100% one thread) when using two command queues with profiling #186
To repro you may check out the hash above (e.g. by cloning the gpuowl project and checking out that hash). Build with "make" in the source dir and observe the CPU usage of the process (change "-d 0" to select a different GPU). Now do a new build after enabling the second queue: in main.cpp, look for #if ENABLE_SECOND_QUEUE and either define that symbol or change the condition to true. Run the new build (with the second queue enabled) and observe the difference in CPU usage and in performance (I observe OpenCL being about 30% slower with the second queue enabled). Note that the second queue is not actually used at all.
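For concreteness, here is a minimal sketch of the kind of guarded second-queue setup the steps above toggle (illustrative host code only, not gpuowl's actual main.cpp; everything besides the ENABLE_SECOND_QUEUE symbol is made up):

```cpp
#include <CL/cl.h>

#define ENABLE_SECOND_QUEUE 1  // repro step: define this (or flip the #if) and rebuild

// Hypothetical helper: in-order queue (no out-of-order flag) with profiling enabled.
static cl_command_queue makeProfilingQueue(cl_context ctx, cl_device_id dev, cl_int* err) {
  cl_queue_properties props[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
  return clCreateCommandQueueWithProperties(ctx, dev, props, err);
}

void setupQueues(cl_context ctx, cl_device_id dev) {
  cl_int err = CL_SUCCESS;
  cl_command_queue mainQueue = makeProfilingQueue(ctx, dev, &err);    // does all the work
#if ENABLE_SECOND_QUEUE
  cl_command_queue secondQueue = makeProfilingQueue(ctx, dev, &err);  // never used, yet its
  (void)secondQueue;                                                  // existence triggers the hot thread
#endif
  (void)mainQueue;
}
```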
For the repro above, it is also necessary to create a file "work-1.txt" in the work dir, like this:
I may have found a clue. In the "hot" thread (the one that is continuously busy at 100%), at this location:
the member event_ is always null, which has the effect that the call to
returns immediately, which makes the loop inside WaitRelaxed() hot. In contrast, in the other similar thread event_ isn't null, and the loop blocks inside hsaKmtWaitOnEvent_Ext(), which keeps it from spinning. So: why does creating a second queue produce this adverse situation? If confirmed, this seems like a rather serious bug, as it precludes (in practice) the use of a second queue. Combined with the queue not supporting out-of-order execution (as per ROCm/clr#67), this does not leave any alternatives.
In the good thread:
And in the hot thread:
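To make the difference between the two threads concrete, here is a small conceptual sketch in plain C++ (illustrative only: it is neither the backtraces referenced above nor the actual ROCR-Runtime code). When there is an OS event to block on, the waiter sleeps; when there is none, the wait call returns immediately and the surrounding loop spins.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Conceptual illustration: a "good" waiter blocks on an OS-level event
// (analogous to sleeping inside hsaKmtWaitOnEvent_Ext()), while a "hot"
// waiter has nothing to block on and spins, burning a full core.
struct SignalWatcher {
  std::atomic<bool> done{false};
  std::mutex m;
  std::condition_variable cv;

  // Good thread: an event is available, so the wait sleeps until notified.
  void waitBlocking() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return done.load(); });
  }

  // Hot thread: with no event to block on, the wait returns immediately, so
  // the WaitRelaxed-style loop around it degenerates into a busy-wait.
  void waitSpinning() {
    while (!done.load(std::memory_order_relaxed)) {
      // nothing to block on -> 100% CPU
    }
  }
};
```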
Thank you for reporting this hot-thread issue with second-command-queue handling in ROCm, and for your detailed analysis, preda. We've assigned this issue for immediate triage.
This reverts commit 94c7004. The reason for the revert is that the original change introduces a bug in OpenCL when more than one command queue is created: ROCm/ROCR-Runtime#186. Also, the original commit provides no reason for the change. In addition, 4096 is a particularly bad value (too large), as by default the kernel can only provide 4094 events for DGPUs.
The change is two-fold:
- add a flag to configure the pool size used when profiling; this allows the two values (profiling vs. non-profiling) to be configured separately.
- reduce the size of the profiling pool from 4096, which was too large: the kernel only provides 4094 events for DGPUs, *and* using two command queues in OpenCL results in the bug described here: ROCm/ROCR-Runtime#186.
The new default of 1000 has this rationale: it allows up to 4 queues to fit within the 4094 events provided by the kernel (with a little margin).
@preda, as you have pointed out here (preda/clr@0385836), it's true that the kernel driver can only support a maximum of 4096 interrupt signals per device. The failure to allocate any further interrupt events leads to this hot loop.
@shwetagkhatri while what exposed the problem was indeed CLR allocating too many events to its per-queue pool (and the fix there is welcome), the question remains what the right behavior is when the supply of kernel interrupt signals is exhausted -- a situation that may still happen even after that fix. As far as this issue is concerned, feel free to close it as soon as the fix makes it into a public release.
In fact I think the problem with the "hot queue" is not fully fixed by CLR reducing its pool size; I found a different scenario that reproduces it even after the CLR fix. Here it is:
At this point we have the hot thread again on the second queue. Let me explain why: when many kernels are launched with completion events, InterruptSignal::EventPool::alloc() ends up allocating all available HW events. What's more, as these events are released by the client, they are not released back to the kernel but are kept in the EventPool's cache ("events_"). So the problem is not just the CLR pool: there are other ways to exhaust the kernel HW events, and the Queue cannot function acceptably once that happens. I think there are two things to do:
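To illustrate the caching behavior described above, here is a conceptual sketch (it is not the actual InterruptSignal::EventPool code; the kernel-allocation helper and the 4094 cap are stand-ins used for illustration):

```cpp
#include <optional>
#include <vector>

using HwEvent = int;

// Stand-in for the kernel driver's interrupt-event allocation; the real driver
// can only hand out a limited number of interrupt events per device (~4094 for DGPUs).
static int g_kernelEventsLeft = 4094;
static std::optional<HwEvent> kfdAllocEvent() {
  if (g_kernelEventsLeft == 0) return std::nullopt;  // kernel pool exhausted
  return HwEvent{--g_kernelEventsLeft};
}

// Conceptual sketch: events released by the client go into a user-space cache
// ("events_") instead of back to the kernel, so once one client has cycled
// through enough events, other queues can no longer obtain an interrupt event.
class EventPool {
  std::vector<HwEvent> events_;  // cached events, never returned to the kernel
public:
  std::optional<HwEvent> alloc() {
    if (!events_.empty()) {      // reuse a cached event if possible
      HwEvent e = events_.back();
      events_.pop_back();
      return e;
    }
    return kfdAllocEvent();      // otherwise take a fresh one from the kernel
  }
  void free(HwEvent e) { events_.push_back(e); }  // cache it; kernel pool stays depleted
};
```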
Yes, with enough signal usage, or when profiling, we may exhaust the number of interrupt signals, because the runtime only creates interrupt signals. I have a change in mind to mitigate this, but it's a bit more intricate. The idea is to only use interrupt signals for the cases where we need to wait, which is usually the last kernel/barrier in the batch.
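The same principle can be sketched at the OpenCL host-API level (this is only an analogy of the idea, not the runtime-internal change being described): on an in-order queue, request a completion event for the last kernel of a batch only and wait on that one.

```cpp
#include <CL/cl.h>
#include <vector>

// Sketch: enqueue a batch of kernels on an in-order queue, asking for a
// completion event only on the last one; in-order semantics guarantee the
// whole batch has finished when the last kernel has.
cl_int runBatch(cl_command_queue q, const std::vector<cl_kernel>& batch, size_t globalSize) {
  if (batch.empty()) return CL_SUCCESS;
  cl_event last = nullptr;
  for (size_t i = 0; i < batch.size(); ++i) {
    bool isLast = (i + 1 == batch.size());
    cl_int err = clEnqueueNDRangeKernel(q, batch[i], 1, nullptr, &globalSize, nullptr,
                                        0, nullptr, isLast ? &last : nullptr);
    if (err != CL_SUCCESS) return err;
  }
  cl_int err = clWaitForEvents(1, &last);  // block only on the final kernel
  clReleaseEvent(last);
  return err;
}
```

Note that a profiling client typically still requests an event per kernel to read timestamps, which is presumably why profiling workloads consume signals so quickly.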
On Ubuntu 22.04, kernel 6.7.9, ROCm 6.1.0 (RC), Radeon Pro VII.
In brief: when a second command queue is created (even one that is not used at all), one thread starts eating 100% CPU, i.e. busy-waiting. The performance of the other command queue is impacted as well.
Below, I compare the "normal" situation observed when using a single command queue vs. what is observed when creating a second command queue (hot loop).
Using a plain single-threaded OpenCL app with one host command queue (in-order, with profiling enabled), accessed only from the main thread, this is the thread layout that I see ("the normal"):
When adding a second command queue (but without using it at all), the thread layout becomes:
The problem is created by the last thread above (6), which is eating 100% CPU, caught up in a hot loop. Here are a few other points in this loop:
All the command queues are in-order and have profiling enabled.
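For reference, a minimal sketch of how such an in-order, profiling-enabled queue is typically created and how kernel timings are read back (generic OpenCL host code, not gpuowl's exact source):

```cpp
#include <CL/cl.h>

// Create an in-order queue (no CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) with profiling.
cl_command_queue createProfilingQueue(cl_context ctx, cl_device_id dev, cl_int* err) {
  cl_queue_properties props[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
  return clCreateCommandQueueWithProperties(ctx, dev, props, err);
}

// With profiling, each timed launch carries a completion event; the timestamps
// are read back from that event once the kernel has finished.
cl_ulong kernelTimeNs(cl_command_queue q, cl_kernel k, size_t globalSize) {
  cl_event ev = nullptr;
  clEnqueueNDRangeKernel(q, k, 1, nullptr, &globalSize, nullptr, 0, nullptr, &ev);
  clWaitForEvents(1, &ev);
  cl_ulong start = 0, end = 0;
  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(start), &start, nullptr);
  clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(end), &end, nullptr);
  clReleaseEvent(ev);
  return end - start;
}
```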
The situation is pretty severe, basically precluding the possibility of using more than one command queue.