I'm finding evidence that PAPI_read calls in the PAPI ROCm component miss results when executed in intercept mode. The sample workflows below highlight what I'm seeing:
Sample 1:
PAPI_start
PAPI_read <- Counter values zero as we’d expect
<kernel launch>
PAPI_read <- Counter values still zero
PAPI_stop <- Meaningful counter values collected
Sample 2:
PAPI_start
PAPI_read <- Counter values zero as we’d expect
<kernel launch A>
PAPI_read <- Counter values still zero
PAPI_read <- Counter values from <kernel launch A> collected
<kernel launch B>
PAPI_read <- No changes in values, still reports values from <kernel launch A>
PAPI_read <- Counter values from <kernel launch A> + <kernel launch B> collected
<kernel launch C>
PAPI_read <- No changes in values, still reports values from <kernel launch A> + <kernel launch B>
PAPI_stop <- Counter values from <kernel launch A> + <kernel launch B> + <kernel launch C> collected
Consulting with @gcongiu, this may be expected behavior as:
In intercept mode, PAPI_read(s) that happen before a kernel has finished running and/or before rocprofiler has fetched the kernel counters return whatever value was present until that point in the eventset counters (the component does not synchronize the GPU stream internally like old cuda component used to do). Otherwise, it reads the new counters (get_context_counters).
In your example above the behavior looks consistent with the ROCm component's code. If you wish to read counters for a kernel, in intercept mode, you should synchronize the stream first to make sure the kernel has finished running and the counters are collected.
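To make the read semantics described above concrete, here is a toy model in C. It is only a sketch of the behavior as described in this thread, not the ROCm component's code: `toy_ctx_t`, `toy_complete_dispatch`, and `toy_read` are hypothetical names, and the model assumes that a read folds in new values only once a completed dispatch has been recorded.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NUM_COUNTERS 2

/* Hypothetical stand-in for the eventset state; NOT PAPI's rocp_ctx_t. */
typedef struct {
    long long counters[TOY_NUM_COUNTERS]; /* values last handed back to the caller */
    long long pending[TOY_NUM_COUNTERS];  /* values from recorded, not yet read dispatches */
    int dispatch_count;                   /* recorded dispatches not yet folded in */
} toy_ctx_t;

/* Models the profiler recording that one kernel's counters are available
 * (assumption: this is what makes dispatch_count become non-zero). */
void toy_complete_dispatch(toy_ctx_t *ctx, const long long *values)
{
    assert(ctx != NULL && values != NULL);
    for (size_t i = 0; i < TOY_NUM_COUNTERS; ++i)
        ctx->pending[i] += values[i];
    ctx->dispatch_count++;
}

/* Models a read: with no recorded dispatch it returns the cached values
 * unchanged; otherwise it accumulates the pending values first. */
const long long *toy_read(toy_ctx_t *ctx)
{
    assert(ctx != NULL);
    if (ctx->dispatch_count == 0)
        return ctx->counters; /* stale but consistent with the last read */
    for (size_t i = 0; i < TOY_NUM_COUNTERS; ++i) {
        ctx->counters[i] += ctx->pending[i];
        ctx->pending[i] = 0;
    }
    ctx->dispatch_count = 0;
    return ctx->counters;
}
```

In this model, a `toy_read` issued after a kernel launch but before `toy_complete_dispatch` has run returns the old values, which matches the "still zero" reads in the samples. The open question in this issue is why, in the real component, the dispatch is not yet recorded even after the device has been synchronized.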
However, synchronizing the streams alone does not appear to be enough to prevent the undesirable behavior: I still see the issue after calling hipDeviceSynchronize before and after each read (after is overkill, but I wanted to be sure). Pairing the device/stream synchronization with any of the following also does not help:
Stopping/re-starting the queues with each read
Flushing the context pools prior to checking the dispatch queue results
Changing the context pools' properties/configurations
E.g. smaller number of entries, alternate profiling modes, etc.
If possible, it would be ideal if we could find a means of enforcing a synchronization such that the counters are resolved with each PAPI_read.
Attached you will find vector_add.zip, which shows the following behavior on select platforms:
*****ROCm DRIVER VERSION*****
======================= ROCm System Management Interface =======================
========================= Version of System Component ==========================
Driver version: 6.0.5
================================================================================
============================= End of ROCm SMI Log ==============================
*****COMPILE*****
/opt/rocm-5.5.1/hip/bin/hipcc -g --offload-arch=gfx90a -o vector_add.o -c -I/opt/rocm-5.5.1/include -I/opt/rocm-5.5.1/include/hsa -I/<PAPI_PATH>/include vector_add.cpp
/opt/rocm-5.5.1/hip/bin/hipcc -o vector_add vector_add.o -L/opt/rocm-5.5.1/lib -lhsa-runtime64 -L/<PAPI_PATH>/lib64 -lpapi -I/opt/rocm-5.5.1/include -I/opt/rocm-5.5.1/include/hsa -I/<PAPI_PATH>/include
*****RUN*****
PAPI_read Before Kernel Launch
[JR-DEBUG] intercept_ctx_read dispatch_count=0
rocm:::GPUBusy:device=0 = 0
rocm:::SQ_WAVES:device=0 = 0
HIP Kernel Launch
hipDeviceSynchronize
PAPI_read After Kernel Launch
[JR-DEBUG] intercept_ctx_read dispatch_count=0
rocm:::GPUBusy:device=0 = 0
rocm:::SQ_WAVES:device=0 = 0
PAPI_read from PAPI_stop
[JR-DEBUG] intercept_ctx_read dispatch_count=1
rocm:::GPUBusy:device=0 = 100
rocm:::SQ_WAVES:device=0 = 16384
PASSED!
Note: in the above, the “[JR-DEBUG] intercept_ctx_read dispatch_count={0,1}” lines are a result of adding the following patch to PAPI:
@@ intercept_ctx_read(rocp_ctx_t rocp_ctx, long long **counts)
unsigned long tid = (*thread_id_fn)();
int dispatch_count = fetch_dispatch_counter(tid);
+// BEGIN JR TESTING
+ fprintf(stderr, "[JR-DEBUG] intercept_ctx_read dispatch_count=%d\n", dispatch_count);
+// END JR TESTING
if (dispatch_count == 0) {
*counts = rocp_ctx->u.intercept.counters;
goto fn_exit;
Environment configuration prior to launching reproducer:
# Setup PAPI
export PAPI_ROCM_ROOT=${ROCM_PATH}
export ROCP_METRICS=${PAPI_ROCM_ROOT}/rocprofiler/lib/metrics.xml
export HSA_TOOLS_LIB=${PAPI_ROCM_ROOT}/rocprofiler/lib/librocprofiler64.so
# Set PAPI to use intercept instead of default sampling
export ROCP_HSA_INTERCEPT=1
Let me know if there are any issues getting the reproducer going.