PAPI ROCm: Missed Reads Intercept Mode #69

Open
jrodgers-github opened this issue Aug 24, 2023 · 1 comment

@jrodgers-github

I'm finding evidence that PAPI_read calls in the PAPI ROCm component miss results when run in intercept mode. Two sample workflows show what I'm seeing:

  • Sample 1:
              PAPI_start
              PAPI_read              <-  Counter values zero as we’d expect
              <kernel launch>
              PAPI_read              <-  Counter values still zero
              PAPI_stop              <-  Meaningful counter values collected
  • Sample 2:
              PAPI_start
              PAPI_read              <-  Counter values zero as we’d expect
              <kernel launch A>
              PAPI_read              <-  Counter values still zero
              PAPI_read              <-  Counter values from <kernel launch A> collected
              <kernel launch B>
              PAPI_read              <-  No changes in values, still reports values from <kernel launch A>
              PAPI_read              <-  Counter values from <kernel launch A> + <kernel launch B> collected
              <kernel launch C>
              PAPI_read              <-  No changes in values, still reports values from <kernel launch A> + <kernel launch B>
              PAPI_stop              <-  Counter values from <kernel launch A> + <kernel launch B> + <kernel launch C> collected

Consulting with @gcongiu, this may be expected behavior as:

In intercept mode, PAPI_read(s) that happen before a kernel has finished running and/or before rocprofiler has fetched the kernel counters return whatever value was present until that point in the eventset counters (the component does not synchronize the GPU stream internally like old cuda component used to do). Otherwise, it reads the new counters (get_context_counters).

In your example above the behavior looks consistent with the ROCm component's code. If you wish to read counters for a kernel, in intercept mode, you should synchronize the stream first to make sure the kernel has finished running and the counters are collected.
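
For anyone following along, the read path described above can be modeled in a few lines of C. This is only a toy sketch: `toy_ctx_t`, `toy_record_dispatch`, and `toy_read` are made-up names, not PAPI's actual structures or functions. The only behavior taken from the component is the check on the dispatch count and the return of the cached counters when it is zero.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_EVENTS 2  /* two events, as in the reproducer's event set */

/* Hypothetical stand-in for the intercept-mode context; not PAPI's real struct. */
typedef struct {
    long long counters[NUM_EVENTS];  /* values last handed back to the caller */
    long long pending[NUM_EVENTS];   /* values from dispatches recorded since the last read */
    int dispatch_count;              /* number of dispatches recorded since the last read */
} toy_ctx_t;

/* Models the profiler recording the counters of a finished kernel. */
void toy_record_dispatch(toy_ctx_t *ctx, const long long *vals)
{
    assert(ctx != NULL && vals != NULL);
    for (int i = 0; i < NUM_EVENTS; ++i)
        ctx->pending[i] += vals[i];
    ctx->dispatch_count++;
}

/* Models the read: with no recorded dispatch, the cached values come back unchanged. */
const long long *toy_read(toy_ctx_t *ctx)
{
    assert(ctx != NULL);
    if (ctx->dispatch_count == 0)
        return ctx->counters;  /* stale-value path */

    /* Otherwise fold the newly recorded values into the cached counters. */
    for (int i = 0; i < NUM_EVENTS; ++i) {
        ctx->counters[i] += ctx->pending[i];
        ctx->pending[i] = 0;
    }
    ctx->dispatch_count = 0;
    return ctx->counters;
}
```

In this model, a read issued before `toy_record_dispatch` runs returns the old values no matter how long the kernel finished ago, which is the pattern both samples show.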

However, synchronizing the streams alone does not appear to prevent the behavior: I still see the issue after calling hipDeviceSynchronize before and after each read (the call after the read is overkill, but I wanted to be sure). Pairing the device/stream synchronization with any of the following does not help either:

  • Stopping/re-starting the queues with each read
  • Flushing the context pools prior to checking the dispatch queue results
  • Changing the context pools' properties/configurations
    • E.g. smaller number of entries, alternate profiling modes, etc.

If possible, it would be ideal to find a way to enforce synchronization so that the counters are resolved with each PAPI_read.
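
Until something like that exists, one stopgap suggested by Sample 2 (where a second PAPI_read picks up the values) is to re-issue the read a bounded number of times until the values change. The sketch below is an assumption-laden illustration, not a fix: `read_until_changed` and `fake_lagging_read` are hypothetical names, the reader is passed as a function pointer with the same shape as `PAPI_read(int EventSet, long long *values)`, and the fake reader exists only so the sketch runs without a GPU.

```c
#include <assert.h>
#include <string.h>

/* Same shape as PAPI_read(int EventSet, long long *values). */
typedef int (*read_fn_t)(int eventset, long long *values);

/*
 * Hypothetical helper: re-issue the read until any value differs from `before`.
 * Returns the number of reads issued when a change is seen, 0 if the values
 * never changed within max_tries reads, and -1 on a read error or bad argument.
 */
int read_until_changed(read_fn_t read_fn, int eventset,
                       const long long *before, long long *values,
                       int num_events, int max_tries)
{
    assert(read_fn != NULL && before != NULL && values != NULL);
    if (num_events <= 0)
        return -1;

    for (int attempt = 1; attempt <= max_tries; ++attempt) {
        if (read_fn(eventset, values) != 0)  /* 0 is PAPI_OK */
            return -1;
        if (memcmp(before, values, (size_t)num_events * sizeof(long long)) != 0)
            return attempt;
    }
    return 0;
}

/* Stand-in reader for illustration: stale on the first call, updated afterwards. */
static int fake_calls = 0;

int fake_lagging_read(int eventset, long long *values)
{
    (void)eventset;
    values[0] = (++fake_calls >= 2) ? 16384 : 0;
    return 0;
}
```

With the real component, `PAPI_read` would be passed in place of `fake_lagging_read`. A kernel whose counters are legitimately unchanged would exhaust `max_tries`, so this only narrows the window and does not replace a proper synchronization inside the component.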

@jrodgers-github

Attached you will find vector_add.zip, which shows the following behavior on select platforms:

*****ROCm DRIVER VERSION*****
======================= ROCm System Management Interface =======================
========================= Version of System Component ==========================
Driver version: 6.0.5
================================================================================
============================= End of ROCm SMI Log ==============================
*****COMPILE*****
/opt/rocm-5.5.1/hip/bin/hipcc -g --offload-arch=gfx90a -o vector_add.o -c -I/opt/rocm-5.5.1/include -I/opt/rocm-5.5.1/include/hsa -I/<PAPI_PATH>/include vector_add.cpp
/opt/rocm-5.5.1/hip/bin/hipcc -o vector_add vector_add.o -L/opt/rocm-5.5.1/lib -lhsa-runtime64 -L/<PAPI_PATH>/lib64 -lpapi -I/opt/rocm-5.5.1/include -I/opt/rocm-5.5.1/include/hsa -I/<PAPI_PATH>/include 
*****RUN*****
PAPI_read Before Kernel Launch
[JR-DEBUG] intercept_ctx_read dispatch_count=0
              rocm:::GPUBusy:device=0 = 0
              rocm:::SQ_WAVES:device=0 = 0
HIP Kernel Launch
hipDeviceSynchronize
PAPI_read After Kernel Launch
[JR-DEBUG] intercept_ctx_read dispatch_count=0
              rocm:::GPUBusy:device=0 = 0
              rocm:::SQ_WAVES:device=0 = 0
PAPI_read from PAPI_stop
[JR-DEBUG] intercept_ctx_read dispatch_count=1
              rocm:::GPUBusy:device=0 = 100
              rocm:::SQ_WAVES:device=0 = 16384
PASSED!

Note: in the above, the “[JR-DEBUG] intercept_ctx_read dispatch_count={0,1}” lines are a result of adding the following patch to PAPI:

@@ intercept_ctx_read(rocp_ctx_t rocp_ctx, long long **counts)

     unsigned long tid = (*thread_id_fn)();
     int dispatch_count = fetch_dispatch_counter(tid);
+// BEGIN JR TESTING
+    fprintf(stderr, "[JR-DEBUG] intercept_ctx_read dispatch_count=%d\n", dispatch_count);
+// END JR TESTING
     if (dispatch_count == 0) {
         *counts = rocp_ctx->u.intercept.counters;
         goto fn_exit;

Environment configuration prior to launching reproducer:

# Setup PAPI
export PAPI_ROCM_ROOT=${ROCM_PATH}
export ROCP_METRICS=${PAPI_ROCM_ROOT}/rocprofiler/lib/metrics.xml
export HSA_TOOLS_LIB=${PAPI_ROCM_ROOT}/rocprofiler/lib/librocprofiler64.so
# Set PAPI to use intercept instead of default sampling
export ROCP_HSA_INTERCEPT=1

Let me know if there are any issues getting the reproducer going.

@gcongiu gcongiu self-assigned this Aug 30, 2023