[Feature]: add a batched version of hsa_amd_profiling_convert_tick_to_system_domain
#243
Suggestion Description
I'm capturing timestamps on the device with `__builtin_readsteadycounter` (or extracting them from signals myself) and end up with quite a few of them in large buffers. I'd like to translate them without the API overhead of calling `hsa_amd_profiling_convert_tick_to_system_domain` on each one in a loop. For such cases it would be nice to have an `hsa_amd_profiling_convert_tick_batch_to_system_domain` that accepts a list of ticks and either updates them in place or writes the results to an output buffer.

What I noticed is that `GpuAgent::TranslateTime` takes a lock, does some looping math to see if synchronization is required, and potentially synchronizes. In a batched mode that could be done once, and the lock need not be held for the entire duration of the translation (t0/t1 can be reused). Batching trades off some accuracy, since the skew can change over the course of a batch, but translating all the timestamps consistently is better behavior than an outer loop: today the timestamps can change base in the middle of translation and produce inconsistent results, which messes up reporting. The user of such an API could choose the batch/flush frequency to balance the drift and rebase where it makes sense (between top-level invocations, frames, etc., where there are natural points to rebase).

Operating System
No response
GPU
No response
ROCm Component
No response
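
To make the batching idea from the description concrete, here is a minimal sketch in C. It is not the ROCr implementation: `clock_sync_t`, `translate_tick`, and `translate_tick_batch` are hypothetical names, and the linear base-plus-ratio model is an assumption about how a tick could be mapped to the system domain. It only shows the shape of the request: read the synchronization state once, then translate every tick against that single snapshot.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the agent/system clock correlation.
 * This is an assumption for illustration, not a ROCr data structure. */
typedef struct {
    uint64_t agent_base;   /* agent tick at the last synchronization point */
    uint64_t system_base;  /* system tick captured at the same moment */
    double   ratio;        /* system ticks per agent tick */
} clock_sync_t;

/* Translate a single tick against a fixed snapshot of the sync state. */
static uint64_t translate_tick(const clock_sync_t *s, uint64_t agent_tick) {
    /* Signed delta so ticks captured before the base still translate. */
    int64_t delta = (int64_t)(agent_tick - s->agent_base);
    return s->system_base + (uint64_t)(int64_t)((double)delta * s->ratio);
}

/* Translate `count` ticks with one consistent base.
 * `system_ticks` may alias `agent_ticks` for in-place translation. */
static void translate_tick_batch(const clock_sync_t *shared,
                                 const uint64_t *agent_ticks,
                                 uint64_t *system_ticks,
                                 size_t count) {
    /* A real runtime would take its lock here, resynchronize if needed,
     * copy the state, and release the lock before the loop below. */
    clock_sync_t snapshot = *shared;

    for (size_t i = 0; i < count; ++i) {
        system_ticks[i] = translate_tick(&snapshot, agent_ticks[i]);
    }
}
```

Because the snapshot is copied before the loop, every tick in a batch shares the same base, so a resynchronization on another thread cannot change the base partway through. Passing the same pointer for input and output gives the in-place variant.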