FIX for GPU resource data is not present on MI300A #21
This pull request addresses an issue where GPU resource data, such as power and temperature, was missing from traces on MI300A. The data is typically tracked through ROCm-SMI.
Changes Made
Changed the temperature sensor type from RSMI_TEMP_TYPE_EDGE to RSMI_TEMP_TYPE_JUNCTION, as per https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/doxygen/html/group__PhysQuer.html#ga40e9da04e4c0cfa17a4f38b97ebc9669, to ensure accurate temperature readings on MI300A, because RSMI_TEMP_TYPE_EDGE failed to return a valid measurement there.
ROCPROFSYS_RSMI_GET(get_settings(m_dev_id).temp, rsmi_dev_temp_metric_get, _dev_id, RSMI_TEMP_TYPE_JUNCTION, RSMI_TEMP_CURRENT, &m_temp);
Replaced rsmi_dev_power_ave_get with rsmi_dev_power_get and passed a power type initialized to RSMI_CURRENT_POWER, as per https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/doxygen/html/group__PowerQuer.html#ga5c6fcd74be46ae056526c96d1a3ac09b, to ensure accurate power readings on MI300A.
RSMI_POWER_TYPE power_type = RSMI_CURRENT_POWER;
ROCPROFSYS_RSMI_GET(get_settings(m_dev_id).power, rsmi_dev_power_get, _dev_id, &m_power, &power_type);
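This pull request switches the sensor type outright. As a loose illustration of the underlying problem (one sensor type failing on a given device), the sketch below tries several sensor types in order and keeps the first valid reading. It is not what this PR implements. Every name in it (TempSensor, Status, TempQuery, read_temperature, fake_mi300a_query) is a hypothetical stand-in, not part of ROCm SMI or this codebase, and the temperature value is arbitrary. It assumes C++17.

```cpp
#include <array>
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical stand-ins so the sketch builds without ROCm SMI installed.
enum class TempSensor { Edge, Junction };
enum class Status { Success, NotSupported };

// Hypothetical query signature: writes a reading to `out` on success.
using TempQuery =
    std::function<Status(std::uint32_t dev, TempSensor sensor, std::int64_t* out)>;

// Try each sensor type in order and return the first valid reading, if any.
std::optional<std::int64_t>
read_temperature(std::uint32_t dev, const TempQuery& query)
{
    constexpr std::array<TempSensor, 2> order = { TempSensor::Junction,
                                                  TempSensor::Edge };
    for(TempSensor sensor : order)
    {
        std::int64_t value = 0;
        if(query(dev, sensor, &value) == Status::Success) return value;
    }
    return std::nullopt;  // no sensor type produced a valid measurement
}

// Fake device that rejects the edge sensor, mimicking the behavior
// this PR reports for MI300A. The reading itself is an arbitrary value.
Status
fake_mi300a_query(std::uint32_t, TempSensor sensor, std::int64_t* out)
{
    if(sensor != TempSensor::Junction) return Status::NotSupported;
    *out = 45000;
    return Status::Success;
}
```

Calling read_temperature(0, fake_mi300a_query) yields the junction reading, while a query that rejects every sensor type yields an empty optional, which a caller could use to skip the temperature track instead of recording invalid data.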
Reason for Changes
These changes ensure that GPU resource data, such as power and temperature, is tracked and included in traces on MI300A. The previous code produced no power or temperature traces on MI300A.
Testing