FIX for GPU resource data is not present on MI300A #21

Closed
wants to merge 2 commits into from

pranswarup
Contributor

This pull request addresses an issue where GPU resource data, such as power and temperature, was missing from traces on MI300A. The data is typically tracked through ROCm-SMI.

Changes Made

  1. Temperature Metric:
    • Changed the temperature metric from RSMI_TEMP_TYPE_EDGE to RSMI_TEMP_TYPE_JUNCTION, as per https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/doxygen/html/group__PhysQuer.html#ga40e9da04e4c0cfa17a4f38b97ebc9669, to ensure accurate temperature readings for MI300A. RSMI_TEMP_TYPE_EDGE failed to return a valid measurement on MI300A.

ROCPROFSYS_RSMI_GET(get_settings(m_dev_id).temp, rsmi_dev_temp_metric_get, _dev_id,
                    RSMI_TEMP_TYPE_JUNCTION, RSMI_TEMP_CURRENT, &m_temp);
  2. Power Metric:
    • Updated the power metric retrieval method from rsmi_dev_power_ave_get to rsmi_dev_power_get and specified the power type as RSMI_CURRENT_POWER, as per https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/doxygen/html/group__PowerQuer.html#ga5c6fcd74be46ae056526c96d1a3ac09b, to ensure accurate power readings for MI300A.

RSMI_POWER_TYPE power_type = RSMI_CURRENT_POWER;
ROCPROFSYS_RSMI_GET(get_settings(m_dev_id).power, rsmi_dev_power_get, _dev_id, &m_power,
                    &power_type);
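
The PR swaps the temperature sensor type outright. As an illustration only, and not what this PR implements, the sketch below shows an alternative that tries a list of sensors in order and keeps the first valid reading, so a device that rejects the edge sensor can still report a junction temperature. Everything in it is a hypothetical stand-in: `Status`, `TempSensor`, `TempReader`, `read_temp_with_fallback`, and `fake_mi300a_temp` are made up for this example and do not exist in ROCm SMI or rocprofiler-systems. The simulated reading of 45000 is an arbitrary value.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical stand-ins for the ROCm SMI status codes and sensor enum.
// A real build would use the types from the ROCm SMI header instead.
enum class Status { Success, NotSupported };
enum class TempSensor { Edge, Junction, Memory };

// Hypothetical reader: writes the reading to `out` and reports whether
// the requested sensor is supported on the device.
using TempReader =
    std::function<Status(std::uint32_t dev, TempSensor sensor, std::int64_t& out)>;

// Try each sensor in `order` and return the first successful reading,
// or std::nullopt when none of the listed sensors is supported.
std::optional<std::int64_t>
read_temp_with_fallback(const TempReader& read, std::uint32_t dev,
                        const std::vector<TempSensor>& order)
{
    assert(!order.empty() && "at least one sensor must be listed");
    for (TempSensor sensor : order)
    {
        std::int64_t value = 0;
        if (read(dev, sensor, value) == Status::Success) return value;
    }
    return std::nullopt;
}

// Simulated device that behaves as described for MI300A in this PR:
// the edge sensor is rejected, the junction sensor returns a value.
Status fake_mi300a_temp(std::uint32_t /*dev*/, TempSensor sensor, std::int64_t& out)
{
    if (sensor == TempSensor::Junction)
    {
        out = 45000;  // arbitrary simulated reading
        return Status::Success;
    }
    return Status::NotSupported;
}
```

Calling `read_temp_with_fallback(fake_mi300a_temp, 0, {TempSensor::Edge, TempSensor::Junction})` against the simulated device skips the rejected edge sensor and returns the junction value. The same ordering idea could apply to the power query, at the cost of one extra failed call per sample on devices that lack the first choice.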

Reason for Changes

The previous code did not produce power or temperature traces on MI300A. These changes ensure that this GPU resource data is tracked and included in traces for MI300A.

Testing

  • Verified that the updated metrics provide accurate and complete data for MI300A & MI300X.
  • Ensured that the changes do not affect the functionality for other GPU models.

@dgaliffiAMD changed the title from "FIX for SWDEV-492298 [rocprofiler-systems] GPU resource data is not present on MI300A" to "FIX for GPU resource data is not present on MI300A" on Nov 7, 2024
@dgaliffiAMD
Collaborator

Hi @pranswarup, the direction of the PR is reversed. Let's close this one and try again.

@dgaliffiAMD dgaliffiAMD closed this Nov 7, 2024