Hi, I'm using the NVIDIA GPU Operator to expose GPUs on my OpenShift cluster, and I'm trying to write a PromQL aggregation that correlates GPU process IDs (PIDs) with Kubernetes Pods.
In the exported DCGM metrics, I found no metric with a label representing the GPU PID.
In the DCGM release notes, the following is mentioned:
The following features have been dropped or deprecated starting with DCGM 3.0:
The following field identifiers have been removed:
DCGM_FI_DEV_GRAPHICS_PIDS
DCGM_FI_DEV_COMPUTE_PIDS
...
My question: is there a way to retrieve this information in the current version?
Let me know if I should submit this issue to the DCGM GitHub repo instead.
As a workaround for the time being, I run a custom DaemonSet on all GPU nodes. It executes the following command to correlate GPU PIDs with Pod UIDs, and I expose the result as a custom metric:
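For illustration only (this is not necessarily the command the DaemonSet runs), here is a minimal Python sketch of one way to do this kind of correlation. It lists compute PIDs with `nvidia-smi --query-compute-apps` and reads each PID's `/proc/<pid>/cgroup` to extract the Pod UID. The function names are hypothetical. It assumes the DaemonSet pod runs with `hostPID: true` so that host PIDs resolve under `/proc`, that `nvidia-smi` is available in the container, and that the kubelet's cgroup paths embed the Pod UID after a `pod` prefix (with underscores under the systemd cgroup driver).

```python
import re
import subprocess
from pathlib import Path

# Assumption: kubelet cgroup paths contain "pod<uid>", where the UID uses
# "-" (cgroupfs driver) or "_" (systemd driver) as its separator.
_POD_UID_RE = re.compile(
    r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})"
)


def pod_uid_from_cgroup(cgroup_text):
    """Return the Pod UID found in /proc/<pid>/cgroup contents, or None."""
    match = _POD_UID_RE.search(cgroup_text)
    return match.group(1).replace("_", "-") if match else None


def parse_compute_apps(csv_text):
    """Parse 'pid, gpu_uuid' CSV lines into (pid, gpu_uuid) tuples."""
    pairs = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 2 and fields[0].isdigit():
            pairs.append((int(fields[0]), fields[1]))
    return pairs


def gpu_pid_to_pod_uid():
    """Map each GPU compute PID to (gpu_uuid, pod_uid) on this node."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,gpu_uuid",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    result = {}
    for pid, gpu_uuid in parse_compute_apps(out):
        try:
            cgroup_text = Path(f"/proc/{pid}/cgroup").read_text()
        except OSError:
            continue  # the process exited between the two reads
        result[pid] = (gpu_uuid, pod_uid_from_cgroup(cgroup_text))
    return result


if __name__ == "__main__":
    for pid, (gpu_uuid, pod_uid) in gpu_pid_to_pod_uid().items():
        print(pid, gpu_uuid, pod_uid)
```

The resulting Pod UID could then be attached as a label on a custom metric and joined in PromQL against a metric that carries the same UID. Processes that do not belong to a Pod yield `None` for the UID.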
Versions:
OpenShift: v4.12.35
Kubernetes: v1.25.12+ba5cc25
NVIDIA GPU Operator: v23.3.2
DCGM Exporter: v3.1.7