cuda: fix check for GPU device availability #2510

rst0git · 2024-11-02T08:43:55Z

Checking for /dev/nvidiactl to decide whether the CUDA plugin can be used is unreliable, because in some setups the driver is not installed under the default path ¹. This pull request changes the logic to check whether a GPU device is listed in /proc/driver/nvidia/gpus/. This approach is similar to torch.cuda.is_available() and is a more accurate indicator. The subsequent check for support of the cuda-checkpoint --action option then confirms whether the driver supports checkpoint/restore.

Fixes: #2509

¹ https://github.com/NVIDIA/gpu-operator
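The directory-based check described above could look roughly like the sketch below. This is not the code from the patch: the helper names `count_dir_entries` and `gpu_device_available` are hypothetical, and it assumes that a non-empty /proc/driver/nvidia/gpus/ directory means at least one GPU is available.

```c
#include <dirent.h>
#include <stdbool.h>
#include <string.h>

/* Path named in the pull request description. */
#define NVIDIA_PROC_GPUS_DIR "/proc/driver/nvidia/gpus/"

/*
 * Hypothetical helper: count the entries in `path`, skipping "." and "..".
 * Returns -1 if the directory cannot be opened (for example, when the
 * NVIDIA driver is not loaded and the procfs directory does not exist).
 */
int count_dir_entries(const char *path)
{
	DIR *dir = opendir(path);
	struct dirent *entry;
	int count = 0;

	if (!dir)
		return -1;

	while ((entry = readdir(dir)) != NULL) {
		if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
			continue;
		count++;
	}

	closedir(dir);
	return count;
}

/*
 * Hypothetical helper: assume a GPU is available if the driver lists
 * at least one entry under NVIDIA_PROC_GPUS_DIR.
 */
bool gpu_device_available(void)
{
	return count_dir_entries(NVIDIA_PROC_GPUS_DIR) > 0;
}
```

A caller would treat a `false` result as "skip the CUDA plugin" and only then move on to the cuda-checkpoint --action check mentioned in the description.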

plugins/cuda/cuda_plugin.c

avagin · 2024-11-08T03:44:49Z

LGTM. Thanks.

The check for `/dev/nvidiactl` to determine if the CUDA plugin can be used is unreliable because in some cases the default path for driver installation is different [1]. This patch changes the logic to check if a GPU device is available in `/proc/driver/nvidia/gpus/`. This approach is similar to `torch.cuda.is_available()` and it is a more accurate indicator. The subsequent check for support of the `cuda-checkpoint --action` option would confirm if the driver supports checkpoint/restore.

[1] https://github.com/NVIDIA/gpu-operator

Fixes: checkpoint-restore#2509

Signed-off-by: Radostin Stoyanov <[email protected]>

rst0git marked this pull request as ready for review November 2, 2024 09:37

rst0git requested review from jesus-ramos and avagin and removed request for jesus-ramos November 2, 2024 09:37

jesus-ramos approved these changes Nov 2, 2024


avagin reviewed Nov 4, 2024

plugins/cuda/cuda_plugin.c (outdated, resolved)

rst0git force-pushed the 2024-11-02-cuda-check branch 4 times, most recently from 9a50892 to a69ea00 Compare November 4, 2024 23:40

avagin reviewed Nov 8, 2024

plugins/cuda/cuda_plugin.c (outdated, resolved)

avagin reviewed Nov 8, 2024

plugins/cuda/cuda_plugin.c (outdated, resolved)

rst0git force-pushed the 2024-11-02-cuda-check branch 4 times, most recently from 3d103d0 to 3940ee7 Compare November 10, 2024 17:08

rst0git changed the title ~~cuda: check for libcuda instead of /dev/nvidiactl~~ cuda: fix check for GPU device availability Nov 10, 2024

rst0git force-pushed the 2024-11-02-cuda-check branch from 3940ee7 to de9d552 Compare November 10, 2024 17:13

avagin merged commit 26dcc21 into checkpoint-restore:criu-dev Nov 12, 2024
38 of 41 checks passed

rst0git deleted the 2024-11-02-cuda-check branch November 12, 2024 09:20
