cuda: enable checkpoint support for paused tasks #2517

rst0git · 2024-11-12T14:35:12Z

If a CUDA process is already in a "locked" or "checkpointed" state during criu dump, the CUDA plugin fails with an error because it attempts an unnecessary "lock" action using the cuda-checkpoint tool. This pull request extends the CUDA plugin to handle this case: it first checks the original state of the CUDA process and skips the "lock" and "checkpoint" actions that are no longer needed because the process was already locked or checkpointed before CRIU was invoked.
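The state check described above can be sketched as a small decision function. This is an illustrative sketch only, not the code in plugins/cuda/cuda_plugin.c: the enum, the struct, and both function names are hypothetical, and the "running" state string is an assumption (the description only names "locked" and "checkpointed"). It also assumes that a locked task still needs the "checkpoint" action, while a checkpointed task needs neither.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the code from cuda_plugin.c. */
enum cuda_task_state {
	CUDA_TASK_RUNNING,
	CUDA_TASK_LOCKED,
	CUDA_TASK_CHECKPOINTED,
	CUDA_TASK_UNKNOWN,
};

/* Which actions the dump path still has to perform for one task. */
struct cuda_dump_plan {
	bool valid;           /* false if the initial state was not recognized */
	bool need_lock;       /* run the "lock" action */
	bool need_checkpoint; /* run the "checkpoint" action */
};

/* Map a state string (assumed to be reported by the tool) to the enum. */
static enum cuda_task_state cuda_parse_state(const char *s)
{
	if (s == NULL)
		return CUDA_TASK_UNKNOWN;
	if (strcmp(s, "running") == 0) /* assumed name for the normal state */
		return CUDA_TASK_RUNNING;
	if (strcmp(s, "locked") == 0)
		return CUDA_TASK_LOCKED;
	if (strcmp(s, "checkpointed") == 0)
		return CUDA_TASK_CHECKPOINTED;
	return CUDA_TASK_UNKNOWN;
}

/* Decide which actions to skip based on the state seen before the dump. */
static struct cuda_dump_plan cuda_plan_dump(enum cuda_task_state initial)
{
	struct cuda_dump_plan plan = { false, false, false };

	switch (initial) {
	case CUDA_TASK_RUNNING:
		/* Nothing was done beforehand: lock, then checkpoint. */
		plan.valid = true;
		plan.need_lock = true;
		plan.need_checkpoint = true;
		break;
	case CUDA_TASK_LOCKED:
		/* Already locked: skip "lock", still checkpoint. */
		plan.valid = true;
		plan.need_checkpoint = true;
		break;
	case CUDA_TASK_CHECKPOINTED:
		/* Already checkpointed: skip both actions. */
		plan.valid = true;
		break;
	case CUDA_TASK_UNKNOWN:
		/* Caller should treat this as an error. */
		break;
	}
	return plan;
}
```

A caller would query the task state once before dumping, pass it through `cuda_parse_state()`, and run only the actions flagged in the returned plan. Recording the initial state also lets a later stage leave the task in the state it was found in, although whether the plugin does this is not stated here.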

plugins/cuda/cuda_plugin.c

jesus-ramos

LGTM

avagin · 2024-11-12T20:31:41Z

Is there any real use-case for that?

rst0git · 2024-11-12T20:51:11Z

> Is there any real use-case for that?

The use-case is similar to #2514: CUDA tasks may already be in a "locked" or "checkpointed" state before criu dump is invoked. This ensures consistent checkpoint/restore, particularly in distributed model training where multiple containers run across different cluster nodes.

If a CUDA process is already in a "locked" or "checkpointed" state during criu dump, the CUDA plugin currently fails with an error because it attempts an unnecessary "lock" action using the cuda-checkpoint tool. This patch extends the CUDA plugin to handle such cases by first verifying the initial state of the CUDA processes and skipping unnecessary "lock" and "checkpoint" actions when a process has been locked or checkpointed before CRIU is invoked.

In particular, CUDA tasks may already be in a "locked" or "checkpointed" state to ensure consistent checkpoint/restore for distributed workloads, such as model training, where multiple containers run across different cluster nodes. Another use case for this functionality is optimizing resource utilization, where low-priority CUDA tasks are preempted immediately to release GPU resources needed by high-priority tasks, and the paused workloads are later resumed or migrated to another node.

Signed-off-by: Radostin Stoyanov <[email protected]>

rst0git · 2024-11-13T10:30:20Z

@avagin I've updated the commit message with a brief description of the use-cases for this functionality.

rst0git requested review from jesus-ramos and avagin November 12, 2024 14:35

rst0git force-pushed the 2024-11-12-cuda-checkpointing-paused-tasks branch 6 times, most recently from 03b51f9 to 8f544eb November 12, 2024 17:00

jesus-ramos reviewed Nov 12, 2024

plugins/cuda/cuda_plugin.c (outdated, resolved)

rst0git force-pushed the 2024-11-12-cuda-checkpointing-paused-tasks branch from 8f544eb to 030bc9a November 12, 2024 20:11

jesus-ramos approved these changes Nov 12, 2024

rst0git force-pushed the 2024-11-12-cuda-checkpointing-paused-tasks branch from 030bc9a to 58de16a November 12, 2024 20:35

rst0git added 2 commits November 13, 2024 10:25

test: add get-state to mocked cuda-checkpoint tool

20a7cfa

Signed-off-by: Radostin Stoyanov <[email protected]>
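To illustrate what a mock with state reporting needs to model, here is a minimal in-memory sketch of the task state machine. It is not the mocked tool from this commit: all names are hypothetical, the "restore" and "unlock" actions and the "running" state are assumptions not mentioned in this conversation, and a real mock would be a separate executable that has to persist state between invocations. The one behavior taken from the PR description is that a "lock" action on a task that is already locked fails.

```c
#include <string.h>

/* Hypothetical in-memory model of one CUDA task for a mock tool. */
enum mock_state {
	MOCK_RUNNING,
	MOCK_LOCKED,
	MOCK_CHECKPOINTED,
};

struct mock_cuda_task {
	enum mock_state state;
};

/* Report the current state as a string, as a get-state query would. */
static const char *mock_get_state(const struct mock_cuda_task *task)
{
	switch (task->state) {
	case MOCK_RUNNING:
		return "running"; /* assumed state name */
	case MOCK_LOCKED:
		return "locked";
	case MOCK_CHECKPOINTED:
		return "checkpointed";
	}
	return "unknown";
}

/*
 * Apply an action by name. Returns 0 on success and -1 when the action is
 * unknown or not valid in the current state (for example, "lock" on a task
 * that is already locked).
 */
static int mock_apply_action(struct mock_cuda_task *task, const char *action)
{
	if (strcmp(action, "lock") == 0 && task->state == MOCK_RUNNING) {
		task->state = MOCK_LOCKED;
		return 0;
	}
	if (strcmp(action, "checkpoint") == 0 && task->state == MOCK_LOCKED) {
		task->state = MOCK_CHECKPOINTED;
		return 0;
	}
	/* "restore" and "unlock" are assumed inverse actions. */
	if (strcmp(action, "restore") == 0 && task->state == MOCK_CHECKPOINTED) {
		task->state = MOCK_LOCKED;
		return 0;
	}
	if (strcmp(action, "unlock") == 0 && task->state == MOCK_LOCKED) {
		task->state = MOCK_RUNNING;
		return 0;
	}
	return -1;
}
```

With a model like this, a test can start a task in the locked or checkpointed state and confirm that the dump path queries the state first instead of issuing a "lock" that would fail.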

rst0git force-pushed the 2024-11-12-cuda-checkpointing-paused-tasks branch from 58de16a to 20a7cfa November 13, 2024 10:28

avagin merged commit dd6b580 into checkpoint-restore:criu-dev Nov 13, 2024
37 of 41 checks passed

rst0git deleted the 2024-11-12-cuda-checkpointing-paused-tasks branch November 13, 2024 15:17
