cuda: enable checkpoint support for paused tasks #2517

Commits on Nov 13, 2024

  1. cuda: enable checkpoint support for paused tasks

    If a CUDA process is already in a "locked" or "checkpointed" state
    during criu dump, the CUDA plugin currently fails with an error because
    it attempts an unnecessary "lock" action using the cuda-checkpoint tool.
    
    This patch extends the CUDA plugin to handle such cases by first
    verifying the initial state of the CUDA processes and skipping
    unnecessary "lock" and "checkpoint" actions when a process has been
    locked or checkpointed before CRIU is invoked.
    
    In particular, CUDA tasks may already be in a "locked" or "checkpointed"
    state to ensure consistent checkpoint/restore for distributed workloads,
    such as model training, where multiple containers run across different
    cluster nodes.
    
    Another use case for this functionality is optimizing resource
    utilization: low-priority CUDA tasks are preempted immediately to
    release GPU resources needed by high-priority tasks, and the paused
    workloads are later resumed or migrated to another node.
    
    Signed-off-by: Radostin Stoyanov <[email protected]>
    rst0git committed Nov 13, 2024
    ff750a9
  2. test: add get-state to mocked cuda-checkpoint tool

    Signed-off-by: Radostin Stoyanov <[email protected]>
    rst0git committed Nov 13, 2024
    20a7cfa
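
The decision described in the first commit message (check the initial state, then skip any "lock" or "checkpoint" action that has already been done) can be sketched roughly as below. This is an illustrative sketch, not the code from this pull request: the names `cuda_state_t`, `cuda_plan_t`, `parse_cuda_state`, and `plan_dump_actions` are hypothetical, and the state strings `running`, `locked`, and `checkpointed` are assumed to be what the state query returns with any trailing newline already stripped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state names; "locked" and "checkpointed" come from the
 * commit message, "running" is assumed to be the normal state. */
typedef enum {
	CUDA_STATE_RUNNING,
	CUDA_STATE_LOCKED,
	CUDA_STATE_CHECKPOINTED,
	CUDA_STATE_UNKNOWN
} cuda_state_t;

/* Which actions still have to be performed during dump. */
typedef struct {
	int do_lock;
	int do_checkpoint;
} cuda_plan_t;

/* Map the text reported for a process to a state (assumed format:
 * a bare state word, trailing newline already removed). */
static cuda_state_t parse_cuda_state(const char *s)
{
	if (s == NULL)
		return CUDA_STATE_UNKNOWN;
	if (strcmp(s, "running") == 0)
		return CUDA_STATE_RUNNING;
	if (strcmp(s, "locked") == 0)
		return CUDA_STATE_LOCKED;
	if (strcmp(s, "checkpointed") == 0)
		return CUDA_STATE_CHECKPOINTED;
	return CUDA_STATE_UNKNOWN;
}

/* Fill in the actions that are still needed for the given initial state.
 * Returns 0 on success, -1 if the state is not one we know how to handle. */
static int plan_dump_actions(cuda_state_t initial, cuda_plan_t *plan)
{
	assert(plan != NULL);
	plan->do_lock = 0;
	plan->do_checkpoint = 0;

	switch (initial) {
	case CUDA_STATE_RUNNING:
		/* Nothing done yet: lock first, then checkpoint. */
		plan->do_lock = 1;
		plan->do_checkpoint = 1;
		return 0;
	case CUDA_STATE_LOCKED:
		/* Already locked before CRIU was invoked: skip "lock". */
		plan->do_checkpoint = 1;
		return 0;
	case CUDA_STATE_CHECKPOINTED:
		/* Already checkpointed: skip both actions. */
		return 0;
	default:
		return -1;
	}
}
```

A caller would parse the reported state once per CUDA process, build the plan, and run only the actions the plan marks as needed. Remembering the initial state also makes it possible to leave a process in that state after the dump, although whether the plugin does so is not stated in the commit message.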