Mujica Checkpoint: fast, low-overhead LLM checkpointing and flash recovery

MujicaChk

The project aims to provide efficient in-memory checkpointing and fast fault recovery for large-scale LLM training.

Parts of the project code reference dlrover-flashcheckpoint/Gemini.

MujicaChk has no effect on training accuracy. We still recommend that users save a regular checkpoint with torch.save at the end of each epoch to safeguard training.

Get Started

MujicaChk builds on DeepSpeed and can be easily embedded into a training loop.

Example

Save

        >>> engine, optimizer, _, lr_scheduler = deepspeed.initialize(...)
        >>> InmemoryCheckpointer = DeepSpeedCheckpointer(engine, save_dir)
        >>> if args.save_model_step is not None and global_step % args.save_model_step == 0:
        >>>     InmemoryCheckpointer.save_checkpoint(save_dir)
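The periodic-save guard above can be exercised without DeepSpeed installed. The sketch below uses a hypothetical `CheckpointerStub` in place of `DeepSpeedCheckpointer` to show how the `global_step % save_model_step` cadence drives checkpointing; the training loop body is elided.

```python
class CheckpointerStub:
    """Hypothetical stand-in for DeepSpeedCheckpointer: records the steps
    at which save_checkpoint() is called instead of writing anything."""

    def __init__(self, save_dir):
        self.save_dir = save_dir
        self.saved_steps = []

    def save_checkpoint(self, save_dir, step=None):
        self.saved_steps.append(step)


def training_loop(total_steps, save_model_step, checkpointer):
    for global_step in range(1, total_steps + 1):
        # ... forward / backward / optimizer step elided ...
        # Same cadence check as in the README snippet above:
        if save_model_step is not None and global_step % save_model_step == 0:
            checkpointer.save_checkpoint(checkpointer.save_dir, step=global_step)


ckpt = CheckpointerStub("/tmp/ckpt")
training_loop(total_steps=10, save_model_step=4, checkpointer=ckpt)
print(ckpt.saved_steps)  # -> [4, 8]
```

With `save_model_step=4`, checkpoints fire at steps 4 and 8; the real checkpointer would copy engine state into host memory at those points.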

Load

        >>> engine, optimizer, _, lr_scheduler = deepspeed.initialize(...)
        >>> InmemoryCheckpointer = DeepSpeedCheckpointer(engine, save_dir)
        >>> InmemoryCheckpointer.load_checkpoint(save_dir)
