-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster intermediate checkpoints with DCP async save in TorchTune #2006
Changes from all commits
baf00bd
bceebde
47b096b
1da9339
8e9593a
7fb8c6f
e745717
21b6ded
74ae681
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -197,7 +197,7 @@ def load_checkpoint(self, cfg_checkpointer: DictConfig) -> Dict[str, Any]: | |
""" | ||
self._checkpointer = config.instantiate( | ||
cfg_checkpointer, | ||
resume_from_checkpoint=self._resume_from_checkpoint, | ||
should_load_recipe_state=self._resume_from_checkpoint, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this will break every config or system that may be using this function. Although i agree its a better name, i dont think it is worth to change it in this PR. A possibly better alternative is to keep both, handle it in the class, and raise a deprecation warning. We should check what others think. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If users have custom recipes with the resume_from_checkpoint, then it may be better to deprecate the resume_from_checkpoint and introduce should_load_recipe_state, otherwise it will be a backward incompatible change. However if thats not the case, then it seems okay to me to update all of our recipes which is a low risk since there is no functionality change. This PR already updates those. cc @ebsmothers who had this comment in the earlier version of the PR. |
||
) | ||
checkpoint_dict = self._checkpointer.load_checkpoint() | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets make this an utility to reduce the footprint of checkpointing in the recipe. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about it. Challenge is that it would be a slightly bigger refactor. The recipe keys are defined in the training module which causes a circular dependency. We would need to refactor that out of training first and then create an util for this method. I will leave it as it is for now since its an existing method in the recipe.