Log memory usage during checkpoint save #677
Conversation
ghstack-source-id: aa906a002fccbc9e80acfe3c4848febe23d5071f Pull Request resolved: #590
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/677
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 0b4fbe1 with merge base c9d1cdc.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This seems mostly for debugging, right? Why does the user care if this works? I think we should distinguish between logs that are for debugging (mainly for us) and logs for actual users. Even if the peak memory is high, the % of users who'll understand that it's because of the `_load_state_dict` call is really small. I think we should think about the intent of these logs a bit and see how to help the user on this journey. Going to request changes for now and let's discuss.
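For context, here is a minimal sketch of what logging peak CUDA memory around a checkpoint save could look like. This is not the PR's actual diff: the function name and logging format are hypothetical, and the real recipe's `save_checkpoint` may gather state dicts differently. It only uses standard `torch.cuda` memory APIs.

```python
import logging

import torch

logger = logging.getLogger(__name__)


def save_checkpoint_with_memory_logging(model: torch.nn.Module, path: str) -> None:
    """Save a model state dict and log the peak CUDA memory observed during the save.

    Illustrative sketch only; the torchtune recipe's own save_checkpoint may differ.
    """
    if torch.cuda.is_available():
        # Reset the peak tracker so max_memory_allocated() reflects only the save.
        torch.cuda.reset_peak_memory_stats()

    # Materializing the state dict (and any intermediate copies made while
    # gathering it) is what typically drives the temporary memory spike.
    state_dict = model.state_dict()
    torch.save(state_dict, path)

    if torch.cuda.is_available():
        peak_gib = torch.cuda.max_memory_allocated() / 1024**3
        logger.info(f"Peak CUDA memory during checkpoint save: {peak_gib:.2f} GiB")
```

Whether such a log is aimed at maintainers debugging memory spikes or at end users is exactly the question raised above; the sketch only shows the mechanics.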
Context
Adds memory usage logging during `save_checkpoint`.

Changelog

Test plan