Log memory usage during checkpoint save #677

Closed
wants to merge 12 commits into from

rohan-varma
Member

Context

  • To help debug peak memory usage during checkpoint save, adding logs at the end of save_checkpoint.
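
A minimal sketch of what such logging could look like, not the code from this PR. The helper names `peak_memory_stats` and `save_checkpoint_with_memory_log` are hypothetical, and the sketch assumes a Unix-like system (for the `resource` module). CUDA figures are reported only when `torch` is installed and a GPU is available.

```python
import logging
import resource
import sys

try:
    import torch
except ImportError:  # torch is optional for this sketch
    torch = None

log = logging.getLogger(__name__)

_GIB = 1024 ** 3


def peak_memory_stats():
    """Return peak memory figures in GiB (hypothetical helper)."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    rss_bytes = rss if sys.platform == "darwin" else rss * 1024
    stats = {"peak_cpu_rss_gib": rss_bytes / _GIB}
    if torch is not None and torch.cuda.is_available():
        stats["peak_cuda_allocated_gib"] = torch.cuda.max_memory_allocated() / _GIB
        stats["peak_cuda_reserved_gib"] = torch.cuda.max_memory_reserved() / _GIB
    return stats


def save_checkpoint_with_memory_log(save_fn, *args, **kwargs):
    """Call save_fn, then log peak memory once the save has finished."""
    result = save_fn(*args, **kwargs)
    stats = peak_memory_stats()
    log.info(
        "Memory after checkpoint save: %s",
        ", ".join(f"{name}={value:.2f}" for name, value in stats.items()),
    )
    return result
```

Here `save_fn` stands in for whatever callable actually writes the checkpoint. The peaks are process-wide high-water marks, so they reflect everything that ran before the save as well, unless the CUDA counters are reset beforehand.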

Changelog

  • ...

Test plan

  • ....

pytorch-bot bot commented Apr 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/677

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0b4fbe1 with merge base c9d1cdc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Apr 10, 2024
@rohan-varma rohan-varma requested a review from ebsmothers April 10, 2024 00:42
Contributor

@kartikayk kartikayk left a comment

This seems mostly for debugging, right? Why does the user care whether this works? I think we should distinguish between logs that are for debugging (mainly for us) and logs that are for actual users. Even if the peak memory is high, I think the percentage of users who'll understand that this is because of the _load_state_dict call is really small. We should think about the intent of these logs a bit and see how to help the user on this journey. Going to request changes for now, and let's discuss.

@rohan-varma rohan-varma closed this May 6, 2024
@rohan-varma rohan-varma deleted the log_ckpt branch May 6, 2024 08:53
4 participants