-
Notifications
You must be signed in to change notification settings - Fork 471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix lora single device fine tune checkpoint saving & nan loss when use_dora=True #1909
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1909
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit b6e7239 with merge base e99b890 (): This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1909 +/- ##
==========================================
- Coverage 70.44% 70.31% -0.14%
==========================================
Files 308 308
Lines 16270 16285 +15
==========================================
- Hits 11462 11450 -12
- Misses 4808 4835 +27 ☔ View full report in Codecov by Sentry. |
eadd920
to
33c856a
Compare
Thank you so much for contributing this fix @mirceamironenco. Would you be able to share a small training run showing that this works OK? Another thing that would be great to see would be modifying one of our LoRA recipe tests to make sure the SD loading logic works as expected for DoRA. You could add this here
And even just add a parameterize to the test with but I won't block on this - happy to file an issue for a follow up (and to add a unit test). Evan will have some thoughts here too so we should make sure he's happy. |
Running
Log file:
|
Thanks for running this! I'll try out the vision config. |
My logs with
Still seems to go up just not as much. Note that we are processing more tokens so it's not that unusual, but I'm not familiar enough with how to recipe should behave to say more. |
Thanks for working on this! Looks like CI is red though. Re @SalmanMohammadi's question about memory consumption, we can pull some profiles to figure out what's going on. In my experience DoRA is definitely more memory hungry, but a 2 GB difference seems high for a model of this size. It's not a blocker to land the bug fix, but I'm curious whether the increase in memory scales with model size (if so we could run into problems with 7B+ models). As for the fact that it increases over the first N steps that could be OK (and expected) given batch-dependent sequence lengths, as long as it stabilizes. |
2eea592
to
b6e7239
Compare
I've added a fix that should address the failing CI for dora. This changes the initialization of the magnitude parameter so that a new torchtune/torchtune/modules/peft/dora.py Lines 82 to 91 in b6e7239
Currently the recipes construct the adapter params dict before the initialization:
Because torchtune/torchtune/modules/peft/dora.py Line 73 in e99b890
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mirceamironenco for fixing this! This looks great to me
(Very) belated discovery: this breaks DoRA distributed training. E.g.
fails after this commit due to magnitude still being on meta device. 22 @SLR722 |
Context
What is the purpose of this PR? Is it to
Fixes #1903 .
Changelog
What are the changes made in this PR?
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
pre-commit install
)pytest tests
pytest tests -m integration_test
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example