Add Ascend NPU as a backend #1826
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1826. Note: links to docs will display an error until the docs builds have completed. ✅ No failures as of commit 1485314 with merge base 24d3579. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Force-pushed ca78eca to b6332dd
Hi @ebsmothers, @RdoubleA: I hope you’re doing well! Could you please help review my code? I would really appreciate it if you could take a look and share any feedback or suggestions. Thank you so much in advance for your time and support! 😊 Best regards
Hi @noemotiovon thanks for the PR! And apologies for the delay in getting to the review here. A couple other questions I have that don't really fit neatly anywhere inline:
- Do we expect compile to work? If so, we should test that. If not, we could raise an error
- Do we expect quant-related APIs (e.g. QLoRA or QAT) from torchao to work? Same as point 1: if so we should test or possibly raise an error
- PyTorch has now released 2.5 as stable. In general we do not claim to support anything but the latest stable release of PyTorch -- do you know the contract on torch_npu releases here?
Distributed training seems to have problems, e.g. qat_distributed. @noemotiovon
I would be very happy to! I will contact you via email.
@noemotiovon through 126 email, thanks. Looking forward to your email.
Force-pushed b6332dd to 1aad0d2
Basic Usage Test
Hi @ebsmothers, thank you very much for reviewing my code!
Force-pushed ecbacac to 5d6cf85
Hi @ebsmothers, could you please take a moment to review the code?
Force-pushed 5d6cf85 to 03f782c
Hi @noemotiovon sorry for the delay! I will take a look tomorrow if that's alright. Until then I'll tag @RdoubleA and @joecummings in case either of them gets a minute to take a look |
Force-pushed 03f782c to 4478809
Hi @ebsmothers, when you have a moment, could you take a quick look at the recent changes I made? Your feedback would be greatly appreciated. Thank you!
Thanks @noemotiovon for your patience! I left a handful more comments, please let me know if anything is unclear
Thank you for your review! Your feedback is very clear, and I will make the necessary code changes as soon as possible based on your suggestions.
Hi @ebsmothers, I’ve made the code changes based on your suggestions; could you please review them again?
Best regards
Thanks @noemotiovon for the updates! I left a couple more comments but I think this is pretty close now. It looks like a unit test is failing in CI though, can you take a look? Happy to provide any debugging pointers if you need
Hi @ebsmothers, Best regards
Force-pushed c950861 to 073c3d2
Hi @noemotiovon looks like CI is green now but there are merge conflicts. Can you pull from latest main and merge with your changes? |
Force-pushed 2a6c63b to 1485314
Hi @ebsmothers, I’d be happy to! I’ve resolved the conflicts and made the necessary adjustments. Could you please help merge it? Thank you for your continued support, and wishing you all the best in work and life! 😊
What does this PR do?
Overview
🚀 This PR enables users of torchtune to leverage the Ascend NPU for better performance in inference. It primarily addresses the initial refactoring of device-independent code. In upcoming changes, we’ll focus on further adjustments, using the NPU as an example to refine each recipe and complete the remaining device-independent modifications. For now, this PR only touches the lora_finetune_single_device and full_finetune_single_device recipes.
For more details, see: [#1797].
Environment
Note
To properly install CANN, see [here] for more details.
The version of torch-npu should match that of torch; see [here] for more details. In addition, torch_npu has a pre-release version, 2.4.0 RC1, which is also the basis for this test. For more information, please visit [here].
Examples
To start, the torch_npu library should be correctly installed and imported. Part of the code (torchtune/utils/_device_support.py) is shown below:
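The original snippet is not reproduced in this page, so here is a minimal sketch of the kind of helper such a device-support module might contain. The function names and the fallback order are illustrative assumptions, not the PR's actual API.

```python
# Hedged sketch of a device-support helper; names are illustrative assumptions.
import importlib.util


def is_npu_available() -> bool:
    """True if torch_npu can be imported and reports a visible Ascend NPU."""
    if importlib.util.find_spec("torch_npu") is None:
        return False
    import torch
    import torch_npu  # noqa: F401  -- importing registers the "npu" backend with torch
    return hasattr(torch, "npu") and torch.npu.is_available()


def default_device_type() -> str:
    """Prefer CUDA, then NPU, then CPU, without hard-requiring any backend."""
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    if is_npu_available():
        return "npu"
    return "cpu"
```

A recipe could then call `default_device_type()` once at startup instead of hard-coding `"cuda"`, which is the spirit of the device-independent refactoring this PR describes.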
In addition, a few other places in the code may need adjustment, though not many.
Feel free to leave comments to guide me in further improvements 😊.