Multi-GPU Training #96
Hi, I have the same problem here. In my case, during the training stage, there is an error saying that it expected all tensors to be on the same device.
Hello, I added some information on the multi-GPU setup in the README. One thing to note is that accelerate does not use all GPUs optimally: it slices the model across the GPUs, which allows training models that don't fit on one GPU. However, at any given point during training, only one GPU is in use at a time.
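A rough, purely illustrative sketch (not accelerate's actual implementation) of why a sliced model keeps only one GPU busy: consecutive layers are assigned to consecutive devices, and a single forward pass visits the devices strictly in order, so they never overlap.

```python
def slice_layers(num_layers, num_gpus):
    """Assign consecutive layer indices to GPU ids (device_map-style slicing)."""
    per_gpu = -(-num_layers // num_gpus)  # ceiling division
    return {layer: layer // per_gpu for layer in range(num_layers)}

def devices_visited(num_layers, num_gpus):
    """Order in which devices become active during one forward pass."""
    mapping = slice_layers(num_layers, num_gpus)
    order = []
    for layer in range(num_layers):
        dev = mapping[layer]
        if not order or order[-1] != dev:
            order.append(dev)
    return order  # strictly sequential: one device active at a time
```

For example, `devices_visited(8, 4)` yields `[0, 1, 2, 3]`: each forward pass sweeps through the GPUs one after another, which matches the observation that only one GPU shows activity at any moment.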
I still got the error saying that it expected all tensors to be on the same device.
Thanks for the reply.
Hi all, I was trying to fit the model onto two 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit. However, I got the error message "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices." It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Has anyone met the same issue?
This is kind of puzzling to me regarding how accelerate is being used with qlora. If it distributes different layers to different GPUs, then I would expect to be able to pipeline training data so that all the GPUs work at the same time on different stages of different batches. Is this a design choice that needs to be revisited? I have the training process running now on 4 Titan X cards, and yes, I can see that only one is doing something at a time. It means that a batch has to go through the entire pipeline before a new batch is fed in. That makes no sense to me: it runs four times slower than it could with respect to throughput.
I am getting the same error. Any idea how to resolve the issue?
How can this be solved?
Try setting `ddp_find_unused_parameters=False`.
Same error here.
Same error. @artidoro, can you help us solve this problem? Thanks.
I have solved the problem: set `ddp_find_unused_parameters=False`. You can learn more from this code: https://github.com/yangjianxin1/Firefly/blob/master/train_qlora.py#L104
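As a minimal sketch of where the fix goes (every keyword except `ddp_find_unused_parameters` is an illustrative placeholder, not taken from this repo's actual arguments), the flag is simply passed through to the Hugging Face `TrainingArguments` that the trainer is built from:

```python
# Illustrative kwargs for transformers.TrainingArguments. With gradient
# checkpointing enabled under DDP, some parameters look "unused" to DDP's
# bookkeeping during backward, which triggers the error unless the
# unused-parameter search is disabled.
training_kwargs = dict(
    output_dir="./output",               # placeholder path
    per_device_train_batch_size=1,       # placeholder value
    gradient_accumulation_steps=16,      # placeholder value
    gradient_checkpointing=True,
    ddp_find_unused_parameters=False,    # the fix discussed in this thread
)
# args = transformers.TrainingArguments(**training_kwargs)
```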
Repost from this thread: "Multi-gpu training example?" Finally, it works.
*change qlora.py
Tested in a Runpod environment with Python 3.10 and Torch 2.0.0+cu117.
When gradient_checkpointing is True, it is a little slow, but it spreads VRAM usage across all GPUs (for example, roughly the 20 GB a single-GPU run needs).
When gradient_checkpointing is False, it is faster, getting about 10 seconds/iter, but it consumes more GPU VRAM: about 20 GB per GPU, multiplied by the number of GPUs.
Compared to those, the vanilla (original) script got 55 seconds/iter, so it is very slow next to the previous methods.
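The trade-off described above can be illustrated with a toy cost model (arbitrary units, not measurements from this thread): gradient checkpointing stores only a subset of activations and re-runs forward segments during backward, so it uses less memory per step at the price of extra compute.

```python
import math

def training_cost(num_layers, checkpointing):
    """Toy per-step cost: returns (activation_memory_units, forward_compute_units)."""
    if checkpointing:
        # Keep only sqrt(n) segment boundaries; recompute inside each
        # segment during the backward pass (the classic sqrt(n) scheme).
        segments = max(math.isqrt(num_layers), 1)
        memory = segments + num_layers // segments  # boundaries + one segment
        compute = 2 * num_layers                    # forward + recomputed forward
    else:
        memory = num_layers   # keep every activation
        compute = num_layers  # single forward pass
    return memory, compute
```

In this model, checkpointing roughly doubles forward compute while shrinking activation memory, which matches the "slower but less VRAM" behavior reported above.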
@ichsan2895, how exactly does this parallelization work? Does each GPU compute a different part of the process? I understand that each GPU's memory is independent of the rest for that execution, right (it is not the sum of all the memories)?
It seems to be parallel, since I saw this code: Lines 341 to 342 in 7f4e95a.
But I'm sorry, I don't understand the details of how the parallelization works. My focus was only on making the code work with multiple GPUs 👍
@ichsan2895 If we have 4 GPUs in our machine, can we use only 3 of them for data parallelism?
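One common way to restrict a run to 3 of 4 GPUs (a general CUDA convention, not something specific to qlora.py) is to hide the fourth device via `CUDA_VISIBLE_DEVICES` before any CUDA library initializes; frameworks then see only three GPUs, numbered 0-2.

```python
import os

# Must be set before torch / accelerate first touch CUDA, e.g. at the very
# top of the script or in the shell: CUDA_VISIBLE_DEVICES=0,1,2 python qlora.py
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"  # hide physical GPU 3
```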
Directly running qlora.py on a machine with multiple GPUs will load the model onto multiple GPUs, but the training process is conducted on a single GPU only: the effective training batch size equals per_device_train_batch_size.
Has anyone succeeded with multi-GPU training?