
Multi-GPU Training #96

Open · wcy1122 opened this issue May 31, 2023 · 17 comments
@wcy1122 commented May 31, 2023

Directly running qlora.py on a machine with multiple GPUs loads the model onto multiple GPUs, but the training process runs on a single GPU only, and the training batch size equals per_device_train_batch_size.
Has anyone succeeded with multi-GPU training?

@lukaswangbk

Hi, I have the same problem. In my case, during the training stage, there is an error saying that it expected all tensors to be on the same device.

@artidoro (Owner) commented Jun 1, 2023

Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike what the name suggests. Let me know if you still have questions.

One thing to note is that Accelerate does not use all GPUs optimally. It slices the model across different GPUs, allowing training of models that don't fit on one GPU. However, at any given point during training, only one GPU will be in use.
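For illustration, a minimal sketch of this kind of sliced loading (not the exact qlora.py code; the checkpoint name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# device_map="auto" partitions the layers across all visible GPUs so a
# model that is too large for one GPU can still be loaded. A forward pass
# then traverses the layers in order, which is why only one GPU computes
# at any given moment (naive model parallelism, not data parallelism).
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # example checkpoint
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
print(model.hf_device_map)  # shows which layers landed on which device
```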

@lukaswangbk

Still got the error saying that it expected all tensors to be on the same device: cuda:0, cuda:1.

@wcy1122 (Author) commented Jun 1, 2023

> Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike what the name suggests. Let me know if you still have questions.
>
> One thing to note is that Accelerate does not use all GPUs optimally. It slices the model across different GPUs, allowing training of models that don't fit on one GPU. However, at any given point during training, only one GPU will be in use.

Thanks for the reply.
On the GPU memory side, naively multiplying per_device_train_batch_size by the number of GPUs works for me. It looks like Accelerate spreads the tensors generated during LoRA fine-tuning across all GPUs equally, the same as the pre-trained model.
On the training speed side, it looks like Accelerate runs multi-GPU training sequentially, which makes training slow.

@dayuyang1999

Hi all,

I was trying to fit the model onto two 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but I got the error message "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.".

It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Has anyone met the same issue?

@phalexo commented Jun 2, 2023

> Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike what the name suggests. Let me know if you still have questions.
>
> One thing to note is that Accelerate does not use all GPUs optimally. It slices the model across different GPUs, allowing training of models that don't fit on one GPU. However, at any given point during training, only one GPU will be in use.

This is kind of puzzling to me, regarding how Accelerate is being used with QLoRA. If it distributes different layers to different GPUs, then I would expect to be able to pipeline the training data so that all the GPUs work at the same time on different stages of different batches.

Is this a design choice that needs to be revisited? I have the training process running now on 4 Titan X cards and, yes, I can see that only one is doing something at a time. It means that a batch has to go through the entire pipeline before the next batch is fed in.

It makes no sense to me. It runs 4 times slower than it could with respect to throughput.

@MrigankRaman

> Still got the error saying that it expected all tensors to be on the same device: cuda:0, cuda:1.

I am getting the same. Any idea how to resolve the issue?

@kevinuserdd

> It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Has anyone met the same issue?

How can this be solved?

@znsoftm commented Jun 5, 2023

Set ddp_find_unused_parameters=False.
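For reference, a sketch of where that flag lives, assuming the standard Hugging Face TrainingArguments that the Trainer consumes (the other values here are placeholders):

```python
from transformers import TrainingArguments

# With (Q)LoRA most base-model parameters are frozen and never receive
# gradients; DDP's unused-parameter search can then error out or stall,
# so it is disabled explicitly.
training_args = TrainingArguments(
    output_dir="./output",          # placeholder
    per_device_train_batch_size=1,  # placeholder
    ddp_find_unused_parameters=False,
)
```

It can also be passed on the command line as --ddp_find_unused_parameters=False, as in the launch commands later in this thread.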

@zyxyxz commented Jun 5, 2023

> Hi all,
>
> I was trying to fit the model onto two 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but I got the error message "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.".
>
> It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Has anyone met the same issue?

same error

@fan-niu commented Jun 6, 2023

> Hi all,
>
> I was trying to fit the model onto two 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but I got the error message "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.".
>
> It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Has anyone met the same issue?

Same error. @artidoro, can you help us solve this problem? Thanks.

@yangjianxin1

I have solved the problem: set ddp_find_unused_parameters=False.

You can learn more from the code: https://github.com/yangjianxin1/Firefly/blob/master/train_qlora.py#L104

@oolongoo

I have 4 GPUs; why are only one or two of them computing at any given time?

@ichsan2895 commented Aug 22, 2023

Repost from this thread: Multi-gpu training example?

Finally, it works.
Now it utilizes all GPUs.

```
!pip install bitsandbytes==0.41.1
!pip install transformers==4.31.0
!pip install peft==0.4.0
!pip install accelerate==0.21.0 einops==0.6.1 evaluate==0.4.0 scikit-learn==1.2.2 sentencepiece==0.1.99
```

Change in qlora.py:

```python
# before
device_map = 'auto'
# after
device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))}
```
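With this mapping, each process launched by accelerate pins a full copy of the model to its own GPU (indexed by LOCAL_RANK), so the GPUs can work on different batches in parallel instead of the model being sliced across devices.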

```
!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=True --ddp_find_unused_parameters=False
```

Tested in a Runpod environment with Python 3.10 and Torch 2.0.0+cu117.

When gradient_checkpointing is True, training is a little slower, but VRAM usage is spread across all GPUs.

For example, if one GPU needs 20 GB of VRAM:
with two GPUs it needs 20/2 = 10 GB per GPU,
with three GPUs it needs 20/3 = 6.67 GB per GPU.

Got 15 seconds/iter.

Compared to:

```
!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=False
```

When gradient_checkpointing is False, training is faster, but it consumes more GPU VRAM.

For example, if one GPU needs 20 GB of VRAM:
with two GPUs it needs 20×2 = 40 GB total,
with three GPUs it needs 20×3 = 60 GB total.

Got 10 seconds/iter, but GPU memory consumption is multiplied by the number of GPUs.

Compared to the vanilla (original) command:

```
!python3.10 qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True
```

Got 55 seconds/iter, so it is very slow compared to the previous methods.

@jsancs commented Aug 23, 2023

@ichsan2895 how exactly does this parallelization work? Does each of the GPUs compute a different part of the process? I understand that the memory of each GPU is independent of the rest for that execution, right (it is not the sum of all the memories)?

@ichsan2895 commented Aug 24, 2023

> @ichsan2895 how exactly does this parallelization work? Does each of the GPUs compute a different part of the process? I understand that the memory of each GPU is independent of the rest for that execution, right (it is not the sum of all the memories)?

It seems to be parallel, since I saw this code:

qlora/qlora.py, lines 341 to 342 (commit 7f4e95a):

```python
setattr(model, 'model_parallel', True)
setattr(model, 'is_parallelizable', True)
```

But I'm sorry, I don't understand the details of how the parallelization works. My focus was only on making the code work with multiple GPUs 👍
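For intuition, a rough sketch (simplified, hypothetical) of what the modified device_map amounts to under accelerate launch:

```python
import os

# `accelerate launch` spawns one process per GPU and sets LOCAL_RANK for
# each. Every process loads a full model replica onto its own GPU, and
# the HF Trainer wraps it in DistributedDataParallel, so each GPU trains
# on a different slice of every global batch (data parallelism, not
# model slicing).
local_rank = int(os.environ.get("LOCAL_RANK") or 0)  # set by the launcher
device_map = {"": f"cuda:{local_rank}"}  # whole model on this rank's GPU
```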

@trannhatquy

@ichsan2895 Can we use 3 GPUs for data parallelism if we have 4 GPUs in our machine?
