
Multi-GPU Training #96

Open · wcy1122 opened this issue May 31, 2023 · 17 comments
@wcy1122 commented May 31, 2023

Directly running qlora.py on a machine with multiple GPUs loads the model onto multiple GPUs, but the training process runs on a single GPU only, and the training batch size equals per_device_train_batch_size.
Has anyone succeeded with multi-GPU training?

@lukaswangbk

Hi, I have the same problem. In my case, during the training stage, there is an error saying that it expected all tensors to be on the same device.

@artidoro (Owner) commented Jun 1, 2023

Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike what the name suggests. Let me know if you still have questions.

One thing to note is that Accelerate does not use all GPUs optimally. It slices the model across different GPUs, allowing training of models that don't fit on one GPU. However, at any given point during training, only one GPU will be in use.
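For illustration, a minimal sketch of this kind of sliced loading (not the exact qlora.py code; the checkpoint name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# device_map="auto" partitions the layers across all visible GPUs so a
# model that is too large for one GPU can still be loaded. A forward pass
# then traverses the layers in order, which is why only one GPU computes
# at any given moment (naive model parallelism, not data parallelism).
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # example checkpoint
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
print(model.hf_device_map)  # shows which layers landed on which device
```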

@lukaswangbk

Still got the error saying that it expected all tensors to be on the same device: cuda:0, cuda:1.

@wcy1122 (Author) commented Jun 1, 2023

> Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike what the name suggests. Let me know if you still have questions.
>
> One thing to note is that Accelerate does not use all GPUs optimally. It slices the model across different GPUs, allowing training of models that don't fit on one GPU. However, at any given point during training, only one GPU will be in use.

Thanks for the reply.
On the GPU memory side, naively multiplying per_device_train_batch_size by the number of GPUs works for me. It looks like Accelerate spreads the tensors generated during LoRA fine-tuning across all GPUs equally, the same as the pre-trained model.
On the training speed side, it looks like Accelerate runs multi-GPU training sequentially, which makes training slow.

@dayuyang1999

Hi all,

I was trying to fit the model onto two 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but I got the error message "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.".

It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Has anyone met the same issue?

@phalexo commented Jun 2, 2023

> Hello, I added some information on the multi-GPU setup in the README. In qlora.py we use Accelerate. You are correct that per_device_train/eval_batch_size refers to the global batch size, unlike what the name suggests. Let me know if you still have questions.
>
> One thing to note is that Accelerate does not use all GPUs optimally. It slices the model across different GPUs, allowing training of models that don't fit on one GPU. However, at any given point during training, only one GPU will be in use.

This is kind of puzzling to me, regarding how Accelerate is being used with QLoRA. If it distributes different layers to different GPUs, then I would expect to be able to pipeline the training data so that all the GPUs work at the same time on different stages of different batches.

Is this a design choice that needs to be revisited? I have the training process running now on 4 Titan X cards and, yes, I can see that only one is doing something at a time. It means that a batch has to go through the entire pipeline before the next batch is fed in.

It makes no sense to me. It runs 4 times slower than it could with respect to throughput.

@MrigankRaman

> Still got the error saying that it expected all tensors to be on the same device: cuda:0, cuda:1.

I am getting the same. Any idea how to resolve the issue?

@kevinuserdd

> It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Has anyone met the same issue?

How can this be solved?

@znsoftm commented Jun 5, 2023

Set ddp_find_unused_parameters=False.
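For reference, a sketch of where that flag lives, assuming the standard Hugging Face TrainingArguments that the Trainer consumes (the other values here are placeholders):

```python
from transformers import TrainingArguments

# With (Q)LoRA most base-model parameters are frozen and never receive
# gradients; DDP's unused-parameter search can then error out or stall,
# so it is disabled explicitly.
training_args = TrainingArguments(
    output_dir="./output",          # placeholder
    per_device_train_batch_size=1,  # placeholder
    ddp_find_unused_parameters=False,
)
```

It can also be passed on the command line as --ddp_find_unused_parameters=False, as in the launch commands later in this thread.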

@zyxyxz commented Jun 5, 2023

> Hi all,
>
> I was trying to fit the model onto two 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but I got the error message "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.".
>
> It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Has anyone met the same issue?

same error

@fan-niu commented Jun 6, 2023

> Hi all,
>
> I was trying to fit the model onto two 3090 Ti GPUs (24 GB VRAM each), loading in 4-bit, but I got the error message "ValueError: You can't train a model that has been loaded in 8-bit precision on multiple devices.".
>
> It seems that Accelerate does not allow a 4/8-bit model to be trained on multiple devices. Has anyone met the same issue?

Same error. @artidoro, can you help us solve this problem? Thanks.

@yangjianxin1

I have solved the problem: set ddp_find_unused_parameters=False.

You can learn more from the code: https://github.com/yangjianxin1/Firefly/blob/master/train_qlora.py#L104

@oolongoo

I have 4 GPUs; why are only one or two of them computing at any given time?

@ichsan2895 commented Aug 22, 2023

Repost from this thread: Multi-gpu training example?

Finally, it works.
Now it utilizes all GPUs.

```
!pip install bitsandbytes==0.41.1
!pip install transformers==4.31.0
!pip install peft==0.4.0
!pip install accelerate==0.21.0 einops==0.6.1 evaluate==0.4.0 scikit-learn==1.2.2 sentencepiece==0.1.99
```

Change in qlora.py:

```python
# before
device_map = 'auto'
# after
device_map = {"": "cuda:" + str(int(os.environ.get("LOCAL_RANK") or 0))}
```
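With this mapping, each process launched by accelerate pins a full copy of the model to its own GPU (indexed by LOCAL_RANK), so the GPUs can work on different batches in parallel instead of the model being sliced across devices.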

```
!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=True --ddp_find_unused_parameters=False
```

Tested in a Runpod environment with Python 3.10 and Torch 2.0.0+cu117.

When gradient_checkpointing is True, training is a little slower, but VRAM usage is spread across all GPUs.

For example, if one GPU needs 20 GB of VRAM:
with two GPUs it needs 20/2 = 10 GB per GPU,
with three GPUs it needs 20/3 = 6.67 GB per GPU.

Got 15 seconds/iter.

Compared to:

```
!accelerate launch qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True \
    --gradient_checkpointing=False
```

When gradient_checkpointing is False, training is faster, but it consumes more GPU VRAM.

For example, if one GPU needs 20 GB of VRAM:
with two GPUs it needs 20×2 = 40 GB total,
with three GPUs it needs 20×3 = 60 GB total.

Got 10 seconds/iter, but GPU memory consumption is multiplied by the number of GPUs.

Compared to the vanilla (original) command:

```
!python3.10 qlora.py --model_name_or_path="meta-llama/Llama-2-7b-chat-hf" \
    --dataset="/workspace/your_dataset.csv" \
    --do_eval=True --eval_steps=500 --lr_scheduler_type="cosine" \
    --learning_rate=0.0002 --use_auth_token=True \
    --evaluation_strategy=steps --eval_dataset_size=512 --do_mmlu_eval=True
```

Got 55 seconds/iter, so it is very slow compared to the previous methods.

@jsancs commented Aug 23, 2023

@ichsan2895 how exactly does this parallelization work? Does each of the GPUs compute a different part of the process? I understand that the memory of each GPU is independent of the rest for that execution, right (it is not the sum of all the memories)?

@ichsan2895 commented Aug 24, 2023

> @ichsan2895 how exactly does this parallelization work? Does each of the GPUs compute a different part of the process? I understand that the memory of each GPU is independent of the rest for that execution, right (it is not the sum of all the memories)?

It seems to be parallel, since I saw this code:

qlora/qlora.py, lines 341 to 342 (commit 7f4e95a):

```python
setattr(model, 'model_parallel', True)
setattr(model, 'is_parallelizable', True)
```

But I'm sorry, I don't understand the details of how the parallelization works. My focus was only on making the code work with multiple GPUs 👍
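For intuition, a rough sketch (simplified, hypothetical) of what the modified device_map amounts to under accelerate launch:

```python
import os

# `accelerate launch` spawns one process per GPU and sets LOCAL_RANK for
# each. Every process loads a full model replica onto its own GPU, and
# the HF Trainer wraps it in DistributedDataParallel, so each GPU trains
# on a different slice of every global batch (data parallelism, not
# model slicing).
local_rank = int(os.environ.get("LOCAL_RANK") or 0)  # set by the launcher
device_map = {"": f"cuda:{local_rank}"}  # whole model on this rank's GPU
```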

@trannhatquy

@ichsan2895 Can we use 3 GPUs for data parallelism if we have 4 GPUs in our machine?
