feat: move to accelerate launch for distributed training #92
Conversation
Force-pushed from 9ee99a7 to 338130a
Thanks for the PR!
Minor suggestions. Unfortunately, a duplicate PR was created at the same time: #91.
Fabian had also volunteered earlier to add this, but I thought he was busy, and then you both created PRs at the same time.
To keep things fair, would you mind co-committing so you both get credit? You could rebase the commits or open a new PR.
I reviewed both of your PRs and added content here that was different and could be useful.
Force-pushed from fa26101 to 7bf390a
Force-pushed from 7bf390a to 4aa0153
Will approve and merge once Fabian's comments are addressed. I had another small comment on the name of the training_data_path variable.
On the environment variables, I would suggest keeping them as commented examples to avoid confusion about how to set them:
```
# Please set the environment variables:
# MODEL_PATH=llama-7b-hf                  # Hugging Face model id or path to a checkpoint
# TRAIN_DATA_PATH=twitter_complaints.json # Path to the train dataset
# OUTPUT_PATH=out                         # Path to the output folder where the checkpoints are saved
```
Will wait for other comments from Fabian to be addressed. Thank you!
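For context, a minimal sketch of how a user might set these variables when real values are wanted; the names mirror the commented example above, and how the training command consumes them is an assumption rather than something shown in this thread:

```bash
# Minimal sketch (assumption): export the example variables, then reference them at launch time.
export MODEL_PATH=llama-7b-hf                   # Hugging Face model id or path to a checkpoint
export TRAIN_DATA_PATH=twitter_complaints.json  # Path to the train dataset
export OUTPUT_PATH=out                          # Path to the output folder for checkpoints
```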
@Ssukriti @fabianlim I have collated all of your review comments and updated the PR; please review, thank you.
@Ssukriti The PR is ready, thank you.
thank you!
…model-stack#92)

* fix: ac=false
* feat: add accelerate config
* feat: move to accelerate for distributed training launch
* update README, replaced .json with accelerate.yaml
* use master port env variable as is
* remove details on configuring fsdp unit
* remove details on multi node setup
* remove details on the multi node training setup
* remove details on the multi node training in the example usecase
* fix: add back env variables and comment them
* fix: llama-7b-hf to meta-llama/Llama-2-7b-hf
* Apply suggestions from code review
* Update README.md

Signed-off-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Sukriti Sharma <[email protected]>
Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Co-authored-by: Sukriti Sharma <[email protected]>
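The commit list above mentions adding an accelerate config, referenced later in this thread as config/accelerate_fsdp_llama_2_procs.yaml. For orientation only, here is a hedged sketch of what a minimal two-process accelerate FSDP config of this kind typically looks like; the actual file added by the PR may use different keys and values:

```yaml
# Hedged sketch of a minimal accelerate FSDP config for 2 processes.
# The real config/accelerate_fsdp_llama_2_procs.yaml added by this PR may differ.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_state_dict_type: FULL_STATE_DICT
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false
```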
Description of the change
Related issue number
Closes #87
How to verify the PR
I have launched a multi-GPU training run using FSDP and accelerate. $MASTER_ADDR and $MASTER_PORT are set by the job launcher (the PyTorchJob launcher from the Kubeflow training operator). I used twitter_complaints.json as the training data and Llama 7B as the base model; the rest of the training arguments used are shown in the command above. I used the accelerate config added by this PR (config/accelerate_fsdp_llama_2_procs.yaml).
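The command referenced above is not included in this excerpt. As a hedged sketch only, a launch of this kind using the config added by the PR would typically look something like the following; the script entry point and tuning flag names are assumptions, not taken from this thread:

```bash
# Hedged sketch: MASTER_ADDR and MASTER_PORT are assumed to be set by the job launcher
# (e.g. the PyTorchJob launcher from the Kubeflow training operator).
# The script path and flag names below are illustrative assumptions.
accelerate launch \
  --config_file config/accelerate_fsdp_llama_2_procs.yaml \
  tuning/sft_trainer.py \
  --model_name_or_path $MODEL_PATH \
  --training_data_path $TRAIN_DATA_PATH \
  --output_dir $OUTPUT_PATH
```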
Screenshots of the training run were attached, showing GPU and memory utilization and the training loss curve.
Concerns
Saving the model: the full state dict type needs to be set before saving the model (ref: https://huggingface.co/blog/ram-efficient-pytorch-fsdp).
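As a hedged illustration of that concern (not the PR's implementation), one common way to handle this with PyTorch FSDP is to switch to a full, rank-0, CPU-offloaded state dict before collecting it for saving; `model` here is a placeholder for the FSDP-wrapped model:

```python
# Hedged sketch, not the PR's code: gather a full state dict before saving under FSDP.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType, FullStateDictConfig

# Collect the sharded parameters into a single full state dict on rank 0,
# offloading to CPU so the gather does not run the GPU out of memory on large models.
save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
    cpu_state_dict = model.state_dict()
# On rank 0, cpu_state_dict can then be passed to the usual save path
# (e.g. torch.save, or a Hugging Face save_pretrained call that accepts a state_dict).
```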
Was the PR tested