
feat: move to accelerate launch for distributed training #92

Merged (13 commits) on Mar 18, 2024

Conversation

@kmehant (Collaborator) commented Mar 13, 2024

Description of the change

Related issue number

Closes #87

How to verify the PR

I have launched a multi-GPU training run using FSDP and accelerate:

    accelerate launch \
        --main_process_ip $MASTER_ADDR \
        --main_process_port $MASTER_PORT \
        --config_file config/accelerate_fsdp_config.yaml \
        tuning/sft_trainer.py \
        --peft_method none \
        --tokenizer_name_or_path $MODEL_PATH \
        --model_name_or_path ${MODEL_PATH} \
        --data_path ${DATA_PATH} \
        --output_dir ${OUTPUT_PATH} \
        --num_train_epochs 5 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 4 \
        --evaluation_strategy "no" \
        --save_strategy "epoch" \
        --learning_rate "1e-5" \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --include_tokens_per_second \
        --packing False \
        --response_template "\n### Label:" \
        --dataset_text_field "output" \
        --use_flash_attn False \
        --torch_dtype bfloat16

$MASTER_ADDR and $MASTER_PORT are set by the job launcher (the PyTorchJob launcher from the Kubeflow operator). I used twitter_complaints.json as the training data and Llama 7B as the base model; the rest of the training arguments are shown in the command above.
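As background on the last few flags (`--response_template`, `--dataset_text_field`, `--packing`): they correspond to TRL's completion-only fine-tuning setup. The sketch below is illustrative only and is not the code in tuning/sft_trainer.py; the model id and the JSON layout are assumptions based on the values used above.

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM, SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"  # assumed; the launch command passes $MODEL_PATH
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Compute the loss only on tokens after the response template,
# mirroring --response_template "\n### Label:".
collator = DataCollatorForCompletionOnlyLM(
    response_template="\n### Label:", tokenizer=tokenizer
)

# Assumes each record's "output" field holds the formatted prompt plus the label text.
train_dataset = load_dataset("json", data_files="twitter_complaints.json", split="train")

trainer = SFTTrainer(
    model=model_id,               # SFTTrainer also accepts an instantiated model
    train_dataset=train_dataset,
    dataset_text_field="output",  # mirrors --dataset_text_field "output"
    packing=False,                # completion-only masking requires packing=False
    data_collator=collator,
    tokenizer=tokenizer,
)
trainer.train()
```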

I have used the accelerate config that is added by this PR (config/accelerate_fsdp_llama_2_procs.yaml).

Some screenshots from the training run.

GPU and memory utilization (screenshots)

Training loss curve (screenshot)

Concerns

Saving model

Reference: https://huggingface.co/blog/ram-efficient-pytorch-fsdp

Set the full state dict type before saving the model:

    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
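For context, a minimal sketch of where that call would sit relative to saving; the names are placeholders and this follows the pattern from the blog linked above rather than the actual code in tuning/sft_trainer.py:

```python
def save_full_model(trainer, output_dir: str) -> None:
    """Save an FSDP-trained model as a single, unsharded checkpoint.

    Sketch only: `trainer` and `output_dir` are placeholder names for the
    trainer instance and save path created by the tuning script.
    """
    if getattr(trainer, "is_fsdp_enabled", False):
        # Gather shards into a full state dict before writing to disk.
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
    trainer.save_model(output_dir)
```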

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@Ssukriti (Collaborator) left a comment


Thanks for the PR!

Minor suggestions. Unfortunately, we had a duplicate created at the same time: #91.

Fabian had also volunteered earlier to add it, but I thought he was busy, and then you both created PRs at the same time.

To keep things fair, do you mind co-committing so you both get credit? You could rebase the commits or open a new PR.

I reviewed both your PRs and added content here that was different and could be useful.

kmehant and others added 4 commits March 15, 2024 11:31
@kmehant requested a review from Ssukriti on March 15, 2024
@kmehant (Collaborator, Author) commented Mar 15, 2024

@Ssukriti I have addressed your comments and cherry-picked Fabian's commit from his PR into this one (which automatically credits him at commit 4aa0153). Looking forward to your review, thanks.

@kmehant changed the title from "Move to accelerate launch for distributed training" to "feat: Move to accelerate launch for distributed training" on Mar 15, 2024
@kmehant changed the title from "feat: Move to accelerate launch for distributed training" to "feat: move to accelerate launch for distributed training" on Mar 15, 2024
@fabianlim (Collaborator) left a comment


@kmehant @Ssukriti I have some suggestions. I feel the PR can be further streamlined.

@Ssukriti (Collaborator) left a comment


Will approve and merge once Fabian's comments are addressed. I had another small comment on the name of the training_data_path variable.

Regarding the environment variables, I would suggest keeping them as commented examples to avoid confusion about how to set them:

    # Please set the environment variables:
    # MODEL_PATH=llama-7b-hf                   # Hugging Face model id or path to a checkpoint
    # TRAIN_DATA_PATH=twitter_complaints.json  # Path to the training dataset
    # OUTPUT_PATH=out                          # Path to the output folder where the checkpoints are saved

Will wait for other comments from Fabian to be addressed. Thank you!

kmehant and others added 6 commits March 15, 2024 22:43
@kmehant requested reviews from fabianlim and Ssukriti on March 15, 2024
@kmehant (Collaborator, Author) commented Mar 15, 2024

@Ssukriti @fabianlim I have addressed all of your review comments and updated the PR; please review, thank you.

@kmehant (Collaborator, Author) commented Mar 17, 2024

@Ssukriti The PR is ready, thank you.

@Ssukriti (Collaborator) left a comment


thank you!

@Ssukriti merged commit bb787dc into foundation-model-stack:main on Mar 18, 2024
3 checks passed
jbusche pushed a commit to jbusche/fms-hf-tuning that referenced this pull request Mar 25, 2024
…model-stack#92)

* fix: ac=false

Signed-off-by: Mehant Kammakomati <[email protected]>

* feat: add accelerate config

Signed-off-by: Mehant Kammakomati <[email protected]>

* feat: move to accelerate for distributed training launch

Signed-off-by: Mehant Kammakomati <[email protected]>

* update README, replaced .json with accelerate.yaml

Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>

* use master port env variable as is

Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>

* remove details on configuring fsdp unit

Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>

* remove details on multi node setup

Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>

* remove details on the multi node training setup

Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>

* remove details on the multi node training in the example usecase

Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>

* fix: add back env variables and comment them

Signed-off-by: Mehant Kammakomati <[email protected]>

* fix: llama-7b-hf to meta-llama/Llama-2-7b-hf

Signed-off-by: Mehant Kammakomati <[email protected]>

* Apply suggestions from code review

Signed-off-by: Sukriti Sharma <[email protected]>

* Update README.md

Signed-off-by: Sukriti Sharma <[email protected]>

---------

Signed-off-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Sukriti Sharma <[email protected]>
Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Co-authored-by: Sukriti Sharma <[email protected]>
anhuong pushed a commit to anhuong/fms-hf-tuning that referenced this pull request Apr 3, 2024