feat: move to accelerate launch for distributed training #92
Conversation
Force-pushed from 9ee99a7 to 338130a
Thanks for the PR!
Minor suggestions. Unfortunately, a duplicate PR was created at the same time: #91.
Fabian had also volunteered earlier to add this, but I thought he was busy, and then you both created PRs at the same time.
To keep things fair, would you mind co-committing so you both get credit? You could rebase the commits or open a new PR.
I reviewed both of your PRs and added content here that was different and could be useful.
Force-pushed from fa26101 to 7bf390a
Force-pushed from 7bf390a to 4aa0153
Will approve and merge once Fabian's comments are addressed. I had another small comment on the name of the training_data_path variable.
On the environment variables, I would suggest keeping them as commented examples to avoid confusion about how to set them:
```
# Please set the environment variables:
# MODEL_PATH=llama-7b-hf                  # Hugging Face model id or path to a checkpoint
# TRAIN_DATA_PATH=twitter_complaints.json # Path to the train dataset
# OUTPUT_PATH=out                         # Path to the output folder where the checkpoints are saved
```
Will wait for other comments from Fabian to be addressed. Thank you!
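For context, a minimal sketch of how a user might set these variables when real values are wanted; the names mirror the commented example above, and how the training command consumes them is an assumption rather than something shown in this thread:

```bash
# Minimal sketch (assumption): export the example variables, then reference them at launch time.
export MODEL_PATH=llama-7b-hf                   # Hugging Face model id or path to a checkpoint
export TRAIN_DATA_PATH=twitter_complaints.json  # Path to the train dataset
export OUTPUT_PATH=out                          # Path to the output folder for checkpoints
```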
@Ssukriti @fabianlim I have collated all of your review comments and updated the PR; please review, thank you.
@Ssukriti The PR is ready, thank you.
thank you!
…model-stack#92)

* fix: ac=false
* feat: add accelerate config
* feat: move to accelerate for distributed training launch
* update README, replaced .json with accelerate.yaml
* use master port env variable as is
* remove details on configuring fsdp unit
* remove details on multi node setup
* remove details on the multi node training setup
* remove details on the multi node training in the example usecase
* fix: add back env variables and comment them
* fix: llama-7b-hf to meta-llama/Llama-2-7b-hf
* Apply suggestions from code review
* Update README.md

Signed-off-by: Mehant Kammakomati <[email protected]>
Signed-off-by: Yu Chin Fabian Lim <[email protected]>
Signed-off-by: Sukriti Sharma <[email protected]>
Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Co-authored-by: Yu Chin Fabian Lim <[email protected]>
Co-authored-by: Sukriti Sharma <[email protected]>
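The commit list above mentions adding an accelerate config, referenced later in this thread as config/accelerate_fsdp_llama_2_procs.yaml. For orientation only, here is a hedged sketch of what a minimal two-process accelerate FSDP config of this kind typically looks like; the actual file added by the PR may use different keys and values:

```yaml
# Hedged sketch of a minimal accelerate FSDP config for 2 processes.
# The real config/accelerate_fsdp_llama_2_procs.yaml added by this PR may differ.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_state_dict_type: FULL_STATE_DICT
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false
```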
Description of the change
Related issue number
Closes #87
How to verify the PR
I have launched a multi-GPU training run using FSDP and accelerate. $MASTER_ADDR and $MASTER_PORT are set by the job launcher (the PyTorchJob launcher from the Kubeflow training operator). I used twitter_complaints.json as the training data and Llama 7B as the base model; the rest of the training arguments used are shown in the command above. I used the accelerate config added by this PR (config/accelerate_fsdp_llama_2_procs.yaml).
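The command referenced above is not included in this excerpt. As a hedged sketch only, a launch of this kind using the config added by the PR would typically look something like the following; the script entry point and tuning flag names are assumptions, not taken from this thread:

```bash
# Hedged sketch: MASTER_ADDR and MASTER_PORT are assumed to be set by the job launcher
# (e.g. the PyTorchJob launcher from the Kubeflow training operator).
# The script path and flag names below are illustrative assumptions.
accelerate launch \
  --config_file config/accelerate_fsdp_llama_2_procs.yaml \
  tuning/sft_trainer.py \
  --model_name_or_path $MODEL_PATH \
  --training_data_path $TRAIN_DATA_PATH \
  --output_dir $OUTPUT_PATH
```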
Screenshots of the training run were attached, showing GPU and memory utilization and the training loss curve.
Concerns
Saving the model: the full state dict type needs to be set before saving the model (ref: https://huggingface.co/blog/ram-efficient-pytorch-fsdp).
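As a hedged illustration of that concern (not the PR's implementation), one common way to handle this with PyTorch FSDP is to switch to a full, rank-0, CPU-offloaded state dict before collecting it for saving; `model` here is a placeholder for the FSDP-wrapped model:

```python
# Hedged sketch, not the PR's code: gather a full state dict before saving under FSDP.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType, FullStateDictConfig

# Collect the sharded parameters into a single full state dict on rank 0,
# offloading to CPU so the gather does not run the GPU out of memory on large models.
save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
    cpu_state_dict = model.state_dict()
# On rank 0, cpu_state_dict can then be passed to the usual save path
# (e.g. torch.save, or a Hugging Face save_pretrained call that accepts a state_dict).
```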
Was the PR tested