Multi-GPU switch from TorchRun to Accelerate #91

fabianlim · 2024-03-13T09:04:03Z

Description of the change

The current codebase recommends torchrun for multi-GPU training. While torchrun is the de facto way to launch distributed jobs, it can be incompatible with Hugging Face trainers and models if used incorrectly. The approach recommended by Hugging Face is to use accelerate (a layer built on top of torchrun) to launch multi-GPU jobs.

Related issue number

Addresses #87. Also partly addresses #80.
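
For context on #80 (titled further down in this thread as "Boolean values are represented as strings in default fsdp config translates to True"), here is a minimal Python sketch of how that kind of pitfall can arise. It is an illustration only, not the repository's code: the key name `example_flag` and the helper `parse_bool` are hypothetical, and the exact code path in the project may differ.

```python
import json

# Hypothetical config fragment: "example_flag" is an illustrative key,
# not one taken from the repository's default FSDP config.
raw = '{"fsdp_config": {"example_flag": "false"}}'
value = json.loads(raw)["fsdp_config"]["example_flag"]

# The value is the *string* "false". Any non-empty string is truthy in
# Python, so a plain truthiness check treats the option as enabled.
naive = bool(value)


def parse_bool(value):
    """Hypothetical helper: map common spellings onto real booleans."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in {"true", "1", "yes"}:
        return True
    if text in {"false", "0", "no"}:
        return False
    raise ValueError(f"not a boolean value: {value!r}")


strict = parse_bool(value)
print(naive, strict)
```

One way to avoid the problem is to store real booleans in the config rather than quoted strings; YAML, for instance, parses an unquoted `false` as a boolean.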

How to verify the PR

Test the run instructions given in the updated README.md

Was the PR tested

Yes, @VassilisVassiliadis and @fabianlim tested the scripts.

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

Ssukriti · 2024-03-15T00:31:00Z

Reviewed this and added any content from here that was missing in the duplicate PR #92. I have asked Mehant to co-commit so that you both get credit. Thank you!

Ssukriti · 2024-03-18T18:26:47Z

Merged as part of the other PR.

fabianlim requested review from anhuong, Ssukriti and alex-jw-brooks as code owners March 13, 2024 09:04

fabianlim force-pushed the flim/accelerate branch 2 times, most recently from 90968af to 7ce6a16 Compare March 13, 2024 09:05

fabianlim mentioned this pull request Mar 13, 2024

bug: Boolean values are represented as strings in default fsdp config translates to True #80

Open

update README, replaced .json with accelerate.yaml

5056095

Signed-off-by: Yu Chin Fabian Lim <[email protected]>

fabianlim force-pushed the flim/accelerate branch from 7ce6a16 to 5056095 Compare March 13, 2024 12:31

Ssukriti mentioned this pull request Mar 15, 2024

feat: move to accelerate launch for distributed training #92

Merged

2 tasks

Ssukriti closed this Mar 18, 2024
