
Enable multi-gpu training when "torch" is chosen as the RETURNN backend #445

Merged: 17 commits from jing-torch-multi-gpu-training into main on Oct 26, 2023

Conversation

Judyxujj (Contributor)

Enable launching torch DDP training using torchrun.
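
For context, torchrun is PyTorch's distributed launcher: it spawns one trainer process per GPU and sets the rendezvous environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) that torch.distributed reads. A minimal sketch of what the launched command could look like for a single node, written as the argument list a training job might assemble; the GPU count and the RETURNN call shown are illustrative, not taken from this PR:

```python
# Sketch: single-node torchrun launch of RETURNN, one DDP worker per GPU.
# "rnn.py returnn.config" stands in for the actual RETURNN call.
num_gpus = 4  # illustrative

cmd = [
    "torchrun",
    "--standalone",                  # local rendezvous, no external store needed
    "--nnodes=1",
    f"--nproc-per-node={num_gpus}",  # one process per GPU
    "rnn.py",
    "returnn.config",
]
```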

@albertz (Member) commented Aug 20, 2023

I'm curious: What happens if you just use mpirun? I would have expected that to work as well, since torchrun internally just does the same thing. Is that not the case?

@albertz (Member) commented Aug 20, 2023

I'm not sure it is a good idea to always do this for the PyTorch backend. Maybe we should introduce a separate option for this, to make it explicit? There might be valid cases for using mpirun with PyTorch.
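
A sketch of what such an explicit switch could look like; the helper and its `launcher` parameter are hypothetical, not the merged API, while the flags themselves are real torchrun / OpenMPI options:

```python
def make_launch_prefix(launcher: str, num_processes: int) -> list[str]:
    """Build the process-launcher prefix for the RETURNN call.

    Hypothetical helper: illustrates an explicit mpirun/torchrun option.
    """
    if launcher == "torchrun":
        return ["torchrun", "--standalone", "--nnodes=1",
                f"--nproc-per-node={num_processes}"]
    if launcher == "mpirun":
        return ["mpirun", "-np", str(num_processes), "--bind-to", "none"]
    raise ValueError(f"unknown launcher {launcher!r}")
```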

@Judyxujj (Contributor, Author)

> I'm curious: What happens if you just use mpirun? I would have expected that to work as well, since torchrun internally just does the same thing. Is that not the case?

I tried to use mpirun to launch the torch distributed data parallel (DDP) training, but it fails with ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set.
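
The error is consistent with how the two launchers differ: torchrun exports RANK, WORLD_SIZE, etc. for each worker, while OpenMPI's mpirun exports its own OMPI_COMM_WORLD_* variables instead, so torch.distributed's default env:// rendezvous finds nothing. A hedged sketch of the usual translation shim (not code from this PR; it assumes MASTER_ADDR and MASTER_PORT are exported by the job script):

```python
import os
import torch.distributed as dist

# Map OpenMPI's variables onto the names the env:// rendezvous expects.
# Assumes MASTER_ADDR and MASTER_PORT are already set out of band.
if "OMPI_COMM_WORLD_RANK" in os.environ and "RANK" not in os.environ:
    os.environ["RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]
    os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
    os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]

dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE from the env
```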

@albertz (Member) commented Aug 22, 2023

> environment variable RANK expected, but not set.

It sounds like the env var RANK is expected but not set. Did you try setting it? Maybe you need to add it to DEFAULT_ENVIRONMENT_KEEP in the Sisyphus settings, or try the Sisyphus setting CLEANUP_ENVIRONMENT = False.
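
Both knobs are Sisyphus global settings that live in settings.py. A sketch of keeping the rendezvous variables; the default contents of DEFAULT_ENVIRONMENT_KEEP shown here are only an illustrative subset:

```python
# settings.py (Sisyphus global settings), sketch only

# Either extend the set of environment variables Sisyphus keeps ...
DEFAULT_ENVIRONMENT_KEEP = {
    "HOME", "USER", "TMPDIR",            # illustrative subset of the defaults
    "RANK", "LOCAL_RANK", "WORLD_SIZE",  # torch.distributed rendezvous variables
    "MASTER_ADDR", "MASTER_PORT",
}

# ... or disable the environment cleanup entirely.
CLEANUP_ENVIRONMENT = False
```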

@Judyxujj Judyxujj marked this pull request as draft August 24, 2023 12:08
Judyxujj and others added 2 commits August 24, 2023 18:49
@Judyxujj Judyxujj closed this Aug 24, 2023
@Judyxujj Judyxujj reopened this Aug 24, 2023
@Judyxujj Judyxujj marked this pull request as ready for review August 24, 2023 17:07
@Judyxujj Judyxujj requested a review from JackTemaki October 6, 2023 10:08
Judyxujj and others added 2 commits October 19, 2023 11:32
Judyxujj and others added 6 commits October 19, 2023 11:37
@christophmluscher (Contributor)

@JackTemaki friendly ping :)

@curufinwe curufinwe dismissed JackTemaki’s stale review October 26, 2023 09:40

"Nick not in the topic"

@curufinwe curufinwe merged commit 2dbfb8c into main Oct 26, 2023
4 checks passed
@curufinwe curufinwe deleted the jing-torch-multi-gpu-training branch October 26, 2023 09:40