
Enable multi-gpu training when "torch" is chosen as the RETURNN backend #445

Merged: 17 commits from jing-torch-multi-gpu-training into main on Oct 26, 2023

Conversation

Judyxujj (Contributor)

Enable launching torch DDP training using torchrun.
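
For context, torchrun is PyTorch's distributed launcher: it spawns one trainer process per GPU and sets the rendezvous environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) that torch.distributed reads. A minimal sketch of what the launched command could look like for a single node, written as the argument list a training job might assemble; the GPU count and the RETURNN call shown are illustrative, not taken from this PR:

```python
# Sketch: single-node torchrun launch of RETURNN, one DDP worker per GPU.
# "rnn.py returnn.config" stands in for the actual RETURNN call.
num_gpus = 4  # illustrative

cmd = [
    "torchrun",
    "--standalone",                  # local rendezvous, no external store needed
    "--nnodes=1",
    f"--nproc-per-node={num_gpus}",  # one process per GPU
    "rnn.py",
    "returnn.config",
]
```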

@albertz (Member) commented Aug 20, 2023

I'm curious: What happens if you just use mpirun? I would have expected that to work as well, since torchrun internally just does the same thing. Is that not the case?

@albertz (Member) commented Aug 20, 2023

I'm not sure it is a good idea to always do this for the PyTorch backend. Maybe we should introduce a separate option for this, to make it explicit? There might be valid cases for using mpirun with PyTorch.
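
A sketch of what such an explicit switch could look like; the helper and its `launcher` parameter are hypothetical, not the merged API, while the flags themselves are real torchrun / OpenMPI options:

```python
def make_launch_prefix(launcher: str, num_processes: int) -> list[str]:
    """Build the process-launcher prefix for the RETURNN call.

    Hypothetical helper: illustrates an explicit mpirun/torchrun option.
    """
    if launcher == "torchrun":
        return ["torchrun", "--standalone", "--nnodes=1",
                f"--nproc-per-node={num_processes}"]
    if launcher == "mpirun":
        return ["mpirun", "-np", str(num_processes), "--bind-to", "none"]
    raise ValueError(f"unknown launcher {launcher!r}")
```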

@Judyxujj (Contributor, Author)

> I'm curious: What happens if you just use mpirun? I would have expected that to work as well, since torchrun internally just does the same thing. Is that not the case?

I tried to use mpirun to launch the torch distributed data parallel (DDP) training, but it fails with ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set.
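
The error is consistent with how the two launchers differ: torchrun exports RANK, WORLD_SIZE, etc. for each worker, while OpenMPI's mpirun exports its own OMPI_COMM_WORLD_* variables instead, so torch.distributed's default env:// rendezvous finds nothing. A hedged sketch of the usual translation shim (not code from this PR; it assumes MASTER_ADDR and MASTER_PORT are exported by the job script):

```python
import os
import torch.distributed as dist

# Map OpenMPI's variables onto the names the env:// rendezvous expects.
# Assumes MASTER_ADDR and MASTER_PORT are already set out of band.
if "OMPI_COMM_WORLD_RANK" in os.environ and "RANK" not in os.environ:
    os.environ["RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]
    os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
    os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]

dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE from the env
```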

@albertz (Member) commented Aug 22, 2023

> environment variable RANK expected, but not set.

It sounds like the env var RANK is expected but not set. Did you try setting it? Maybe you need to add it to DEFAULT_ENVIRONMENT_KEEP in the Sisyphus settings, or try the Sisyphus setting CLEANUP_ENVIRONMENT = False.
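
Both knobs are Sisyphus global settings that live in settings.py. A sketch of keeping the rendezvous variables; the default contents of DEFAULT_ENVIRONMENT_KEEP shown here are only an illustrative subset:

```python
# settings.py (Sisyphus global settings), sketch only

# Either extend the set of environment variables Sisyphus keeps ...
DEFAULT_ENVIRONMENT_KEEP = {
    "HOME", "USER", "TMPDIR",            # illustrative subset of the defaults
    "RANK", "LOCAL_RANK", "WORLD_SIZE",  # torch.distributed rendezvous variables
    "MASTER_ADDR", "MASTER_PORT",
}

# ... or disable the environment cleanup entirely.
CLEANUP_ENVIRONMENT = False
```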

@Judyxujj Judyxujj marked this pull request as draft August 24, 2023 12:08
Judyxujj and others added 2 commits August 24, 2023 18:49
@Judyxujj Judyxujj closed this Aug 24, 2023
@Judyxujj Judyxujj reopened this Aug 24, 2023
@Judyxujj Judyxujj marked this pull request as ready for review August 24, 2023 17:07
@Judyxujj Judyxujj requested a review from JackTemaki October 6, 2023 10:08
Judyxujj and others added 2 commits October 19, 2023 11:32
Judyxujj and others added 6 commits October 19, 2023 11:37
@christophmluscher (Contributor)

@JackTemaki friendly ping :)

@curufinwe curufinwe dismissed JackTemaki’s stale review October 26, 2023 09:40

"Nick not in the topic"

@curufinwe curufinwe merged commit 2dbfb8c into main Oct 26, 2023
4 checks passed
@curufinwe curufinwe deleted the jing-torch-multi-gpu-training branch October 26, 2023 09:40