Enable multi-gpu training when "torch" is chosen as the RETURNN backend #445
Conversation
I'm curious: what happens if you just use …
I'm not sure it is a good idea to always do this for the PyTorch backend. Maybe we should introduce a separate option for this, to make it explicit? There might be valid cases to use …
I tried using mpirun to launch the torch distributed data parallel (DDP) training, but it gives: ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set.
It sounds like the env var …
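For context, a minimal sketch of what a plain mpirun launch would need: mapping the MPI-provided environment to the variables PyTorch's env:// rendezvous expects. The OMPI_COMM_WORLD_* names assume Open MPI and the MASTER_ADDR/MASTER_PORT values are placeholders, none of this is taken from the PR itself.

```python
# Hypothetical sketch: make an mpirun launch satisfy torch.distributed's
# env:// rendezvous by mapping Open MPI env vars to the ones PyTorch expects.
# (OMPI_COMM_WORLD_* names assume Open MPI; other launchers use other names.)
import os

import torch.distributed as dist

os.environ.setdefault("RANK", os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
os.environ.setdefault("LOCAL_RANK", os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
os.environ.setdefault("MASTER_ADDR", "localhost")  # rank-0 host in a multi-node setup
os.environ.setdefault("MASTER_PORT", "29500")      # any free port

dist.init_process_group(backend="nccl", init_method="env://")
```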
Co-authored-by: michelwi <[email protected]>
Co-authored-by: Albert Zeyer <[email protected]>
Co-authored-by: Eugen Beck <[email protected]>
@JackTemaki friendly ping :)
Enable launching the torch DDP training using torchrun
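For comparison, a minimal sketch of the torchrun path (the script name and argument values are placeholders, not taken from this PR): torchrun spawns one process per GPU and itself exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so the training script only has to consume them.

```python
# Sketch of a training entry point meant to be launched with torchrun, e.g.:
#   torchrun --nproc_per_node=4 train.py   # script name / GPU count are placeholders
# torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
# so the env:// rendezvous works without any manual environment setup.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")  # init_method defaults to env://
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(8, 8).cuda(local_rank)  # stand-in for the real model
model = DDP(model, device_ids=[local_rank])
```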