Can't run with MPI #4

Open · FlyingWolFox opened this issue Nov 7, 2023 · 0 comments

@FlyingWolFox

Hello. I'm trying to use MPI to speed up the pre-training, but the program crashes when syncing grads.

Running without MPI (or with a single process) works fine, but with more than one process (`mpirun -np 2 python run.py --config-name skimo_maze run_prefix=test2 gpu=0 wandb=true`) I get this traceback:

Error executing job with overrides: ['run_prefix=test2', 'gpu=0', 'wandb=true']
Traceback (most recent call last):
  File "/home/flyingwolfox/tcc-src-2/skimo/run.py", line 39, in main
    SkillRLRun(cfg).run()
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/main.py", line 56, in run
    trainer.train()
  File "/home/flyingwolfox/tcc-src-2/skimo/skill_trainer.py", line 41, in train
    self._pretrain()
  File "/home/flyingwolfox/tcc-src-2/skimo/skill_trainer.py", line 76, in _pretrain
    _train_info = self._agent.pretrain()
  File "/home/flyingwolfox/tcc-src-2/skimo/skimo_agent.py", line 713, in pretrain
    _train_info = self._pretrain(batch)
  File "/home/flyingwolfox/tcc-src-2/skimo/skimo_agent.py", line 847, in _pretrain
    joint_grad_norm = self.joint_optim.step(hl_loss + ll_loss)
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/utils/pytorch.py", line 466, in step
    sync_grad(self._model, self._device)
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/utils/pytorch.py", line 152, in sync_grad
    flat_grads, grads_shape = _get_flat_grads(network)
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/utils/pytorch.py", line 175, in _get_flat_grads
    for key_name, value in network.named_parameters():
AttributeError: 'list' object has no attribute 'named_parameters'
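
The crash itself doesn't look like an MPI installation problem, since it happens while flattening the grads, before any collective call. For completeness, a tiny standalone script like the one below can check the MPI side independently of the training code (this assumes the rolf sync utilities sit on top of mpi4py; the file name mpi_check.py is just mine):

```python
# mpi_check.py -- standalone sanity check that MPI works across ranks.
# Assumes mpi4py is installed; launch with: mpirun -np 2 python mpi_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Sum the rank ids across all processes; every rank should print the same total.
total = comm.allreduce(rank, op=MPI.SUM)
print(f"rank {rank}/{size}: allreduce sum of ranks = {total}")
```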

I tried converting the network list into a torch.nn.Sequential (right before the _get_flat_grads() call), but that didn't work either: getting the grads of the reward and critic modules fails (https://pastebin.com/Wu7fp0sP). The sketch below shows the direction I was experimenting with.
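
To be concrete: a small wrapper that accepts either a single module or a list of modules, and skips modules whose parameters have no gradients yet (which seems to be the case for the reward and critic heads during pre-training). This is only a sketch against the signatures visible in the traceback; `sync_grad_any` is my own name and the import path is a guess:

```python
# Sketch only: sync_grad lives in rolf/rolf/utils/pytorch.py per the traceback,
# but the exact import path here is an assumption.
from rolf.utils.pytorch import sync_grad


def sync_grad_any(model, device):
    """Sync gradients for a single nn.Module or a list of modules.

    sync_grad() expects something with .named_parameters(), so a list is
    handled by syncing each module separately.
    """
    modules = model if isinstance(model, (list, tuple)) else [model]
    for module in modules:
        # Skip modules with no gradients at all (e.g. heads not touched by this
        # loss); flattening their grads is what broke my nn.Sequential attempt.
        if all(p.grad is None for p in module.parameters()):
            continue
        sync_grad(module, device)
```

Calling something like this from step() in rolf/rolf/utils/pytorch.py (the sync_grad(self._model, self._device) call at line 466) is what I had in mind, but I'm not sure it matches how you intended MPI syncing to work, hence the question below.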

Is it possible to run with MPI? If so, how can I do it? Thanks!
