Can't run with MPI #4

Open · FlyingWolFox opened this issue Nov 7, 2023 · 0 comments

@FlyingWolFox

Hello. I'm trying to use MPI to speed up the pre-training, but the program crashes when syncing grads.

Running without MPI (or with a single process) works fine, but with more than one process (`mpirun -np 2 python run.py --config-name skimo_maze run_prefix=test2 gpu=0 wandb=true`) I get this traceback:

Error executing job with overrides: ['run_prefix=test2', 'gpu=0', 'wandb=true']
Traceback (most recent call last):
  File "/home/flyingwolfox/tcc-src-2/skimo/run.py", line 39, in main
    SkillRLRun(cfg).run()
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/main.py", line 56, in run
    trainer.train()
  File "/home/flyingwolfox/tcc-src-2/skimo/skill_trainer.py", line 41, in train
    self._pretrain()
  File "/home/flyingwolfox/tcc-src-2/skimo/skill_trainer.py", line 76, in _pretrain
    _train_info = self._agent.pretrain()
  File "/home/flyingwolfox/tcc-src-2/skimo/skimo_agent.py", line 713, in pretrain
    _train_info = self._pretrain(batch)
  File "/home/flyingwolfox/tcc-src-2/skimo/skimo_agent.py", line 847, in _pretrain
    joint_grad_norm = self.joint_optim.step(hl_loss + ll_loss)
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/utils/pytorch.py", line 466, in step
    sync_grad(self._model, self._device)
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/utils/pytorch.py", line 152, in sync_grad
    flat_grads, grads_shape = _get_flat_grads(network)
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/utils/pytorch.py", line 175, in _get_flat_grads
    for key_name, value in network.named_parameters():
AttributeError: 'list' object has no attribute 'named_parameters'
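
The crash itself doesn't look like an MPI installation problem, since it happens while flattening the grads, before any collective call. For completeness, a tiny standalone script like the one below can check the MPI side independently of the training code (this assumes the rolf sync utilities sit on top of mpi4py; the file name mpi_check.py is just mine):

```python
# mpi_check.py -- standalone sanity check that MPI works across ranks.
# Assumes mpi4py is installed; launch with: mpirun -np 2 python mpi_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Sum the rank ids across all processes; every rank should print the same total.
total = comm.allreduce(rank, op=MPI.SUM)
print(f"rank {rank}/{size}: allreduce sum of ranks = {total}")
```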

I tried converting the network list into a torch.nn.Sequential (right before the _get_flat_grads() call), but that didn't work either: getting the grads of the reward and critic modules fails (https://pastebin.com/Wu7fp0sP). The sketch below shows the direction I was experimenting with.
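
To be concrete: a small wrapper that accepts either a single module or a list of modules, and skips modules whose parameters have no gradients yet (which seems to be the case for the reward and critic heads during pre-training). This is only a sketch against the signatures visible in the traceback; `sync_grad_any` is my own name and the import path is a guess:

```python
# Sketch only: sync_grad lives in rolf/rolf/utils/pytorch.py per the traceback,
# but the exact import path here is an assumption.
from rolf.utils.pytorch import sync_grad


def sync_grad_any(model, device):
    """Sync gradients for a single nn.Module or a list of modules.

    sync_grad() expects something with .named_parameters(), so a list is
    handled by syncing each module separately.
    """
    modules = model if isinstance(model, (list, tuple)) else [model]
    for module in modules:
        # Skip modules with no gradients at all (e.g. heads not touched by this
        # loss); flattening their grads is what broke my nn.Sequential attempt.
        if all(p.grad is None for p in module.parameters()):
            continue
        sync_grad(module, device)
```

Calling something like this from step() in rolf/rolf/utils/pytorch.py (the sync_grad(self._model, self._device) call at line 466) is what I had in mind, but I'm not sure it matches how you intended MPI syncing to work, hence the question below.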

Is it possible to run with MPI? If so, how can I do it? Thanks!
