Hello. I'm trying to use MPI to speed up the pre-training, but the program crashes when syncing grads.
Running without MPI (or with a single process) is fine, but with more than one process (mpirun -np 2 python run.py --config-name skimo_maze run_prefix=test2 gpu=0 wandb=true) I get this traceback:
Error executing job with overrides: ['run_prefix=test2', 'gpu=0', 'wandb=true']
Traceback (most recent call last):
  File "/home/flyingwolfox/tcc-src-2/skimo/run.py", line 39, in main
    SkillRLRun(cfg).run()
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/main.py", line 56, in run
    trainer.train()
  File "/home/flyingwolfox/tcc-src-2/skimo/skill_trainer.py", line 41, in train
    self._pretrain()
  File "/home/flyingwolfox/tcc-src-2/skimo/skill_trainer.py", line 76, in _pretrain
    _train_info = self._agent.pretrain()
  File "/home/flyingwolfox/tcc-src-2/skimo/skimo_agent.py", line 713, in pretrain
    _train_info = self._pretrain(batch)
  File "/home/flyingwolfox/tcc-src-2/skimo/skimo_agent.py", line 847, in _pretrain
    joint_grad_norm = self.joint_optim.step(hl_loss + ll_loss)
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/utils/pytorch.py", line 466, in step
    sync_grad(self._model, self._device)
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/utils/pytorch.py", line 152, in sync_grad
    flat_grads, grads_shape = _get_flat_grads(network)
  File "/home/flyingwolfox/tcc-src-2/skimo/rolf/rolf/utils/pytorch.py", line 175, in _get_flat_grads
    for key_name, value in network.named_parameters():
AttributeError: 'list' object has no attribute 'named_parameters'
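From the traceback, it looks like sync_grad receives a plain Python list of modules rather than a single nn.Module, so the call to named_parameters() fails. A two-line repro of the same error (nets here is just a stand-in for whatever self._model actually holds):

import torch.nn as nn

nets = [nn.Linear(4, 4), nn.Linear(4, 4)]  # a list of modules, like self._model in the joint optimizer
nets.named_parameters()  # AttributeError: 'list' object has no attribute 'named_parameters'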
I tried converting the network list into a torch.nn.Sequential (before the _get_flat_grads() call), but that didn't work either: getting the grad of the reward and critic modules fails (https://pastebin.com/Wu7fp0sP). A sketch of the kind of change I was attempting is below.
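For clarity, here is a minimal sketch of the direction I mean, assuming _get_flat_grads only needs named_parameters() and the flattened grads; the _named_parameters helper and the None-grad guard are my own guesses, not the actual rolf code:

import torch

def _named_parameters(network):
    """Yield (name, param) pairs from a single module or a list of modules."""
    if isinstance(network, (list, tuple)):
        for i, module in enumerate(network):
            for name, param in module.named_parameters():
                yield f"{i}.{name}", param  # prefix with the list index to keep names unique
    else:
        yield from network.named_parameters()

def _get_flat_grads(network):
    """Flatten all gradients into one vector, remembering each grad's shape."""
    grads_shape = {}
    flat_grads = []
    for key_name, value in _named_parameters(network):
        if value.grad is None:
            # e.g. reward/critic params that received no gradient this step;
            # skipping them is my guess at why the Sequential attempt failed
            continue
        grads_shape[key_name] = value.grad.shape
        flat_grads.append(value.grad.reshape(-1))
    return torch.cat(flat_grads), grads_shape

With something like this, sync_grad could presumably average flat_grads across ranks and scatter them back using grads_shape, the same way the single-module path already does.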
Is it possible to run with MPI? If so, how can I do it? Thanks