When using 2 Tesla V100 GPU cards, training fails with the error below, which says "No locks available". What do you think the issue is? #25
Comments
Meanwhile, both GPU cards are shown as available by nvidia-smi.
Yet when I use only one GPU card, it suddenly works! Why can it not work on 2 cards?
Hi, please feel free to delete the lines related to the lock. Those were meant for my own training infrastructure and apparently don't generalize. Best regards,
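For reference, a minimal sketch of what a lock-free setup() could look like, assuming the existing --port option is reused and the rendezvous is switched to PyTorch's standard env:// initialization (this is an illustration, not the repository's actual code):

```python
import os
import torch.distributed as dist

def setup(rank, world_size, port, output_dir):
    # Hypothetical lock-free variant of setup() in train_con.py:
    # replace the file-lock based init_method with the standard env://
    # rendezvous, which only needs a master address and port and therefore
    # does not call flock at all.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = str(port)
    dist.init_process_group("gloo", init_method="env://",
                            rank=rank, world_size=world_size)
```

With env:// initialization no shared lock file is created, so the "flock: No locks available" error (often seen when the lock file lives on a filesystem such as NFS that does not support flock) should not occur.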
(Pix2NeRF) bash-4.4$ CUDA_VISIBLE_DEVICES=0,1 python3 train_con.py --curriculum=celeba --output_dir='/data/yshan/data/Img/pix2NerfOutput' --dataset_dir='/data/yshan/data/Img/img_align_celeba' --encoder_type='CCS' --recon_lambda=5 --ssim_lambda=1 --vgg_lambda=1 --pos_lambda_gen=15 --lambda_e_latent=1 --lambda_e_pos=1 --cond_lambda=1 --load_encoder=1
Namespace(n_epochs=3000, sample_interval=1000, output_dir='/data/yshan/data/Img/pix2NerfOutput', load_dir='/data/yshan/data/Img/pix2NerfOutput', curriculum='celeba', eval_freq=5000, port='12354', set_step=None, model_save_interval=200, pretrained_dir='', wandb_name='', wandb_entity='', wandb_project='', recon_lambda=5.0, ssim_lambda=1.0, vgg_lambda=1.0, dataset_dir='/data/yshan/data/Img/img_align_celeba', pos_lambda_gen=15.0, sn=0, lambda_e_latent=1.0, lambda_e_pos=1.0, encoder_type='CCS', cond_lambda=1.0, ema=1, load_encoder=1)
Lock not found
Traceback (most recent call last):
File "/data/yshan/Pix2NeRF/train_con.py", line 684, in
mp.spawn(train, args=(num_gpus, opt), nprocs=num_gpus, join=True)
File "/home/yshan/anaconda3/envs/Pix2NeRF/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/yshan/anaconda3/envs/Pix2NeRF/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/yshan/anaconda3/envs/Pix2NeRF/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/yshan/anaconda3/envs/Pix2NeRF/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data/yshan/Pix2NeRF/train_con.py", line 85, in train
setup(rank, world_size, opt.port, opt.output_dir)
File "/data/yshan/Pix2NeRF/train_con.py", line 46, in setup
dist.init_process_group('gloo', init_method=file_lock, rank=rank, world_size=world_size)
File "/home/yshan/anaconda3/envs/Pix2NeRF/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 907, in init_process_group
default_pg = _new_process_group_helper(
File "/home/yshan/anaconda3/envs/Pix2NeRF/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: flock: No locks available