LoRA weight merging giving torch distributed error on single-node single-gpu #255

Open · ayushbits opened this issue on Dec 11, 2024 · 0 comments

I am running this notebook. However, when I try to merge the LoRA adapter and base model weights before exporting to TensorRT-LLM (python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py), I receive the error below.
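
For reference, here is the full merge invocation, reconstructed from the override list that appears in the error output (the .nemo paths are specific to my setup):

```bash
python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py \
    trainer.accelerator=gpu \
    tensor_model_parallel_size=1 \
    pipeline_model_parallel_size=1 \
    gpt_model_file=gemma_2b_pt.nemo \
    lora_model_path=nemo_experiments/gemma_lora_pubmedqa/checkpoints/gemma_lora_pubmedqa.nemo \
    merged_model_path=gemma_lora_pubmedqa_merged.nemo
```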

```
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:53747 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Error executing job with overrides: ['trainer.accelerator=gpu', 'tensor_model_parallel_size=1', 'pipeline_model_parallel_size=1', 'gpt_model_file=gemma_2b_pt.nemo', 'lora_model_path=nemo_experiments/gemma_lora_pubmedqa/checkpoints/gemma_lora_pubmedqa.nemo', 'merged_model_path=gemma_lora_pubmedqa_merged.nemo']
Traceback (most recent call last):
 File "/opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py", line 171, in main
   model = MegatronGPTModel.restore_from(
 File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/nlp_model.py", line 478, in restore_from
   return super().restore_from(
 File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/modelPT.py", line 468, in restore_from
   instance = cls._save_restore_connector.restore_from(
 File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 1306, in restore_from
   trainer.strategy.setup_environment()
 File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 154, in setup_environment
   self.setup_distributed()
 File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 244, in setup_distributed
   super().setup_distributed()
 File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 203, in setup_distributed
   _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
 File "/usr/local/lib/python3.10/dist-packages/lightning_fabric/utilities/distributed.py", line 297, in _init_dist_connection
   torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
   func_return = func(*args, **kwargs)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1172, in init_process_group
   store, rank, world_size = next(rendezvous_iterator)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 244, in _env_rendezvous_handler
   store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
   return TCPStore(
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:53747 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
```
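
A possible workaround sketch (untested): the traceback ends in torch's _env_rendezvous_handler, which takes the rendezvous address and port from the MASTER_ADDR / MASTER_PORT environment variables, so forcing the run onto a port that is actually free might avoid the bind failure. The port number below is an arbitrary example:

```bash
# Port 53747 is evidently already taken on this machine; pick any free port instead.
export MASTER_ADDR=localhost
export MASTER_PORT=29511   # arbitrary example port, assumed to be free
python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py \
    trainer.accelerator=gpu tensor_model_parallel_size=1 pipeline_model_parallel_size=1 \
    gpt_model_file=gemma_2b_pt.nemo \
    lora_model_path=nemo_experiments/gemma_lora_pubmedqa/checkpoints/gemma_lora_pubmedqa.nemo \
    merged_model_path=gemma_lora_pubmedqa_merged.nemo
```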

Setup Information:

torch: 2.2.0a0+81ea7a4
nemo: 2.0
Container: nvcr.io/nvidia/nemo:24.01.gemma