I am running this notebook. However, when I try to merge the LoRA adapter and base model weights before exporting to TensorRT-LLM (`python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py`), I receive the following error:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:53747 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Error executing job with overrides: ['trainer.accelerator=gpu', 'tensor_model_parallel_size=1', 'pipeline_model_parallel_size=1', 'gpt_model_file=gemma_2b_pt.nemo', 'lora_model_path=nemo_experiments/gemma_lora_pubmedqa/checkpoints/gemma_lora_pubmedqa.nemo', 'merged_model_path=gemma_lora_pubmedqa_merged.nemo']
Traceback (most recent call last):
  File "/opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py", line 171, in main
    model = MegatronGPTModel.restore_from(
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/nlp_model.py", line 478, in restore_from
    return super().restore_from(
  File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/modelPT.py", line 468, in restore_from
    instance = cls._save_restore_connector.restore_from(
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 1306, in restore_from
    trainer.strategy.setup_environment()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 154, in setup_environment
    self.setup_distributed()
  File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 244, in setup_distributed
    super().setup_distributed()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 203, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/usr/local/lib/python3.10/dist-packages/lightning_fabric/utilities/distributed.py", line 297, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1172, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 244, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:53747 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
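For context, errno 98 means the TCP port that the distributed rendezvous tried to bind (here 53747) is already held by another process, typically a stale notebook kernel or a previous distributed run that never released it. A minimal sketch of one common workaround, assuming the script's trainer uses torch.distributed's standard env:// rendezvous (the `MASTER_ADDR`/`MASTER_PORT` environment variables): probe for a free port and export it before launching merge.py. The override list below is copied from the failing command above; the helper name `find_free_port` is mine.

```python
# Sketch: ask the OS for an unused TCP port, then pass it to the merge
# script via MASTER_PORT so the rendezvous does not collide with a port
# that is already bound. Assumes the env:// rendezvous is in effect.
import os
import socket
import subprocess

def find_free_port() -> int:
    # Binding to port 0 makes the OS assign a free ephemeral port;
    # the socket is closed immediately so merge.py can bind it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

env = os.environ.copy()
env["MASTER_ADDR"] = "127.0.0.1"
env["MASTER_PORT"] = str(find_free_port())  # avoid the already-bound port

subprocess.run(
    [
        "python",
        "/opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py",
        "trainer.accelerator=gpu",
        "tensor_model_parallel_size=1",
        "pipeline_model_parallel_size=1",
        "gpt_model_file=gemma_2b_pt.nemo",
        "lora_model_path=nemo_experiments/gemma_lora_pubmedqa/checkpoints/gemma_lora_pubmedqa.nemo",
        "merged_model_path=gemma_lora_pubmedqa_merged.nemo",
    ],
    env=env,
    check=True,
)
```

There is a small race between releasing the probe socket and merge.py rebinding the port, but in practice this sidesteps a stale process holding the old port; restarting the notebook kernel that launched the earlier distributed run also clears the conflict.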
Setup Information: