steps: 0%| | 0/234 [00:00<?, ?it/s]
epoch 1/3
[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 529, in
[rank1]: train(args)
[rank1]: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 343, in train
[rank1]: encoder_hidden_states = train_util.get_hidden_states(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/workspace/lora-scripts/scripts/stable/library/train_util.py", line 4427, in get_hidden_states
[rank1]: encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1729, in getattr
[rank1]: raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
[rank1]: AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 529, in
[rank0]: train(args)
[rank0]: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 343, in train
[rank0]: encoder_hidden_states = train_util.get_hidden_states(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/lora-scripts/scripts/stable/library/train_util.py", line 4427, in get_hidden_states
[rank0]: encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1729, in getattr
[rank0]: raise AttributeError(f"'{type(self).name}' object has no attribute '{name}'")
[rank0]: AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
steps: 0%| | 0/234 [00:01<?, ?it/s]
W1010 02:41:57.596000 135260147754816 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3147 closing signal SIGTERM
E1010 02:41:57.711000 135260147754816 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 3148) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1116, in
main()
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1112, in main
launch_command(args)
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./scripts/stable/train_db.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-10-10_02:41:57
host : ubuntu-Super-Server
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3148)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
02:41:58-146594 ERROR Training failed / 训练失败
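The traceback points at train_util.get_hidden_states accessing text_encoder.text_model while the text encoder is still wrapped in DistributedDataParallel by the multi-GPU launcher; DDP only exposes the wrapped model through its .module attribute, so the attribute lookup fails. Below is a minimal sketch of a workaround, not the upstream patch; the helper name and the commented call site are illustrative and based only on the traceback above.

import torch

def unwrap_ddp(model):
    # If the model was wrapped by DistributedDataParallel, return the inner module.
    if isinstance(model, torch.nn.parallel.DistributedDataParallel):
        return model.module
    return model

# Inside get_hidden_states (train_util.py:4427 in the traceback), the CLIP text
# model would then be reached through the unwrapped module, e.g.:
#   text_encoder = unwrap_ddp(text_encoder)
#   encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)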