Dreambooth training fails with two or more GPUs on Ubuntu #543

Open
changqingla opened this issue Oct 10, 2024 · 1 comment
Comments

@changqingla

steps: 0%| | 0/234 [00:00<?, ?it/s]
epoch 1/3
[rank1]: Traceback (most recent call last):
[rank1]: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 529, in <module>
[rank1]: train(args)
[rank1]: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 343, in train
[rank1]: encoder_hidden_states = train_util.get_hidden_states(
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/workspace/lora-scripts/scripts/stable/library/train_util.py", line 4427, in get_hidden_states
[rank1]: encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
[rank1]: raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank1]: AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
[rank0]: Traceback (most recent call last):
[rank0]: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 529, in <module>
[rank0]: train(args)
[rank0]: File "/workspace/lora-scripts/./scripts/stable/train_db.py", line 343, in train
[rank0]: encoder_hidden_states = train_util.get_hidden_states(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/workspace/lora-scripts/scripts/stable/library/train_util.py", line 4427, in get_hidden_states
[rank0]: encoder_hidden_states = text_encoder.text_model.final_layer_norm(encoder_hidden_states)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
[rank0]: raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'DistributedDataParallel' object has no attribute 'text_model'
steps: 0%| | 0/234 [00:01<?, ?it/s]
W1010 02:41:57.596000 135260147754816 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3147 closing signal SIGTERM
E1010 02:41:57.711000 135260147754816 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 3148) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1116, in <module>
main()
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1112, in main
launch_command(args)
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1097, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 734, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./scripts/stable/train_db.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-10-10_02:41:57
host : ubuntu-Super-Server
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 3148)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

02:41:58-146594 ERROR Training failed / 训练失败
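
Reading the traceback above: `get_hidden_states` accesses `text_encoder.text_model`, but in a multi-GPU run the text encoder has been wrapped in `DistributedDataParallel`. That wrapper keeps the original model on its `.module` attribute and does not forward other attributes, which is why the lookup raises `AttributeError`. Below is a minimal sketch of the usual unwrapping pattern. It is not a tested patch for `train_util.py`: `FakeDDP`, `FakeTextEncoder`, `FakeTextModel`, and `unwrap_model` are hypothetical stand-ins so the snippet runs without GPUs or a process group.

```python
class FakeTextModel:
    """Hypothetical stand-in for the encoder's inner text_model."""

    def final_layer_norm(self, hidden_states):
        # Identity here; the real layer norm would transform the tensor.
        return hidden_states


class FakeTextEncoder:
    """Hypothetical stand-in for the CLIP text encoder."""

    def __init__(self):
        self.text_model = FakeTextModel()


class FakeDDP:
    """Stand-in for torch.nn.parallel.DistributedDataParallel.

    Like the real wrapper, it stores the wrapped model as `.module`
    and does not expose the wrapped model's attributes directly.
    """

    def __init__(self, module):
        self.module = module


def unwrap_model(model):
    """Return the underlying model if `model` is wrapped, else `model` itself."""
    return model.module if isinstance(model, FakeDDP) else model


encoder = FakeTextEncoder()
wrapped = FakeDDP(encoder)

# Direct access fails on the wrapper, mirroring the error in the log.
print(hasattr(wrapped, "text_model"))

# Unwrapping first makes the attribute reachable again.
inner = unwrap_model(wrapped)
print(inner.text_model.final_layer_norm([0.1, 0.2]))
```

In real code the check would be against `torch.nn.parallel.DistributedDataParallel` instead of `FakeDDP`, or the script could call `accelerator.unwrap_model(text_encoder)` from Accelerate before touching `text_model`. Whether that change alone is enough for this script is an assumption that has not been verified here.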

@hben35096

focus on
