Training the FCN network on a GPU in the cloud fails with this error: FileNotFoundError: [Errno 2] No such file or directory srun: error: gpu03: task 0: Exited with exit code 1 #801
Comments
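Has your problem been solved? I'm running into the same issue.
Solved. The directory location was still set incorrectly and needed to be changed.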
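Do you mean the dataset location? My dataset is in the root directory of my personal account, at the same level as the program directory, but I still get this error when I run it.
It also includes the path of the file you run. I suggest trying absolute paths.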
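One way to act on the absolute-path suggestion is sketched below. It assumes the dataset location reaches the training script as a string that may be relative; the helper name `resolve_data_root` and its `base_dir` parameter are hypothetical and not part of the repository. Relative values are resolved against a fixed base directory instead of the current working directory, which may differ when the job is launched through `srun`, and the check fails early with the full path if the directory is missing.

```python
from pathlib import Path


def resolve_data_root(data_path: str, base_dir: Path) -> Path:
    """Return data_path as an absolute path, or raise if the directory is missing.

    Hypothetical helper: relative values are interpreted relative to base_dir
    rather than the process's current working directory.
    """
    path = Path(data_path).expanduser()
    if not path.is_absolute():
        path = base_dir / path
    path = path.resolve()  # collapses ".." segments and symlinks
    if not path.is_dir():
        # Fail before any DataLoader worker starts, and show the full path.
        raise FileNotFoundError(f"dataset directory not found: {path}")
    return path
```

In a training script this might be called as `resolve_data_root(data_path, Path(__file__).resolve().parent)`, where `data_path` stands for whatever value the script receives for the dataset location.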
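I've solved it. The cause was the location of the dataset files. Thanks.
This is the full error output. I searched online, and many posts say this is an inter-process communication problem. How can it be solved? Which parts of the code should I change?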
Epoch: [0] [ 0/366] eta: 0:31:04 lr: 0.000000 loss: 2.1887 (2.1887) time: 5.0952 data: 0.7384
Epoch: [0] [ 10/366] eta: 0:15:24 lr: 0.000003 loss: 0.5890 (2.3867) time: 2.5974 data: 0.0681
Epoch: [0] [ 20/366] eta: 0:14:01 lr: 0.000006 loss: 0.2813 (1.7838) time: 2.2994 data: 0.0011
Epoch: [0] [ 30/366] eta: 0:13:21 lr: 0.000009 loss: 2.2992 (1.4588) time: 2.2671 data: 0.0010
Epoch: [0] [ 40/366] eta: 0:12:44 lr: 0.000011 loss: 1.2415 (1.4418) time: 2.2521 data: 0.0010
Epoch: [0] [ 50/366] eta: 0:12:14 lr: 0.000014 loss: 1.4934 (1.4652) time: 2.2295 data: 0.0010
Epoch: [0] [ 60/366] eta: 0:11:49 lr: 0.000017 loss: 0.5944 (1.4093) time: 2.2702 data: 0.0010
Epoch: [0] [ 70/366] eta: 0:11:23 lr: 0.000019 loss: 0.6704 (1.4132) time: 2.2722 data: 0.0010
Epoch: [0] [ 80/366] eta: 0:11:04 lr: 0.000022 loss: 0.3548 (1.3494) time: 2.3282 data: 0.0010
Epoch: [0] [ 90/366] eta: 0:10:39 lr: 0.000025 loss: 0.3015 (1.2649) time: 2.3509 data: 0.0011
Epoch: [0] [100/366] eta: 0:10:14 lr: 0.000028 loss: 0.6640 (1.2471) time: 2.2596 data: 0.0011
Epoch: [0] [110/366] eta: 0:09:51 lr: 0.000030 loss: 2.1179 (1.2050) time: 2.2716 data: 0.0010
Epoch: [0] [120/366] eta: 0:09:27 lr: 0.000033 loss: 2.0124 (1.2004) time: 2.3035 data: 0.0010
Epoch: [0] [130/366] eta: 0:09:04 lr: 0.000036 loss: 1.1753 (1.1981) time: 2.2837 data: 0.0010
Epoch: [0] [140/366] eta: 0:08:39 lr: 0.000039 loss: 2.3567 (1.2141) time: 2.2321 data: 0.0010
Epoch: [0] [150/366] eta: 0:08:18 lr: 0.000041 loss: 0.5729 (1.1973) time: 2.3115 data: 0.0010
Epoch: [0] [160/366] eta: 0:07:54 lr: 0.000044 loss: 0.4893 (1.2001) time: 2.3283 data: 0.0011
Epoch: [0] [170/366] eta: 0:07:30 lr: 0.000047 loss: 0.7241 (1.1839) time: 2.2304 data: 0.0011
Epoch: [0] [180/366] eta: 0:07:06 lr: 0.000050 loss: 1.3635 (1.1723) time: 2.2145 data: 0.0010
Traceback (most recent call last):
  File "/public/home/2023020919/FCN/train.py", line 206, in <module>
    main(args)
  File "/public/home/2023020919/FCN/train.py", line 141, in main
    mean_loss, lr = train_one_epoch(model, optimizer, train_loader, device, epoch,
  File "/public/home/2023020919/FCN/train_utils/train_and_evals.py", line 42, in train_one_epoch
    for image, target in metric_logger.log_every(data_loader, print_freq, header):
  File "/public/home/2023020919/FCN/train_utils/distrributed_utils.py", line 189, in log_every
    for obj in iterable:
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 495, in rebuild_storage_fd
    fd = df.detach()
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/connection.py", line 502, in Client
    c = SocketClient(address)
  File "/public/home/2023020919/.conda/envs/pytorch_3.9_gpu/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
    s.connect(address)
FileNotFoundError: [Errno 2] No such file or directory
srun: error: gpu03: task 0: Exited with exit code 1
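As a reading aid for the last frames of the traceback: the exception is raised by `s.connect(address)` inside `multiprocessing.connection.SocketClient`, so the path that could not be found there is a socket address used for inter-process communication, not a dataset file directly. The sketch below is an illustration only and does not reproduce the training failure. It uses a hypothetical function name, `connect_to_missing_socket`, and shows that connecting to a Unix socket path with no listener raises the same exception type and errno. This would be consistent with a DataLoader worker having already exited, although the thread does not confirm that; the participants report that correcting the dataset and script paths resolved the error.

```python
import os
import socket
import tempfile


def connect_to_missing_socket() -> str:
    """Connect to a Unix socket path that nothing listens on; return the error text."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical address: no process ever binds a listener to this path.
        address = os.path.join(tmp, "listener-gone")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            try:
                s.connect(address)
            except FileNotFoundError as exc:
                return f"{type(exc).__name__}: [Errno {exc.errno}] {exc.strerror}"
    return "connected"


if __name__ == "__main__":
    print(connect_to_missing_socket())
```

The sketch relies on `socket.AF_UNIX`, so it is meant for Linux or another Unix-like system such as the cluster node in the log.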