ConnectionRefusedError: [Errno 111] Connection refused #128

Open
Glutton-zh opened this issue Apr 23, 2020 · 1 comment

Comments

@Glutton-zh

training loss at iteration 79735: 5.6166815757751465
focal loss at iteration 79735: 5.0547027587890625
pull loss at iteration 79735: 0.0331345796585083
push loss at iteration 79735: 0.30962249636650085
regr loss at iteration 79735: 0.219222292304039
training loss at iteration 79740: 3.3387136459350586
focal loss at iteration 79740: 2.8270068168640137
pull loss at iteration 79740: 0.02639671042561531
push loss at iteration 79740: 0.2322157919406891
regr loss at iteration 79740: 0.25309425592422485
44%|█████████████▎ | 79741/180000 [36:08:34<45:26:33, 1.63s/it]Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 51, in pin_memory
    data = data_queue.get()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 256, in rebuild_storage_fd
    fd = df.detach()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 492, in Client
    c = SocketClient(address)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

training loss at iteration 79745: 1.7480967044830322
focal loss at iteration 79745: 1.15070378780365
pull loss at iteration 79745: 0.019453493878245354
push loss at iteration 79745: 0.3843255937099457
regr loss at iteration 79745: 0.19361379742622375
44%|█████████████▎ | 79748/180000 [36:08:45<45:26:22, 1.63s/it]

^CTraceback (most recent call last):
File "train.py", line 203, in
Process Process-5:
Process Process-2:
Process Process-1:
Process Process-4:

@Glutton-zh
Author

I am using CenterNet to train on VOC2007, but training breaks at iteration 79748/180000 (around the 64th epoch). I tried again and it broke again, this time at 68364/180000. My GPU memory usage is 8051MiB/5116MiB. The error is:

training loss at iteration 68355: 5.786685466766357
focal loss at iteration 68355: 5.192009925842285
pull loss at iteration 68355: 0.008522081188857555
push loss at iteration 68355: 0.3189387023448944
regr loss at iteration 68355: 0.2672148048877716
38%|███████████▍ | 68357/180000 [27:26:39<44:49:23, 1.45s/it]Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "train.py", line 51, in pin_memory
    data = data_queue.get()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 256, in rebuild_storage_fd
    fd = df.detach()
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 492, in Client
    c = SocketClient(address)
  File "/home/zhanghan/anaconda3/envs/CornerNet_Lite/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

training loss at iteration 68360: 6.084456443786621
focal loss at iteration 68360: 5.576683521270752
pull loss at iteration 68360: 0.04028501734137535
push loss at iteration 68360: 0.2413397580385208
regr loss at iteration 68360: 0.22614836692810059
38%|███████████▍ | 68364/180000 [27:26:49<44:49:13, 1.45s/it]

After that, the program stops making progress: a few more iterations are logged and then it hangs.
Please help me.
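
A note on the hang itself: both tracebacks show the exception being raised inside the `pin_memory` thread at `data_queue.get()` (train.py, line 51). If that thread dies, the main loop would keep waiting for batches that never arrive, which would match the behaviour above. The following is only a sketch of one way to make such a failure surface as an error instead of a silent hang. It is not the code in train.py: the names `pin_memory_worker`, `next_batch`, `pinned_queue` and `stop_event` are hypothetical, and the actual memory pinning step is left out.

```python
import queue
import threading


def pin_memory_worker(data_queue, pinned_queue, stop_event, get_timeout=1.0):
    """Hypothetical stand-in for the pin_memory thread.

    Forwards items from data_queue to pinned_queue. If fetching an item
    fails, the exception is forwarded too, so the consumer can see it.
    """
    while not stop_event.is_set():
        try:
            data = data_queue.get(timeout=get_timeout)
        except queue.Empty:
            continue  # nothing available yet; re-check stop_event
        except Exception as exc:  # e.g. ConnectionRefusedError during unpickling
            pinned_queue.put(("error", exc))
            return
        # The real thread would pin the tensors here; omitted in this sketch.
        pinned_queue.put(("ok", data))


def next_batch(pinned_queue, timeout=60.0):
    """Return the next batch, re-raising any error reported by the worker."""
    status, payload = pinned_queue.get(timeout=timeout)
    if status == "error":
        raise RuntimeError("data loading thread failed") from payload
    return payload
```

With this pattern the training loop would call `next_batch(pinned_queue)` and stop with a `RuntimeError` (or `queue.Empty` after the timeout) rather than waiting forever. It does not address the refused connection itself. Since the traceback goes through `rebuild_storage_fd`, switching PyTorch's tensor sharing strategy with `torch.multiprocessing.set_sharing_strategy("file_system")` may be worth trying, but that is an untested suggestion here.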
