I found that if I use 4 of the 8 GPUs on the machine and select GPUs 4-7, training fails.
# caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=4,5,6,7
Error message:
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)
A workaround is to add "CUDA_VISIBLE_DEVICES=4,5,6,7" before "caffe train ...".
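For reference, a minimal sketch of how the workaround command can be put together. It assumes that CUDA renumbers the visible devices starting from 0, so that once CUDA_VISIBLE_DEVICES is set, -gpu refers to logical IDs 0-3 rather than the physical IDs 4-7; the report itself does not state which -gpu value was used with the workaround. The variable names are illustrative, and the script only prints the command instead of running it.

```shell
#!/bin/sh
# Physical GPU indices to expose to the process (taken from the report).
PHYSICAL_GPUS="4,5,6,7"

# Build the matching logical index list (0..N-1). Assumption: inside the
# process, the visible devices are renumbered from 0.
LOGICAL_GPUS=""
i=0
for _ in $(echo "$PHYSICAL_GPUS" | tr ',' ' '); do
  LOGICAL_GPUS="${LOGICAL_GPUS:+$LOGICAL_GPUS,}$i"
  i=$((i + 1))
done

# Dry run: print the command rather than executing it.
CMD="CUDA_VISIBLE_DEVICES=$PHYSICAL_GPUS caffe train --solver=models/bvlc_googlenet/solver_fp16_4.prototxt -gpu=$LOGICAL_GPUS"
echo "$CMD"
```

Copying the printed line into the shell runs training on physical GPUs 4-7 while Caffe itself only sees four devices.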
Note: I have checked that this is not an out-of-memory problem, because training works fine if I choose "-gpu=0,1,2,3".
I hope someone can look into this issue. Thanks in advance.
Info:
NVIDIA Docker: Caffe:19.06
NVCaffe: 0.17.3
CuDNN: 7.6.0
NCCL : 2.4.7
Model : bvlc_googlenet
Batch size : 256
More logs:
I0717 00:12:36.297857 545 data_layer.cpp:107] [n0.d4.r0] Transformer threads: 4 (auto)
I0717 00:12:36.389331 609 internal_thread.cpp:78] Started internal thread 609 on device 4, rank 0
I0717 00:12:36.389572 609 db_lmdb.cpp:36] Opened lmdb examples/imagenet/ilsvrc12_train_lmdb
I0717 00:12:36.399473 600 internal_thread.cpp:78] Started internal thread 600 on device 4, rank 0
I0717 00:12:36.405875 599 internal_thread.cpp:78] Started internal thread 599 on device 4, rank 0
I0717 00:12:36.408145 598 internal_thread.cpp:78] Started internal thread 598 on device 4, rank 0
I0717 00:12:36.409735 601 internal_thread.cpp:78] Started internal thread 601 on device 4, rank 0
F0717 00:12:37.478988 604 blob.cpp:289] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered (2)
*** Check failure stack trace: ***
I0717 00:12:37.488199 597 blocking_queue.cpp:40] Waiting for datum
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
F0717 00:12:37.490514 589 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
F0717 00:12:37.506527 593 syncedmem.cpp:18] Check failed: error == cudaSuccess (700 vs. 0) an illegal memory access was encountered
*** Check failure stack trace: ***
@ 0x7fa8cf9345cd google::LogMessage::Fail()
@ 0x7fa8d0359052 caffe::Blob::CopyFrom()
@ 0x7fa8cf936433 google::LogMessage::SendToLog()
@ 0x7fa8cf93415b google::LogMessage::Flush()
@ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
@ 0x7fa8cf936e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa8d07dbbcb caffe::BatchTransformer<>::InternalThreadEntry()
@ 0x7fa8d07d50d2 caffe::SyncedMemory::MallocHost()
@ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
@ 0x7fa8d07d5140 caffe::SyncedMemory::to_cpu()
@ 0x7fa8d02bdbb2 caffe::InternalThread::entry()
@ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
@ 0x7fa8d07d64fd caffe::SyncedMemory::mutable_cpu_data()
@ 0x7fa8d02bfc2f boost::detail::thread_data<>::run()
@ 0x7fa8cdcaf5d5 (unknown)
@ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
@ 0x7fa8d071ff23 caffe::DataLayer<>::load_batch()
@ 0x7fa8cd5686ba start_thread
@ 0x7fa8cdfcb41d clone
@ (nil) (unknown)