Training stops after computing results for only 1 epoch #121

EvanHan09 · 2019-06-03T06:46:36Z

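Has the author ever run into this? After starting training with python run_cnn.py train, only the results for one epoch are computed, and then training stops.
I checked the GPU memory usage and found no memory leak. I then tried two GPU memory allocation modes: (1) allocating a 0.4 fraction of GPU memory, and (2) automatic, on-demand allocation. The result was the same in both cases: training stopped after a single epoch.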
Configuring TensorBoard and Saver...
Loading training and validation data...
Time usage: 0:00:11
2019-06-03 11:40:30.224462: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce RTX 2060 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.89GiB
2019-06-03 11:40:30.237900: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2019-06-03 11:40:30.996786: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-03 11:40:31.005045: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0
2019-06-03 11:40:31.010727: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0: N
2019-06-03 11:40:31.015885: I c:\users\user\source\repos\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2457 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5)
Training and evaluating...
Epoch: 1
Iter: 0, Train Loss: 2.3, Train Acc: 10.94%, Val Loss: 2.3, Val Acc: 10.02%, Time: 0:00:02 *
Could you help me figure this out?
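For readers following along, the two allocation modes mentioned above might look roughly like this in TensorFlow 1.x. This is only a sketch: `make_session_config` is a hypothetical helper, not part of run_cnn.py, and it takes the imported `tensorflow` module as an argument so the logic can be exercised without a GPU. `tf.ConfigProto`, `gpu_options.per_process_gpu_memory_fraction`, and `gpu_options.allow_growth` are the TensorFlow 1.x names.

```python
def make_session_config(tf, memory_fraction=None):
    """Build a TensorFlow 1.x session config (hypothetical helper).

    tf: the imported tensorflow module (TF 1.x API assumed).
    memory_fraction: e.g. 0.4 to cap this process at 40% of GPU memory;
        None to let the allocation grow on demand instead.
    """
    config = tf.ConfigProto()
    if memory_fraction is None:
        # Mode 2: start small and grow GPU memory usage as needed.
        config.gpu_options.allow_growth = True
    else:
        # Mode 1: reserve at most a fixed fraction of GPU memory.
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return config


# Usage (requires TensorFlow 1.x to be installed):
#   import tensorflow as tf
#   session = tf.Session(config=make_session_config(tf, 0.4))
```

Neither mode changes what the model computes, so if both behave the same way, the memory setting is probably not the cause.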

EvanHan09 · 2019-06-03T07:27:59Z

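While debugging, I found that execution does not continue past the first line of the code below. Does this "run optimization" step select the model's optimization method? I'm new to this, so my understanding may be off.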
`session.run(model.optim, feed_dict=feed_dict)  # run the optimization step
total_batch += 1

if total_batch - last_improved > require_improvement:
    # Validation accuracy has not improved for a long time; stop training early
    print("No optimization for a long time, auto-stopping...")
    flag = True
    break  # exit the loop`
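The check quoted above is an early-stopping rule: training ends once more than `require_improvement` batches have passed since the last improvement. A simplified, framework-free sketch of that rule follows. The function name and the accuracy lists are made up for illustration, and the real script may evaluate less often than every batch.

```python
def run_with_early_stopping(val_acc_per_batch, require_improvement):
    """Return the number of batches processed before training stops.

    val_acc_per_batch: hypothetical validation accuracies, one per batch.
    require_improvement: batches to tolerate without a new best accuracy.
    """
    best_acc_val = 0.0
    last_improved = 0
    total_batch = 0
    for acc_val in val_acc_per_batch:
        if acc_val > best_acc_val:
            # New best result: remember when it happened.
            best_acc_val = acc_val
            last_improved = total_batch
        total_batch += 1
        if total_batch - last_improved > require_improvement:
            # Too many batches without improvement: stop early.
            print("No optimization for a long time, auto-stopping...")
            break
    return total_batch


# Accuracy stalls after the second batch, so a patience of 3 ends the run early.
stalled = run_with_early_stopping([0.1, 0.2, 0.2, 0.2, 0.2, 0.2], require_improvement=3)
# With a large patience value the same data runs to completion.
patient = run_with_early_stopping([0.1, 0.2, 0.2, 0.2, 0.2, 0.2], require_improvement=1000)
```

Under this rule, a stop right after the first batch would require `require_improvement` to be 0, so a run that ends at Iter 0 without printing the auto-stopping message probably stopped for some other reason.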

gaussic · 2019-06-04T13:23:43Z

If you comment out this block, it won't stop.

EvanHan09 · 2019-06-05T07:39:59Z

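> If you comment out this block, it won't stop.

Yes, I solved it later. It turned out to be a configuration problem: after I updated the CUDA driver to 10.0 with the matching tensorflow==1.12.0, training ran normally. One small issue remains, though: runs frequently fail with an error saying that initialization failed: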
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node Conv2D (defined at <ipython-input-1-1eec26e598ba>:22) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, Variable_1/read)]]
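When reporting an error like the one above, it usually helps to state which TensorFlow build is in use. A minimal sketch, assuming only that `tf.__version__` and `tf.test.is_built_with_cuda()` are available (the helper name `describe_tf_environment` is hypothetical). It does not diagnose the cuDNN failure itself.

```python
def describe_tf_environment():
    """Collect version facts worth including in a CUDA/cuDNN bug report."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed (or failed to load) in this environment.
        return {"tensorflow_version": None, "built_with_cuda": None}
    return {
        "tensorflow_version": tf.__version__,
        # True if this TensorFlow binary was compiled with CUDA support.
        "built_with_cuda": tf.test.is_built_with_cuda(),
    }


print(describe_tf_environment())
```

The CUDA and cuDNN versions installed on the machine still need to be checked separately against what that TensorFlow build expects.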

gaussic · 2019-06-06T00:14:26Z

I haven't run into that one.

fanruifeng · 2019-08-21T02:25:38Z

> > If you comment out this block, it won't stop.
>
> Yes, I solved it later. It turned out to be a configuration problem: after I updated the CUDA driver to 10.0 with the matching tensorflow==1.12.0, training ran normally. One small issue remains, though: runs frequently fail with an error saying that initialization failed:
> UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node Conv2D (defined at <ipython-input-1-1eec26e598ba>:22) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](gradients/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, Variable_1/read)]]

I'm running into this problem now as well. Did you manage to solve it?
