What could have led to CUDNN_STATUS_INTERNAL_ERROR? #16

Open
yarikoptic opened this issue Nov 11, 2019 · 2 comments

@yarikoptic

It used to work on my laptop, but it no longer does. I fear it is due to some interaction with the GPU also being used as the actual graphics card, and thus Xorg consuming too much memory (although the requested ~1.3GB is less than the ~2GB reported free), or something like that.

nvidia-smi
$> nvidia-smi
Mon Nov 11 09:55:21 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro T2000        Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   43C    P8     3W /  N/A |   2297MiB /  3911MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     21824      G   /usr/lib/xorg/Xorg                           141MiB |
|    0     25467      G   /usr/lib/xorg/Xorg                          1670MiB |
|    0     25596      G   /usr/bin/gnome-shell                         180MiB |
|    0     27333      G   ...uest-channel-token=14439694130078186709   232MiB |
|    0     28802      G   /usr/lib/xorg/Xorg                             6MiB |
|    0     28899      G   /usr/bin/gnome-shell                           5MiB |
+-----------------------------------------------------------------------------+
The actual run via singularity:
$> singularity run -e -B /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.430.50 -B /usr/lib/x86_64-linux-gnu/libcuda.so.1 neuronets-kwyk--version-0.4-gpu.sing raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out
Bayesian dropout functions have been loaded.
Your version: v0.4 Latest version: 0.4
++ Conforming volume to 1mm^3 voxels and size 256x256x256.
/opt/kwyk/freesurfer/bin/mri_convert: line 2: /opt/kwyk/freesurfer/sources.sh: No such file or directory
mri_convert.bin --conform raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz /tmp/tmpwtickiw9.nii.gz 
$Id: mri_convert.c,v 1.226 2016/02/26 16:15:24 mreuter Exp $
reading from raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz...
TR=10.00, TE=0.00, TI=0.00, flip angle=0.00
i_ras = (0, -1, 0)
j_ras = (0, 0, 1)
k_ras = (1, 0, 0)
changing data type from float to uchar (noscale = 0)...
MRIchangeType: Building histogram 
Reslicing using trilinear interpolation 
writing to /tmp/tmpwtickiw9.nii.gz...
++ Running forward pass of model.
2019-11-11 14:57:43.820728: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-11 14:57:43.916219: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-11-11 14:57:43.916394: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Quadro T2000 major: 7 minor: 5 memoryClockRate(GHz): 1.5
pciBusID: 0000:01:00.0
totalMemory: 3.82GiB freeMemory: 1.41GiB
2019-11-11 14:57:43.916409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-11-11 14:57:44.267550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-11 14:57:44.267570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-11-11 14:57:44.267575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-11-11 14:57:44.267684: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1246 MB memory) -> physical GPU (device: 0, name: Quadro T2000, pci bus id: 0000:01:00.0, compute capability: 7.5)
Normalizer being used <function zscore at 0x7fe98eac4ea0>
-5.8382284e-08
1.0000015
 0/64 [..............................] - ETA: 0s2019-11-11 14:57:46.303925: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-11-11 14:57:46.314172: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node layer_1/conv3d/Conv3D}} = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/kwyk", line 11, in <module>
    load_entry_point('kwyk', 'console_scripts', 'kwyk')()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/kwyk/kwyk/cli.py", line 92, in predict
    normalizer=zscore)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 348, in predict_from_filepath
    batch_size=batch_size)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 275, in predict_from_img
    batch_size=batch_size)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 186, in predict_from_array
    new_prediction = predictor( {'volume': features[j:j + batch_size]})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/predictor.py", line 77, in __call__
    return self._session.run(fetches=self.fetch_tensors, feed_dict=feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node layer_1/conv3d/Conv3D (defined at /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py:153)  = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]

Caused by op 'layer_1/conv3d/Conv3D', defined at:
  File "/usr/local/bin/kwyk", line 11, in <module>
    load_entry_point('kwyk', 'console_scripts', 'kwyk')()
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.5/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/opt/kwyk/kwyk/cli.py", line 83, in predict
    predictor = _get_predictor(savedmodel_path)
  File "/usr/local/lib/python3.5/dist-packages/nobrainer/predict.py", line 406, in _get_predictor
    predictor = tf.contrib.predictor.from_saved_model(str(path))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/predictor_factories.py", line 153, in from_saved_model
    config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py", line 153, in __init__
    loader.load(self._session, tags.split(','), export_dir)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 197, in load
    return loader.load(sess, tags, import_scope, **saver_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 350, in load
    **saver_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/saved_model/loader_impl.py", line 278, in load_graph
    meta_graph_def, import_scope=import_scope, **saver_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1696, in _import_meta_graph_with_return_elements
    **kwargs))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/meta_graph.py", line 806, in import_scoped_meta_graph_with_return_elements
    return_elements=return_elements)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 442, in import_graph_def
    _ProcessNewOps(graph)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/importer.py", line 234, in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3440, in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3440, in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3299, in _create_op_from_tf_operation
    ret = Operation(c_op, self)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[node layer_1/conv3d/Conv3D (defined at /usr/local/lib/python3.5/dist-packages/tensorflow/contrib/predictor/saved_model_predictor.py:153)  = Conv3D[T=DT_FLOAT, data_format="NDHWC", dilations=[1, 1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1, 1], _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_85, layer_1/conv3d/kernel_m)]]
@satra
Collaborator

satra commented Nov 11, 2019

instead of this:

singularity run -e -B /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.430.50 \
-B /usr/lib/x86_64-linux-gnu/libcuda.so.1 neuronets-kwyk--version-0.4-gpu.sing \
raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out

can you try:

singularity run -e --nv neuronets-kwyk--version-0.4-gpu.sing \
raiders/sub-rid000005/anat/sub-rid000005_run-01_T1w.nii.gz out

@yarikoptic
Author

With --nv it used to halt; now (there is a bit more free memory) it proceeds to the same crash.

I found http://tuxvoid.blogspot.com/2017/08/tensorflow-could-not-create-cudnn.html referenced from
tensorflow/tensorflow#14048, suggesting that setting TensorFlow's allow_growth option

import tensorflow as tf

# let the GPU memory allocator grow on demand instead of pre-allocating
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

might help, but I could not figure out where in kwyk or nobrainer to tune that.
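
For reference, a minimal sketch of where such a config could be injected, going only by the traceback above (nobrainer's _get_predictor at nobrainer/predict.py:406 calls tf.contrib.predictor.from_saved_model, which accepts and forwards a config argument); this is an untested assumption about a possible local patch, not something kwyk or nobrainer expose today:

import tensorflow as tf

def _get_predictor(path):
    # hypothetical local patch to nobrainer/predict.py::_get_predictor:
    # build a session config with allow_growth enabled and hand it to the
    # saved-model predictor instead of relying on the default config
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    return tf.contrib.predictor.from_saved_model(str(path), config=config)

If that made the CUDNN_STATUS_INTERNAL_ERROR go away, it would point at TensorFlow's default up-front memory allocation colliding with what Xorg already holds on the card.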
