
Unable to load any of {libcudnn_ops.so.9.1.0, libcudnn_ops.so.9.1, libcudnn_ops.so.9, libcudnn_ops.so} Invalid handle. Cannot load symbol cudnnCreateTensorDescriptor #259

Open
IGaganpreetSingh opened this issue Oct 23, 2024 · 12 comments

Comments

@IGaganpreetSingh

Please help.

@MahmoudAshraf97
Owner

Please state which versions of torch and ctranslate2 you are using.

@roboatLee

Same issue here.

torch version: torch==2.5.0+cu121
ctranslate2 version: ctranslate2==4.5.0

@MahmoudAshraf97
Owner

If you are using Colab, downgrade ctranslate2 to 4.4.0.

@roboatLee

This is because you didn't add the directory containing libcudnn_ops.so.9.1.0 to your library path; you can use the command below:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/your/cudnn/lib

@IGaganpreetSingh
Author

It was working perfectly, but suddenly this error appeared on Tuesday. By the way, I am using it on a RunPod instance.
I will try this command.

@roboatLee

Same for me. By the way, that /path/to/your/cudnn/lib directory is inside your Anaconda Python environment; you can locate it with

find / | grep libcudnn_ops
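
If cuDNN came in through pip (the nvidia-cudnn-cu12 wheel) rather than through the system package manager, here is a minimal Python sketch to print that directory (an illustration only, assuming the usual nvidia/cudnn/lib wheel layout shown later in this thread):

# print the lib directory bundled with the pip-installed cuDNN wheel, if present
import os
import nvidia.cudnn

libdir = os.path.join(os.path.dirname(nvidia.cudnn.__file__), "lib")
print(libdir)
print(sorted(f for f in os.listdir(libdir) if f.startswith("libcudnn")))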

@IGaganpreetSingh
Author

Thanks @roboatLee @MahmoudAshraf97 for the help. I found the solution:
apt-get install libcudnn9-cuda-12=9.5.1.17-1 (for CUDA 12)
apt-get install libcudnn9-cuda-11=9.5.1.17-1 (for CUDA 11)
find / | grep libcudnn_ops
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/your/cudnn/lib
or simply copy libcudnn_ops.so.9 to the location where the script is actually looking for it.
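
As a quick sanity check after setting the path, here is a minimal Python sketch (an illustration, not part of the fix above) that tries to dlopen the same names listed in the error; run it in the shell where you exported LD_LIBRARY_PATH:

# try each soname from the error message and report whether the missing symbol resolves
import ctypes

for name in ("libcudnn_ops.so.9.1.0", "libcudnn_ops.so.9.1", "libcudnn_ops.so.9", "libcudnn_ops.so"):
    try:
        lib = ctypes.CDLL(name)
        print(f"loaded {name}; cudnnCreateTensorDescriptor present: {hasattr(lib, 'cudnnCreateTensorDescriptor')}")
        break
    except OSError as exc:
        print(f"could not load {name}: {exc}")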

@kirahman2

kirahman2 commented Oct 29, 2024

When running the code block below, I am getting the error RuntimeError: cuDNN error: CUDNN_STATUS_SUBLIBRARY_VERSION_MISMATCH

# Initialize NeMo MSDD diarization model
msdd_model = NeuralDiarizer(cfg=create_config(temp_path)).to("cuda")
msdd_model.diarize()

del msdd_model
torch.cuda.empty_cache()

Here are my python library versions

ctranslate2==4.5.0
torch==2.5.0
CUDA 12.4
nvidia-cuda-cupti-cu12      12.4.127
nvidia-cuda-nvrtc-cu12      12.4.127
nvidia-cuda-runtime-cu12    12.4.127

This is the libcudnn_ops location

root@53a5b6645406:/container/work/whisper-diarization2/whisper-diarization# find / |grep libcudnn_ops
/usr/local/lib/python3.10/dist-packages/nvidia/cudnn/lib/libcudnn_ops.so.9
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9

I ran apt-get install libcudnn9-cuda-12 to get to this point.

I added the export path
export LD_LIBRARY_PATH=/usr/local/cuda:/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

Here is the full output

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[15], line 2
      1 # Initialize NeMo MSDD diarization model
----> 2 msdd_model = NeuralDiarizer(cfg=create_config(temp_path)).to("cuda")
      3 msdd_model.diarize()
      5 del msdd_model

File /usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/msdd_models.py:994, in NeuralDiarizer.__init__(self, cfg)
    989 self.max_pred_length = cfg.diarizer.msdd_model.parameters.get('max_pred_length', 0)
    990 self.diar_eval_settings = cfg.diarizer.msdd_model.parameters.get(
    991     'diar_eval_settings', [(0.25, True), (0.25, False), (0.0, False)]
    992 )
--> 994 self._init_msdd_model(cfg)
    995 self.diar_window_length = cfg.diarizer.msdd_model.parameters.diar_window_length
    996 self.msdd_model.cfg = self.transfer_diar_params_to_model_params(self.msdd_model, cfg)

File /usr/local/lib/python3.10/dist-packages/nemo/collections/asr/models/msdd_models.py:1096, in NeuralDiarizer._init_msdd_model(self, cfg)
   1094         logging.warning(f"requested {model_path} model name not available in pretrained models, instead")
   1095     logging.info("Loading pretrained {} model from NGC".format(model_path))
-> 1096     self.msdd_model = EncDecDiarLabelModel.from_pretrained(model_name=model_path, map_location=cfg.device)
   1097 # Load speaker embedding model state_dict which is loaded from the MSDD checkpoint.
   1098 if self.use_speaker_model_from_ckpt:

File /usr/local/lib/python3.10/dist-packages/nemo/core/classes/common.py:754, in Model.from_pretrained(cls, model_name, refresh_cache, override_config_path, map_location, strict, return_config, trainer, save_restore_connector)
    748 else:
    749     # NGC source
    750     class_, nemo_model_file_in_cache = cls._get_ngc_pretrained_model_info(
    751         model_name=model_name, refresh_cache=refresh_cache
    752     )
--> 754 instance = class_.restore_from(
    755     restore_path=nemo_model_file_in_cache,
    756     override_config_path=override_config_path,
    757     map_location=map_location,
    758     strict=strict,
    759     return_config=return_config,
    760     trainer=trainer,
    761     save_restore_connector=save_restore_connector,
    762 )
    763 return instance

File /usr/local/lib/python3.10/dist-packages/nemo/core/classes/modelPT.py:464, in ModelPT.restore_from(cls, restore_path, override_config_path, map_location, strict, return_config, save_restore_connector, trainer)
    461 app_state.model_restore_path = restore_path
    463 cls.update_save_restore_connector(save_restore_connector)
--> 464 instance = cls._save_restore_connector.restore_from(
    465     cls, restore_path, override_config_path, map_location, strict, return_config, trainer
    466 )
    467 if isinstance(instance, ModelPT):
    468     instance._save_restore_connector = save_restore_connector

File /usr/local/lib/python3.10/dist-packages/nemo/core/connectors/save_restore_connector.py:255, in SaveRestoreConnector.restore_from(self, calling_cls, restore_path, override_config_path, map_location, strict, return_config, trainer)
    230 """
    231 Restores model instance (weights and configuration) into .nemo file
    232 
   (...)
    251     An instance of type cls or its underlying config (if return_config is set).
    252 """
    253 # Get path where the command is executed - the artifacts will be "retrieved" there
    254 # (original .nemo behavior)
--> 255 loaded_params = self.load_config_and_state_dict(
    256     calling_cls, restore_path, override_config_path, map_location, strict, return_config, trainer,
    257 )
    258 if not isinstance(loaded_params, tuple) or return_config is True:
    259     return loaded_params

File /usr/local/lib/python3.10/dist-packages/nemo/core/connectors/save_restore_connector.py:179, in SaveRestoreConnector.load_config_and_state_dict(self, calling_cls, restore_path, override_config_path, map_location, strict, return_config, trainer)
    177 calling_cls._set_model_restore_state(is_being_restored=True, folder=tmpdir)
    178 instance = calling_cls.from_config_dict(config=conf, trainer=trainer)
--> 179 instance = instance.to(map_location)
    180 # add load_state_dict override
    181 if app_state.model_parallel_size is not None and app_state.model_parallel_size > 1:

File /usr/local/lib/python3.10/dist-packages/lightning_fabric/utilities/device_dtype_mixin.py:55, in _DeviceDtypeModuleMixin.to(self, *args, **kwargs)
     53 device, dtype = torch._C._nn._parse_to(*args, **kwargs)[:2]
     54 _update_properties(self, device=device, dtype=dtype)
---> 55 return super().to(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1340, in Module.to(self, *args, **kwargs)
   1337         else:
   1338             raise
-> 1340 return self._apply(convert)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:900, in Module._apply(self, fn, recurse)
    898 if recurse:
    899     for module in self.children():
--> 900         module._apply(fn)
    902 def compute_should_use_set_data(tensor, tensor_applied):
    903     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    904         # If the new tensor has compatible tensor type as the existing tensor,
    905         # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    910         # global flag to let the user control whether they want the future
    911         # behavior of overwriting the existing tensor or not.

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:900, in Module._apply(self, fn, recurse)
    898 if recurse:
    899     for module in self.children():
--> 900         module._apply(fn)
    902 def compute_should_use_set_data(tensor, tensor_applied):
    903     if torch._has_compatible_shallow_copy_type(tensor, tensor_applied):
    904         # If the new tensor has compatible tensor type as the existing tensor,
    905         # the current behavior is to change the tensor in-place using `.data =`,
   (...)
    910         # global flag to let the user control whether they want the future
    911         # behavior of overwriting the existing tensor or not.

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py:288, in RNNBase._apply(self, fn, recurse)
    283 ret = super()._apply(fn, recurse)
    285 # Resets _flat_weights
    286 # Note: be v. careful before removing this, as 3rd party device types
    287 # likely rely on this behavior to properly .to() modules like LSTM.
--> 288 self._init_flat_weights()
    290 return ret

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py:215, in RNNBase._init_flat_weights(self)
    208 self._flat_weights = [
    209     getattr(self, wn) if hasattr(self, wn) else None
    210     for wn in self._flat_weights_names
    211 ]
    212 self._flat_weight_refs = [
    213     weakref.ref(w) if w is not None else None for w in self._flat_weights
    214 ]
--> 215 self.flatten_parameters()

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/rnn.py:269, in RNNBase.flatten_parameters(self)
    267 if self.proj_size > 0:
    268     num_weights += 1
--> 269 torch._cudnn_rnn_flatten_weight(
    270     self._flat_weights,
    271     num_weights,
    272     self.input_size,
    273     rnn.get_cudnn_mode(self.mode),
    274     self.hidden_size,
    275     self.proj_size,
    276     self.num_layers,
    277     self.batch_first,
    278     bool(self.bidirectional),
    279 )

RuntimeError: cuDNN error: CUDNN_STATUS_SUBLIBRARY_VERSION_MISMATCH

To get to this point, I installed the requirements on a fresh Ubuntu Docker container. I pulled the Dockerfile from here: https://github.com/SYSTRAN/faster-whisper/blob/master/docker/Dockerfile

How do I fix this error?

@MahmoudAshraf97
Owner

You need to have matching cuDNN versions.
You have 9.5.1 installed, but I don't know which one torch uses; check with torch.backends.cudnn.version().
If torch installed nvidia-cudnn-cu12, then you need to remove one of the two installations.
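
For example, a minimal sketch to compare the two (just an illustration of the check above; it only prints versions and changes nothing):

# compare the cuDNN build torch reports with the pip-installed nvidia-cudnn-cu12 package, if any
import torch
from importlib import metadata

print("torch cuDNN:", torch.backends.cudnn.version())  # e.g. 90100 means cuDNN 9.1.0

try:
    print("pip nvidia-cudnn-cu12:", metadata.version("nvidia-cudnn-cu12"))
except metadata.PackageNotFoundError:
    print("nvidia-cudnn-cu12 is not pip-installed; torch is picking up another cuDNN")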

@kirahman2

kirahman2 commented Nov 1, 2024

I've been touched by god.
I was stuck because I was using the Docker container
nvidia/cuda:12.1.0-runtime-ubuntu22.04
when I should have been using
nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
That mistake cost me about 40-50 hours.

For anyone stuck on this error while using Docker, here is the solution. It uses whisperx only because my notebook still utilizes whisperx.

# pull the docker image below from NVIDIA's website
# https://hub.docker.com/r/nvidia/cuda/tags?page=2&name=12.1 
sudo docker pull nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04
########################################
# standard installs for all containers #
########################################
apt update
apt install -y python3.10
apt install -y python3.10-dev
python3.10 --version
# install sudo
apt update
apt install sudo -y
# install pip and pip3
sudo apt update
sudo apt install python3-pip -y
# install git
sudo apt update
sudo apt install git -y
# install nano
sudo apt update
sudo apt install nano -y
# install wget
sudo apt update
sudo apt install wget -y
pip install --upgrade pip
pip install ipykernel
pip install jupyterlab
pip install numpy
pip install ipywidgets
#python -m ipykernel install --user --name $ENV_NAME
# mahmoud whisper commands
sudo apt update && sudo apt install cython3 --yes
sudo apt update && sudo apt install ffmpeg --yes
###############################################
# end of standard installs for all containers #
###############################################

# commenting out requirements.txt command because we will 
# install the libraries based from the pytorch website. 
# This is based on what google colab is using. 
# pip install -c constraints.txt -r requirements.txt

# torch installation packages based on google colab as of 10/31/24
# https://download.pytorch.org/whl/torch/ 
pip install https://download.pytorch.org/whl/cu121_full/torch-2.5.0%2Bcu121-cp310-cp310-linux_x86_64.whl
pip install https://download.pytorch.org/whl/cu121_full/torchaudio-2.5.0%2Bcu121-cp310-cp310-linux_x86_64.whl
pip install torchsummary==1.5.1
pip install https://download.pytorch.org/whl/cu121_full/torchvision-0.20.0%2Bcu121-cp310-cp310-linux_x86_64.whl

# Since we are using a docker container with cudnn from nvidia
# we can skip the commands below

# add to bashrc
# whereis cuda
# nano ~/.bashrc
# export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
# export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
# source ~/.bashrc

# check cuda version
# nvcc --version

# These pip installs come directly from Mahmoud's google 
# colab notebook 
# https://colab.research.google.com/github/MahmoudAshraf97/whisper-diarization/blob/main/Whisper_Transcription_%2B_NeMo_Diarization.ipynb#scrollTo=ye1FJVFRO30B 
# we are running these commands in our docker container to try and emulate
# google colab's environment for the purpose of using the 
# whisper-diarization repo + the original whisperx

pip install git+https://github.com/SYSTRAN/faster-whisper.git ctranslate2==4.4.0
pip install "nemo-toolkit[asr]>=2.dev"
pip install git+https://github.com/MahmoudAshraf97/demucs.git
pip install git+https://github.com/oliverguhr/deepmultilingualpunctuation.git
pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git

# we still need to install whisperx for our original environment
pip install git+https://github.com/m-bain/whisperX.git@78dcfaab51005aa703ee21375f81ed31bc248560

# launch jupyter lab
jupyter lab

Additional Notes:
If you are connecting your GPU to a Docker container, apply the setup below on your host machine.

# if you are setting up a freshly installed ubuntu host, run
# these commands so that docker can access the GPU
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nano /etc/docker/daemon.json
# add this to daemon.json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
# then reboot docker
sudo systemctl restart docker
# to access GPU from the docker container run 
sudo docker run --gpus all -it -p 8889:8889 \
    -v /home/khalid/work:/container/work \
    -v /home/khalid/whisper_data:/container/whisper_data \
    jupyter:latest /bin/bash

jupyter lab --ip=0.0.0.0 --port=8889 --allow-root --no-browser

Update:
Here is the Dockerfile

# Start with NVIDIA's CUDA base image
FROM nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04

# Install Python 3.10 and required packages
RUN apt update && \
    apt install -y python3.10 python3.10-dev sudo && \
    python3.10 --version

# Install pip, git, nano, wget, and upgrade pip
RUN apt update && \
    apt install -y python3-pip git nano wget && \
    pip install --upgrade pip

# Install standard Python libraries and Jupyter tools
RUN pip install ipykernel jupyterlab numpy ipywidgets

# Install Cython and FFmpeg
RUN apt update && \
    apt install -y cython3 ffmpeg

# Install PyTorch and related packages for CUDA 12.1
RUN pip install https://download.pytorch.org/whl/cu121_full/torch-2.5.0%2Bcu121-cp310-cp310-linux_x86_64.whl && \
    pip install https://download.pytorch.org/whl/cu121_full/torchaudio-2.5.0%2Bcu121-cp310-cp310-linux_x86_64.whl && \
    pip install torchsummary==1.5.1 && \
    pip install https://download.pytorch.org/whl/cu121_full/torchvision-0.20.0%2Bcu121-cp310-cp310-linux_x86_64.whl

# Install Whisper and related libraries from GitHub
RUN pip install git+https://github.com/SYSTRAN/faster-whisper.git ctranslate2==4.4.0 && \
    pip install "nemo-toolkit[asr]>=2.dev" && \
    pip install git+https://github.com/MahmoudAshraf97/demucs.git && \
    pip install git+https://github.com/oliverguhr/deepmultilingualpunctuation.git && \
    pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git && \
    pip install git+https://github.com/m-bain/whisperX.git@78dcfaab51005aa703ee21375f81ed31bc248560

# Set the working directory (optional)
WORKDIR /workspace

# Set the default command to bash (optional)
CMD ["/bin/bash"]

@JokanaanR

I'm getting the same error in Google Colab:
Unable to load any of {libcudnn_ops.so.9.1.0, libcudnn_ops.so.9.1, libcudnn_ops.so.9, libcudnn_ops.so} Invalid handle. Cannot load symbol cudnnCreateTensorDescriptor

To be honest I don't quite understand the solution to get it working again. What should I do exactly?

@kirahman2

@JokanaanR refer to this post OpenNMT/CTranslate2#1806 (comment)
