-
Notifications
You must be signed in to change notification settings - Fork 12
GPU node preparation
The latest cuda driver 510 only support those cards, if you have older card, e.g., K80, install 470.
remove existing one and install the latest
sudo apt clean
sudo apt update
sudo apt purge nvidia-*
sudo apt autoremove
sudo apt-get install nvidia-driver-510 -y
If you have some broken package, purge nvidia-*
failed, do it mannually, like dpkg --force-all -P cuda
dpkg --force-all -P [name of package with unmet dependency]
To verify driver version
ls /usr/src | grep nvidia
use nvidia-smi
to do the final check
If get Error 1: Failed to initialize NVML: Driver/library version mismatch
, reboot and retry
If get Error 2: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
, the cause may be the card is too old. Latest driver does not support.
Follow the commands from nvidia website
However, if your cards are Kepler or older, which are not supported by driver verision 510, don't directly install cuda like the following command mentioned in above website.
sudo apt-get install cuda=11.4.4-1
The cuda package will automatically remove older driver, and install 510, which will lead to Error 2 above.
Instead of installing cuda, we can install cuda-toolkit, which won't automatically update driver version. Refer to this page for more info.
sudo apt install cuda-toolkit-11-4
Lastly, add cuda path to your path
export PATH=$PATH:/usr/local/cuda/bin
To verify installed cuda version
nvcc --version
Nvidia contaienr runtime is a modified version of runc adding a custom pre-start hook to all containers. By using it, all the gpu devices are exposed to container. Refer to Nvidia's github and installation sites.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
Note: If install nvidia-docker2 causing umet dependency, you may want to upgrade docker version. With Ubuntu 18.04, nvidia-docker2=2.9.1-1 has issues with docker=19.XX, after upgarde docker to 20.10.1X, nvidia-docker2 can be auto installed.
After install nvidia docker runtime, you can verify the gpu access from contaienrs with the following command.
docker run –rm –runtime=nvidia nvidia/cuda:11.0-base nvidia-smi
We can also configure nvidia docker runtime as default (replace runc) by add the followings to the file /etc/docker/daemon.json and restart docker systemctl restart docker
{
"default-runtime":"nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
if systemd driver is configured on master, add this value "exec-opts": ["native.cgroupdriver=systemd"]
WARNINFS: if use nvidia docker container and Nvidia device plugin in the Kubernetes, and users dont set the GPU requirements in their yaml file, all the gpus will be exposed, which may cause resource management conflicts.
Install cuDNN on the host, only if you want to directly run training on the host machine and you are using tensorflow (for pytorch cudnn is built into its binary). Often time, we use containers with pre-built images which come with this library already.
The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.
Deep learning researchers and framework developers worldwide rely on cuDNN for high-performance GPU acceleration. If not installed, errors like those Could not load dynamic library 'libcudnn.so.8'
will show.
You can download the deb file from Nvidia website, need to fill out a survey first.
Runtime library should be enough, for example
cuDNN Runtime Library for Ubuntu18.04 x86_64 (Deb)
.
Install is like this.
sudo dpkg -i libcudnn8_8.2.4.15-1+cuda11.4_amd64.deb