Installing MVAPICH2 GDR
To install MVAPICH2-GDR, the easiest approach is to download the rpm and install it in your home directory.
For example, on x86 with CUDA 9.2 and OFED 4.3, this would be the appropriate rpm (download links available here):
wget http://mvapich.cse.ohio-state.edu/download/mvapich/gdr/2.3a/mofed4.3/mvapich2-gdr-mcast.cuda9.2.mofed4.3.gnu4.8.5-2.3a-2.el7.x86_64.rpm
You can then unpack the rpm into your home directory (run this from ${HOME}, since cpio extracts relative to the current directory):
rpm2cpio mvapich2-gdr-mcast.cuda9.2.mofed4.3.gnu4.8.5-2.3a-2.el7.x86_64.rpm | cpio -id
With this approach, the mpicc, mpicxx and mpifort wrapper scripts must be edited manually to point to the correct locations, since they assume that MVAPICH has been installed in /opt/... rather than ${HOME}/opt/... (a different sed delimiter, e.g. :, avoids having to escape each /). The command below assumes it is run from the directory that contains the wrappers:
sed -i -e 's:/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.2/mofed4.3/mpirun/gnu4.8.5:${HOME}/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.2/mofed4.3/mpirun/gnu4.8.5:g' mpicc mpicxx mpic++ mpif90 mpifort
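The effect of the single quotes in that command can be checked on a throwaway string before touching the real wrappers. The following is a minimal sketch, with a shortened, made-up `prefix=` line standing in for the wrapper contents (an assumption for illustration, not the actual text of mpicc): because the sed expression is single-quoted, `${HOME}` is written out literally and would only be expanded later, when the wrapper script itself runs.

```shell
# Hypothetical sample line; the real wrappers contain longer paths.
line='prefix=/opt/mvapich2/gdr/2.3a'

# ':' is the sed delimiter, so the '/' characters in the paths need no escaping.
# The single quotes stop the shell from expanding ${HOME} here.
rewritten=$(printf '%s\n' "$line" | sed -e 's:/opt/mvapich2:${HOME}/opt/mvapich2:g')

printf '%s\n' "$rewritten"
```

If the path should instead be fixed at edit time, double quotes around the sed expression would let the shell substitute the current value of `${HOME}`.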
Moreover, these wrappers also expect CUDA to be installed in /usr/local/cuda-9.2, but this may not be the case on all platforms, e.g., where modules are used. With this in mind, it is best to change this to an environment variable that is easily overridden:
sed -i -e 's:/usr/local/cuda-9.2:${CUDA_HOME}:g' mpicc mpicxx mpic++ mpif90 mpifort
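Since the wrappers then read `${CUDA_HOME}` when they run, that variable has to be defined in the environment before compiling. A minimal sketch, assuming that falling back to the original /usr/local/cuda-9.2 location is acceptable when nothing else (e.g. a module) has already set it:

```shell
# Keep any value that is already set (for instance by a CUDA module);
# otherwise fall back to the location the wrappers originally assumed.
export CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-9.2}"

echo "CUDA_HOME=${CUDA_HOME}"
```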
To use this installation, add the bin and lib64 directories to PATH and LD_LIBRARY_PATH, respectively, and then compile as normal.
export MPI_HOME=${HOME}/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.2/mofed4.3/mpirun/gnu4.8.5
export PATH=${MPI_HOME}/bin:$PATH
export LD_LIBRARY_PATH=${MPI_HOME}/lib64:$LD_LIBRARY_PATH
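A quick way to confirm that the shell now picks up the relocated wrapper rather than another MPI on the system is to compare what `command -v` reports with the expected location. This is only a sketch; the `status` variable and its three values are made up for illustration.

```shell
# Expected location of the relocated wrapper (same MPI_HOME as above).
expected="${MPI_HOME:-${HOME}/opt/mvapich2/gdr/2.3a/mcast/no-openacc/cuda9.2/mofed4.3/mpirun/gnu4.8.5}/bin/mpicc"

# Ask the shell which mpicc it would actually run (empty if none is found).
actual=$(command -v mpicc 2>/dev/null || true)

if [ -z "$actual" ]; then
  status="missing"      # no mpicc on PATH at all
elif [ "$actual" = "$expected" ]; then
  status="ok"           # the relocated wrapper comes first on PATH
else
  status="shadowed"     # a different MPI installation takes precedence
fi

echo "mpicc: ${status} (${actual:-not found})"
```

When the status is ok, `mpicc -show` (an option provided by MPICH-derived wrappers) prints the underlying compile line without running it, which makes it easy to check that the rewritten MPI and CUDA paths are the ones being used.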
Finally, on systems that do not automatically include the CUDA library path in the user's LD_LIBRARY_PATH, e.g., systems that use modules, a shared library error has been observed when using mpirun across multiple nodes:
hydra_pmi_proxy: error while loading shared libraries: libcudart.so.9.2: cannot open shared object file: No such file or directory
To fix this, set LD_LIBRARY_PATH to include ${CUDA_HOME}/lib64 in .bashrc, which ensures that all remote logins pick it up explicitly.
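As a sketch of what such a .bashrc entry might look like, the lines below prepend the CUDA library directory only if it is not already present, so that nested shells do not accumulate duplicates. The `cuda_lib` helper variable and the fallback to /usr/local/cuda-9.2 are assumptions for illustration; adjust them to the local CUDA installation.

```shell
# Directory holding libcudart.so; falls back to the default CUDA 9.2
# location if CUDA_HOME has not been set.
cuda_lib="${CUDA_HOME:-/usr/local/cuda-9.2}/lib64"

# Prepend the directory unless LD_LIBRARY_PATH already contains it.
case ":${LD_LIBRARY_PATH}:" in
  *":${cuda_lib}:"*) ;;  # already present, nothing to do
  *) export LD_LIBRARY_PATH="${cuda_lib}${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}" ;;
esac
```

On some distributions the default .bashrc returns early for non-interactive shells; in that case these lines would need to sit above that check so that remote launches see them.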
While MVAPICH2 supports the GDR Copy library for extremely low latency with small messages, it is not actually needed for running. Moreover, since it only applies to very small messages, it is likely not needed for most LQCD runs (though it may be beneficial for multigrid). If you don't have it installed, you can disable it by setting the environment variable: export MV2_USE_GPUDIRECT_GDRCOPY=0.