
Saturn at ICL

Recommended modules, circa December 2020 (in ~/.bash_profile):

module purge
module load git
module load cmake/3.18.2  # latest
module load gcc/7.3.0     # GNU gcc & g++. CUDA is picky about gcc version.
module load llvm          # clang & clang++
module load icc/2018      # Intel icc & icpc. SLATE fails with icc/2019.
module load cuda/11.1.0   # latest
module load intel-mpi
module load intel-mkl
module load openblas
module load python        # python 3
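
After sourcing ~/.bash_profile (or loading the modules by hand), a quick check along these lines confirms the toolchain is the expected one; the exact versions reported will depend on Saturn's current software stack:

module list           # should show the modules above
gcc --version         # expect gcc 7.3.0
nvcc --version        # expect CUDA 11.1
which mpicxx mpiicpc  # confirm the MPI compiler wrappers are on the PATH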

Installation

Using GNU compiler

  1. Load the following modules:

     module load gcc/7.3.0
     module load cuda/11.1.0
     module load intel-mpi
     module load intel-mkl
    
  2. Set make.inc with GNU compilers:

     CXX          = mpicxx
     FC           = mpif90
     blas         = mkl
     blas_fortran = gfortran  # default
     mkl_blacs    = intelmpi  # default
     cuda_arch    = pascal    # default (gtx1060 are pascal)
    

Note mpi=1, cuda=1, openmp=1 should be set automatically.
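
As a concrete sketch (assuming the usual convention that make.inc sits at the top of the SLATE source tree), the settings above can be written out and the build started as follows; the values marked as defaults in step 2 can be omitted:

# From the SLATE top-level directory, after loading the modules in step 1.
# blas_fortran, mkl_blacs, and cuda_arch take the default values shown in
# step 2, so only the non-default settings are listed here.
cat > make.inc << 'EOF'
CXX          = mpicxx
FC           = mpif90
blas         = mkl
EOF
nice make -j4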

Using Intel compiler

  1. Load the following modules:

     module load gcc/7.3.0
     module load icc/2018
     module load cuda/11.1.0
     module load intel-mpi    # was mpi/intel/2018
     module load intel-mkl
    
  2. Set make.inc with Intel compilers:

     CXX          = mpiicpc
     FC           = mpiifort
     LIBS         = -lifcore
     blas         = mkl
     blas_fortran = ifort     # was mkl_intel = 1
     mkl_blacs    = intelmpi  # default
     cuda_arch    = pascal    # default (gtx1060 are pascal)
    

Note mpi=1, cuda=1, openmp=1 should be set automatically.

Note: unfortunately, -std=c++17 seems to break the Intel compiler. Editing the GNUmakefile to use -std=c++11 allows most files to compile, but there is still an error with the omp taskloop in listBcastMT.
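
The equivalent sketch for the Intel build, under the same make.inc convention and using only the values from step 2:

# From the SLATE top-level directory, after loading the modules in step 1.
cat > make.inc << 'EOF'
CXX          = mpiicpc
FC           = mpiifort
LIBS         = -lifcore
blas         = mkl
blas_fortran = ifort
EOF
nice make -j4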

Compiling

On the head node, use nice make -j4. It is faster to compile in an interactive job on a compute node rather than on the head node, e.g.:

# Get node with gtx1060 GPU (b01 - b04) for 240 minutes.
[saturn ~]$ salloc -N 1 -C gtx1060 -t 240 srun --pty bash

# Compile and run interactively on that node.
[b01 ~/slate]$ nice make -j20
[b01 ~/slate]$ ./test/tester gemm
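
Once the build finishes, a quick smoke test on the same interactive node helps confirm that the GPU is visible and the tester runs; --help is assumed here to print the tester's available routines and options:

[b01 ~/slate]$ nvidia-smi             # confirm the gtx1060 is visible
[b01 ~/slate]$ ./test/tester --help   # list routines and options (assumed)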

Running

Example submission command on the saturn b nodes, assuming Intel MPI and MKL BLAS:

salloc -N 4 -w b[01-04] --tasks-per-node 1 env OMP_NUM_THREADS=20 OMP_NESTED=true \
OMP_DISPLAY_ENV=true MKL_NUM_THREADS=1 MKL_VERBOSE=0 \
mpirun -n 4 -env I_MPI_DEBUG=3 ./test/tester --type  --nb 352 --dim $[1024*4] \
--grid 2x2 --target d --lookahead 1 --ref n --check n --repeat 2 gemm

--tasks-per-node 1  otherwise the processes may get bound to the same node
-env I_MPI_DEBUG=3  so that Intel MPI prints the process-to-node mapping
OMP_NUM_THREADS=20  to avoid hyperthreading
MKL_NUM_THREADS=1   so each MKL call uses a single thread (SLATE provides the parallelism)
MKL_VERBOSE=1 gives verbose per-call output from MKL (threads used, etc.); it is set to 0 above to keep the output quiet
OMP_DISPLAY_ENV=true to show that OpenMP is set up correctly
These diagnostic settings can be turned off once the process/thread binding is confirmed to be correct.
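
Before scaling out to the 4-node job above, a single-node sanity run along these lines (same flags as above, with --check y assumed to enable the tester's built-in verification) can catch build or binding problems early:

# Single node, single rank, GPU target; verify results before scaling out.
salloc -N 1 -C gtx1060 -t 60 env OMP_NUM_THREADS=20 MKL_NUM_THREADS=1 \
mpirun -n 1 ./test/tester --nb 352 --dim $[1024*4] \
--grid 1x1 --target d --check y --ref n gemm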

The saturn b nodes have gaming-level NVidia GPUs: single-precision performance is reasonable, but double-precision performance is slow. These nodes are good for development and debugging, but performance bottlenecks may only show up when testing on faster GPUs (e.g., NVidia V100 GPUs on Summit at ORNL).

Performance
