
Saturn at ICL

Recommended modules, circa December 2020 (in ~/.bash_profile):

module purge
module load git
module load cmake/3.18.2  # latest
module load gcc/7.3.0     # GNU gcc & g++. CUDA is picky about gcc version.
module load llvm          # clang & clang++
module load icc/2018      # Intel icc & icpc. SLATE fails with icc/2019.
module load cuda/11.1.0   # latest
module load intel-mpi
module load intel-mkl
module load openblas
module load python        # python 3
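
After sourcing ~/.bash_profile (or loading the modules by hand), a quick check along these lines confirms the toolchain is the expected one; the exact versions reported will depend on Saturn's current software stack:

module list           # should show the modules above
gcc --version         # expect gcc 7.3.0
nvcc --version        # expect CUDA 11.1
which mpicxx mpiicpc  # confirm the MPI compiler wrappers are on the PATH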

Installation

Using GNU compiler

  1. Load the following modules:

     module load gcc/7.3.0
     module load cuda/11.1.0
     module load intel-mpi
     module load intel-mkl
    
  2. Set make.inc with GNU compilers:

     CXX          = mpicxx
     FC           = mpif90
     blas         = mkl
     blas_fortran = gfortran  # default
     mkl_blacs    = intelmpi  # default
     cuda_arch    = pascal    # default (gtx1060 are pascal)
    

Note mpi=1, cuda=1, openmp=1 should be set automatically.
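
As a concrete sketch (assuming the usual convention that make.inc sits at the top of the SLATE source tree), the settings above can be written out and the build started as follows; the values marked as defaults in step 2 can be omitted:

# From the SLATE top-level directory, after loading the modules in step 1.
# blas_fortran, mkl_blacs, and cuda_arch take the default values shown in
# step 2, so only the non-default settings are listed here.
cat > make.inc << 'EOF'
CXX          = mpicxx
FC           = mpif90
blas         = mkl
EOF
nice make -j4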

Using Intel compiler

  1. Load the following modules:

     module load gcc/7.3.0
     module load icc/2018
     module load cuda/11.1.0
     module load intel-mpi    # was mpi/intel/2018
     module load intel-mkl
    
  2. Set make.inc with Intel compilers:

     CXX          = mpiicpc
     FC           = mpiifort
     LIBS         = -lifcore
     blas         = mkl
     blas_fortran = ifort     # was mkl_intel = 1
     mkl_blacs    = intelmpi  # default
     cuda_arch    = pascal    # default (gtx1060 are pascal)
    

Note mpi=1, cuda=1, openmp=1 should be set automatically.

Note: unfortunately, -std=c++17 seems to break the Intel compiler. Editing the GNUmakefile to use -std=c++11 allows most files to compile, but there is still an error with the omp taskloop in listBcastMT.
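
The equivalent sketch for the Intel build, under the same make.inc convention and using only the values from step 2:

# From the SLATE top-level directory, after loading the modules in step 1.
cat > make.inc << 'EOF'
CXX          = mpiicpc
FC           = mpiifort
LIBS         = -lifcore
blas         = mkl
blas_fortran = ifort
EOF
nice make -j4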

Compiling

On the head node, use nice make -j4. It is faster to compile in an interactive job on a compute node rather than on the head node, e.g.:

# Get node with gtx1060 GPU (b01 - b04) for 240 minutes.
[saturn ~]$ salloc -N 1 -C gtx1060 -t 240 srun --pty bash

# Compile and run interactively on that node.
[b01 ~/slate]$ nice make -j20
[b01 ~/slate]$ ./test/tester gemm
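
Once the build finishes, a quick smoke test on the same interactive node helps confirm that the GPU is visible and the tester runs; --help is assumed here to print the tester's available routines and options:

[b01 ~/slate]$ nvidia-smi             # confirm the gtx1060 is visible
[b01 ~/slate]$ ./test/tester --help   # list routines and options (assumed)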

Running

Example submission command on the saturn b nodes, assuming Intel MPI and MKL BLAS:

salloc -N 4 -w b[01-04] --tasks-per-node 1 env OMP_NUM_THREADS=20 OMP_NESTED=true \
OMP_DISPLAY_ENV=true MKL_NUM_THREADS=1 MKL_VERBOSE=0 \
mpirun -n 4 -env I_MPI_DEBUG=3 ./test/tester --type  --nb 352 --dim $[1024*4] \
--grid 2x2 --target d --lookahead 1 --ref n --check n --repeat 2 gemm

--tasks-per-node 1  otherwise the processes may get bound to the same node
-env I_MPI_DEBUG=3  so that Intel MPI prints the process-to-node mapping
OMP_NUM_THREADS=20  to avoid hyperthreading
MKL_NUM_THREADS=1   so each MKL call uses a single thread (SLATE provides the parallelism)
MKL_VERBOSE=1 gives verbose per-call output from MKL (threads used, etc.); it is set to 0 above to keep the output quiet
OMP_DISPLAY_ENV=true to show that OpenMP is set up correctly
These diagnostic settings can be turned off once the process/thread binding is confirmed to be correct.
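
Before scaling out to the 4-node job above, a single-node sanity run along these lines (same flags as above, with --check y assumed to enable the tester's built-in verification) can catch build or binding problems early:

# Single node, single rank, GPU target; verify results before scaling out.
salloc -N 1 -C gtx1060 -t 60 env OMP_NUM_THREADS=20 MKL_NUM_THREADS=1 \
mpirun -n 1 ./test/tester --nb 352 --dim $[1024*4] \
--grid 1x1 --target d --check y --ref n gemm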

The saturn b nodes have gaming-level NVidia GPUs: single-precision performance is reasonable, but double-precision performance is slow. These nodes are good for development and debugging, but performance bottlenecks may only show up when testing on faster GPUs (e.g., NVidia V100 GPUs on Summit at ORNL).

Performance
