Slurm is an open-source job scheduling system for Linux clusters, most frequently used for high-performance computing (HPC) applications. This guide covers the basics of getting started with Slurm as a user. For more information, the Slurm docs are a good place to start.
After Slurm is deployed on a cluster, a slurmd daemon runs on each compute node. Users do not log directly into each compute node to do their work. Instead, they execute Slurm commands (e.g., srun, sinfo, scancel, scontrol) from a Slurm login node. These commands communicate with the slurmd daemons on each host to perform work.
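Any of these commands can be run from the login node once you are connected. For example, scontrol can show the detailed configuration of a single node; a minimal sketch (the node name dgx1 is taken from the examples below):
scontrol show node dgx1
This prints the node's state, CPU, memory, and GPU (Gres) resources, and the partitions it belongs to.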
To "see" the cluster, ssh to the Slurm login node for your cluster and run the sinfo command:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 9 idle dgx[1-9]
There are 9 nodes available on this system, all in an idle state. If a node is busy, its state will change from idle to alloc:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 1 alloc dgx1
batch* up infinite 8 idle dgx[2-9]
The sinfo command can be used to output a lot more information about the cluster. Check out the sinfo doc for more.
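For example, a node-oriented, long-format listing shows per-node details; a minimal sketch (the exact columns vary by Slurm version and site):
sinfo -N -l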
To run a job, use the srun command:
srun hostname
dgx1
What happened here? With the srun command, we instructed Slurm to find the first available node and run hostname on it. It returned the result in our command prompt. It's just as easy to use srun to run a different command, a Python script, or a container.
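For example; a minimal sketch (train.py is a hypothetical script, and the container image is the one used in the MPI example at the end of this guide):
srun python train.py
srun singularity exec docker://nvcr.io/nvidia/tensorflow:19.05-py3 python train.py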
Most of the time, scheduling an entire system is not necessary, and it's better to request only a portion of the GPUs:
srun --gres=gpu:2 env | grep CUDA
CUDA_VISIBLE_DEVICES=0,1
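The GPU request can also be combined with other resource options such as CPUs and memory; a hedged sketch (the specific counts are assumptions and depend on the site configuration):
srun --gres=gpu:2 --cpus-per-task=16 --mem=64G env | grep CUDA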
Or, conversely, sometimes it's necessary to run on multiple systems:
srun --ntasks 2 -l hostname
dgx1
dgx2
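To request whole machines rather than individual tasks, use the --nodes option; a minimal sketch assuming one task per node:
srun --nodes 2 --ntasks-per-node=1 -l hostname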
Especially when developing and experimenting, it's helpful to run an interactive job, which requests a resource and provides a command prompt as an interface to it:
slurm-login:~$ srun --pty /bin/bash
dgx1:~$ hostname
dgx1
dgx1:~$ exit
In an interactive session, the resource stays reserved until the prompt is exited (as shown above), and commands can be run in succession.
Note: before starting an interactive session with srun, it may be helpful to create a session on the login node with a tool like tmux or screen. This will prevent a user from losing interactive jobs if there is a network outage or the terminal is closed.
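For example; a minimal sketch (the session name is arbitrary):
tmux new -s interactive
srun --pty /bin/bash
If the connection drops, the session can be reattached later with tmux attach -t interactive.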
While the srun command blocks any other execution in the terminal, sbatch can be used to queue a job for execution once resources are available in the cluster. A batch job also lets you queue up several jobs that run as nodes become available. It's therefore good practice to encapsulate everything that needs to be run into a script and execute it with sbatch rather than srun:
cat script.sh
#!/bin/bash
/bin/hostname
sleep 30
sbatch script.sh
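sbatch options can also be embedded in the script itself as #SBATCH directives, so the resource request travels with the job; a hedged sketch (the script name, job name, GPU count, and time limit are all assumptions):
cat batch.sh
#!/bin/bash
#SBATCH --job-name=example   # job name shown in squeue
#SBATCH --gres=gpu:2         # request two GPUs
#SBATCH --time=00:30:00      # wall-clock limit
#SBATCH --output=%x-%j.out   # log file named after job name and ID
/bin/hostname
sleep 30
sbatch batch.sh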
To see which jobs are running in the cluster, use the squeue command:
squeue -a -l
Tue Nov 17 19:08:18 2020
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
9 batch bash user01 RUNNING 5:43 UNLIMITED 1 dgx1
To see just the jobs for a particular user USERNAME:
squeue -l -u USERNAME
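To restrict that further to only running jobs, add a state filter; a minimal sketch:
squeue -l -u USERNAME -t RUNNING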
To cancel a job, use the squeue command to look up the JOBID and the scancel command to cancel it:
squeue
scancel JOBID
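All of a user's jobs can also be cancelled at once; a minimal sketch:
scancel -u USERNAME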
To run a deep learning job with multiple processes, use MPI:
srun -p PARTITION --pty /bin/bash
singularity pull docker://nvcr.io/nvidia/tensorflow:19.05-py3
singularity run docker://nvcr.io/nvidia/tensorflow:19.05-py3
cd /opt/tensorflow/nvidia-examples/cnn/
mpiexec --allow-run-as-root -np 2 python resnet.py --layers=50 --batch_size=32 --precision=fp16 --num_iter=50
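The same workload can also be submitted non-interactively by wrapping it in a batch script; a hedged sketch (the script name, partition, and task count are assumptions, and it assumes the container image has already been pulled and provides mpiexec as above):
cat mpi_job.sh
#!/bin/bash
#SBATCH -p PARTITION
#SBATCH --ntasks 2
singularity exec docker://nvcr.io/nvidia/tensorflow:19.05-py3 \
    mpiexec --allow-run-as-root -np 2 \
    python /opt/tensorflow/nvidia-examples/cnn/resnet.py --layers=50 --batch_size=32 --precision=fp16 --num_iter=50
sbatch mpi_job.sh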