Scripts for getting started on Discovery, Northeastern's high-performance computing cluster
- Main site
- Documentation
- Tutorial (Requires NU ID)
Discovery uses Slurm to schedule jobs.
A simple sbatch script from the introductory tutorial:
#!/bin/bash
#SBATCH --partition=express
#SBATCH --job-name=test
#SBATCH --time=00:05:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --output=%j.output
#SBATCH --error=%j.error
echo "HELLO WORLD!"
Packages that aren't available as a module by default can be installed into your home directory with Spack.
Running Spark requires scheduling time on two different types of node (a Driver and one or more Workers), and Spark 3 is not currently available as a module, so running Spark 3 demonstrates several features of the environment.
First install Spack according to the documentation.
git clone https://github.com/spack/spack.git
# Schedule an environment that can handle a larger workload.
srun -p short --pty --export=ALL -N 1 -n 28 --exclusive /bin/bash
export SPACK_ROOT=/home/<yourusername>/spack
. $SPACK_ROOT/share/spack/setup-env.sh
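If the setup script was sourced correctly, the spack command is now available in your shell. A quick sanity check (standard Spack commands, nothing Discovery-specific):
spack --version       # confirm Spack is on your PATH
spack find            # list installed packages (empty on a fresh install)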
Install Spark 3 with Hadoop and OpenJDK 11.
spack install spark@3 +hadoop ^openjdk@11
spack install openjdk@11
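The build can take a while. Once it finishes, you can confirm the installs and put them on your PATH for the current session; this is a typical Spack workflow, and spack find will show whatever versions Spack actually resolved:
spack find            # should now list spark, openjdk, and their dependencies
spack load spark      # puts spark-submit and the Spark scripts on your PATH
spack load openjdk    # puts the matching JDK on your PATH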
Upload or move the sample sbatch script and pi.py to your home directory, then submit the job:
sbatch spark_with_slurm.sh
You should see output from Slurm confirming that the job has been scheduled with a job ID, and then <job ID>.error and <job ID>.output files will appear in your home directory. The .error file should contain a Spark log that includes how long the job took to run. You can see a list of your jobs at ood.discovery.neu.edu (NU ID required).
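If you prefer the command line to the Open OnDemand portal, the same information is available from standard Slurm tools:
squeue -u $USER       # jobs that are still pending or running
sacct -j <job ID>     # accounting record once the job has finished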
Increase the number of nodes at the top of the sbatch file:
.
.
.
#SBATCH --partition=express
#SBATCH --job-name=spark-cluster
#SBATCH --nodes=3
.
.
.
pi.py contains a perfectly parallel algorithm for estimating pi using a Monte Carlo method. Running the job with more nodes should increase performance.
If you find that a simple, parallelizable job is not scaling, it likely means you are running several Driver nodes instead of one Driver and several Workers. See spark_with_slurm.sh for details.
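For orientation, here is a rough sketch of the pattern such a launcher typically follows in a standard Spark standalone setup on Slurm. It is an illustration, not the contents of spark_with_slurm.sh; it assumes SPARK_HOME points at your Spark installation and uses Spark's default master port 7077.
# Sketch only: the master (and, in client deploy mode, the Driver) runs on the
# first node; srun starts one Worker process per allocated node; spark-submit
# then runs pi.py against that single master.
MASTER_HOST=$(hostname)
$SPARK_HOME/sbin/start-master.sh
srun $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://$MASTER_HOST:7077 &
sleep 15              # give the Workers time to register with the master
spark-submit --master spark://$MASTER_HOST:7077 pi.py
If each node instead launched its own master and ran its own spark-submit, you would get several independent Drivers rather than one Driver with several Workers, which is the scaling problem described above.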