# Good code practices

- Portability: note software dependencies and distribute environments or containers (see the sketch below).
- Software and scripts should be like a lab book: if it can't be reproduced, it can't be used.
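As a minimal sketch of the portability point above (assuming conda is the environment manager in use), an environment can be captured and recreated elsewhere:

```
# export the active environment to a file others can use
$ conda env export > environment.yml

# recreate it on another machine
$ conda env create -f environment.yml
```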

# Git and GitHub

- git
- GitHub
- How to get git to work on servers (SSH key, see the sketch below)
- Building containers (maybe through GitHub Actions too)
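A minimal sketch of the usual SSH key setup for a server or GitHub (key type and paths are the common defaults, not project-specific):

```
# generate a key pair on your own machine (accept the default path)
$ ssh-keygen -t ed25519 -C "your@email"

# copy the public key to a server's authorized keys
$ ssh-copy-id user@server

# for GitHub, paste the contents of ~/.ssh/id_ed25519.pub under
# Settings -> SSH and GPG keys, then test the connection
$ ssh -T git@github.com
```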

# Other

- `rslurm`
- How to use common tools

# Snakemake
Declarative, works nicely with checkpoints, and integrates directly with SLURM (see the sketch below)!
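A hedged sketch of what the SLURM integration can look like; the exact flags depend on the Snakemake version (newer releases use a SLURM executor plugin instead of `--cluster`), so treat this as an assumption, not a recipe:

```
# submit each rule as its own SLURM job, at most 10 in flight
$ snakemake --jobs 10 \
    --cluster "sbatch --partition=biocloud-cpu --cpus-per-task={threads} --mem={resources.mem_mb}"
```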

# Web portal access

## Open OnDemand
Will be documented once set up. See https://curc.readthedocs.io/en/latest/gateways/OnDemand.html

# Introduction to SLURM
SLURM (Simple Linux Utility for Resource Management) is a highly flexible and powerful job scheduler for managing and scheduling computational workloads on high-performance computing (HPC) clusters. SLURM is designed to efficiently allocate resources and manage job execution on clusters of any size, from a single server to tens of thousands of nodes. SLURM manages resources on an HPC cluster by dividing them into partitions. Users submit jobs to these partitions from a login node, and the SLURM controller then schedules and allocates resources to those jobs based on available resources and user-defined constraints. SLURM also stores detailed usage information about all users' jobs in an accounting database, which allows fair-share policies and scheduling priorities to be enforced per partition.

The BioCloud servers are currently divided into two partitions with the same usage policies (currently no limits; FIFO, first-in-first-out): `biocloud-cpu` for CPU-intensive jobs and `biocloud-gpu` for jobs that benefit from a GPU.
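For a first impression, a minimal job script might look like the sketch below (the job name and resource values are arbitrary examples); it is submitted to a partition from a login node with `sbatch`:

```
#!/usr/bin/env bash
#SBATCH --job-name=example
#SBATCH --partition=biocloud-cpu
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=0-01:00:00

# the actual work goes here
echo "Running on $(hostname)"
```

Submit it with `sbatch example.sh` and follow it in the queue with `squeue`, as shown below.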

**Overview figure here**

## Getting an overview
To start with, it's always nice to get an overview of the cluster, its partitions, and how many resources are currently allocated. This is achieved with the `sinfo` command, example output:

```
$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
biocloud-cpu*    up 14-00:00:0      1   idle bio-oscloud04
```

To get an overview of running jobs use `squeue`, example output:
```
# everything
$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   24 biocloud- interact ksa@bio.  R  2:15     1 bio-oscloud04

# specific user (usually yourself)
$ squeue -u $(whoami)
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   35 biocloud- interact ksa@bio.  R  8:28     1 bio-oscloud04
```

To get live information about the whole cluster, the resource utilization of individual nodes, the number of running SLURM jobs, etc., visit the [Grafana dashboard](http://bio-ospikachu04.srv.aau.dk:3000/).

# Job control and usage accounting
Below are some nice-to-know commands for controlling and checking up on jobs, both current and past.

## Get job status info
Use [`squeue`](https://slurm.schedmd.com/squeue.html), for example:
```
$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   24 biocloud- interact ksa@bio.  R  2:15     1 bio-oscloud04
```

??? "Job state codes (ST)"
    | Status | Code | Explanation |
    | --- | --- | --- |
    | COMPLETED | CD | The job has completed successfully. |
    | COMPLETING | CG | The job is finishing, but some processes are still active. |
    | FAILED | F | The job terminated with a non-zero exit code and failed to execute. |
    | PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
    | PREEMPTED | PR | The job was terminated because of preemption by another job. |
    | RUNNING | R | The job is currently allocated to a node and running. |
    | SUSPENDED | S | A running job has been stopped, with its cores released to other jobs. |
    | STOPPED | ST | A running job has been stopped, with its cores retained. |

    A complete list can be found in SLURM's [documentation](https://slurm.schedmd.com/squeue.html#lbAG).

??? "Job reason codes (REASON)"
    | Reason Code | Explanation |
    | --- | --- |
    | Priority | One or more higher-priority jobs are queued ahead of yours. Your job will eventually run. |
    | Dependency | This job is waiting for a dependent job to complete and will run afterwards. |
    | Resources | The job is waiting for resources to become available and will eventually run. |
    | InvalidAccount | The job's account is invalid. Cancel the job and rerun with a correct account. |
    | InvalidQoS | The job's QoS is invalid. Cancel the job and rerun with a correct QoS. |
    | QOSGrpCpuLimit | All CPUs assigned to your job's specified QoS are in use; the job will run eventually. |
    | QOSGrpMaxJobsLimit | The maximum number of jobs for your job's QoS has been met; the job will run eventually. |
    | QOSGrpNodeLimit | All nodes assigned to your job's specified QoS are in use; the job will run eventually. |
    | PartitionCpuLimit | All CPUs assigned to your job's specified partition are in use; the job will run eventually. |
    | PartitionMaxJobsLimit | The maximum number of jobs for your job's partition has been met; the job will run eventually. |
    | PartitionNodeLimit | All nodes assigned to your job's specified partition are in use; the job will run eventually. |
    | AssociationCpuLimit | All CPUs assigned to your job's specified association are in use; the job will run eventually. |
    | AssociationMaxJobsLimit | The maximum number of jobs for your job's association has been met; the job will run eventually. |
    | AssociationNodeLimit | All nodes assigned to your job's specified association are in use; the job will run eventually. |

    A complete list can be found in SLURM's [documentation](https://slurm.schedmd.com/squeue.html#lbAF).

## Cancel a job
With `sbatch` you can't just hit CTRL+C to stop what's running like you're used to in a terminal. Instead, you must use [`scancel`](https://slurm.schedmd.com/scancel.html). Get the job ID from `squeue -u $(whoami)`, then cancel the job, for example:
```
$ scancel 24
```
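
To cancel all of your own jobs at once (a small sketch; `-u` filters by user):
```
$ scancel -u $(whoami)
```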

If the job doesn't stop or respond, consider using [`skill`](https://slurm.schedmd.com/skill.html) instead.

## Pause or resume a job
Use [`scontrol`](https://slurm.schedmd.com/scontrol.html) to control your own jobs, for example to suspend a running job:
```
$ scontrol suspend 24
```

Resume it again with:
```
$ scontrol resume 24
```

## Adjust allocated resources
It's also possible to adjust the resources allocated to a running job, freeing them up for others to use without having to stop anything, for example:
```
$ scontrol update JobId=24 NumNodes=1 NumTasks=1 CPUsPerTask=1
```
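
You can then verify that the change took effect with standard shell tools, for example:
```
$ scontrol show job 24 | grep -E "NumCPUs|TRES"
```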

## Job status information
Use [`sstat`](https://slurm.schedmd.com/sstat.html) to show status and usage accounting information for your running jobs:
```
$ sstat
```

With additional details:
```
$ sstat --jobs=your_job-id --format=jobid,avecpu,maxrss,ntasks
```

??? "Useful format variables"
    | Variable | Description |
    | --- | --- |
    | avecpu | Average CPU time of all tasks in the job. |
    | averss | Average resident set size of all tasks in the job. |
    | avevmsize | Average virtual memory size of all tasks in the job. |
    | jobid | The ID of the job. |
    | maxrss | Maximum resident set size of all tasks in the job. |
    | maxvmsize | Maximum virtual memory size of all tasks in the job. |
    | ntasks | Number of tasks in the job. |

    For all variables see the [SLURM documentation](https://slurm.schedmd.com/sstat.html#SECTION_Job-Status-Fields).

## Job usage accounting
To view the status of past jobs and their usage accounting information, use [`sacct`](https://slurm.schedmd.com/sacct.html):
```
$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
7            biobank_d+ biocloud-+ compute-a+        180  COMPLETED      0:0
7.batch           batch            compute-a+        180  COMPLETED      0:0
7.extern         extern            compute-a+        180  COMPLETED      0:0
8            interacti+ biocloud-+ compute-a+          1     FAILED      2:0
8.extern         extern            compute-a+          1  COMPLETED      0:0
9            interacti+ biocloud-+ compute-a+          1  COMPLETED      0:0
9.extern         extern            compute-a+          1  COMPLETED      0:0
```

There is a huge number of other fields you can show; see the [SLURM docs](https://slurm.schedmd.com/sacct.html#SECTION_Job-Accounting-Fields). If you really want to see everything, dump it to a file with `sacct --long > file.txt`, or else it's too much for the terminal.
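
For a more targeted query, pick the output fields explicitly (a sketch; the job ID is from the example above and the field names come from the SLURM docs):
```
$ sacct -j 7 --format=JobID,JobName,Elapsed,AllocCPUS,MaxRSS,State
```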

# Other commands / FAQ
## How can I get a more detailed overview of the job queue and the requested resources?
```
$ squeue -o "%.18i %Q %.8j %.8u %.2t %.10M %.10L %.6C %m %R"
JOBID PRIORITY NAME USER ST TIME TIME_LEFT CPUS MIN_MEMORY NODELIST(REASON)
    9 4294901751 test jm12em@b PD 0:00 14-00:00:00 40 300G (Resources)
   10 4294901750 minimap2 ksa@bio. PD 0:00 1:00:00 4 40G (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
   11 4294901749 test jm12em@b PD 0:00 14-00:00:00 4 10G (Priority)
    8 4294901752 test jm12em@b R 20:35 13-23:39:25 40 0 bio-oscloud04
```

## Show details about a particular job
```
$ scontrol show job 24
JobId=24 JobName=interactive
   UserId=ksa@bio.aau.dk(101632) GroupId=ksa@bio.aau.dk(101632) MCS_label=N/A
   Priority=4294901738 Nice=0 Account=compute-account QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:02:00 TimeLimit=14-00:00:00 TimeMin=N/A
   SubmitTime=2023-11-01T11:20:01 EligibleTime=2023-11-01T11:20:01
   AccrueTime=Unknown
   StartTime=2023-11-01T11:20:01 EndTime=2023-11-15T11:20:01 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-11-01T11:20:01 Scheduler=Main
   Partition=biocloud-cpu AllocNode:Sid=bio-ospikachu02:340145
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=bio-ospikachu05
   BatchHost=bio-ospikachu05
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=10G,node=1,billing=1
   AllocTRES=cpu=1,mem=10G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/user_data/ksa/projects/slurmtest
   Power=
```

## Show details about the whole cluster configuration
```
$ scontrol show config
```
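
Details about a single node can be shown the same way (node name taken from the `sinfo` example):
```
$ scontrol show node bio-oscloud04
```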