Commit

refactor it a bit with menus
KasperSkytte committed Nov 10, 2023
1 parent f10a41b commit 62f7c53
Showing 15 changed files with 260 additions and 255 deletions.
4 changes: 2 additions & 2 deletions docs/access.md
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@
## Introduction
SSH (Secure Shell) is a widely used protocol for securely accessing remote Linux servers and is the primary way to access the BioCloud servers. Connecting through a virtual desktop is sometimes also needed for GUI apps and is possible using X2Go on the `axomamma.srv.aau.dk` server only. This page provides instructions on how to access the BioCloud through SSH using a few different SSH clients and platforms, as well as how to set up the X2Go client for a virtual desktop. There are many other SSH clients available; which one you prefer is entirely up to you. Regardless of the client, everything runs over the SSH protocol (port 22). You authenticate using your AAU email and password (possibly also a second factor), and you must be on the AAU network, either directly or from the outside through VPN or the AAU SSH gateway, both of which are described later under [external access](#external-access).

If you need to run GUI apps like CLC, Arb, RStudio, etc., you need to use `axomamma.srv.aau.dk` as hostname and install and set up the [X2Go client](#access-through-x2go-virtual-desktop). For anything else, you need to learn how to submit SLURM jobs through one of the login-nodes `bio-ospikachu01.srv.aau.dk`, `bio-ospikachu02.srv.aau.dk`, or `bio-ospikachu03.srv.aau.dk`. Preferably add all of them to distribute usage. After successful login, consult the [Slurm guide](slurm/intro.md).

## Access through SSH
A terminal alone is rarely enough, because more often than not you need to edit some scripts to get anything done. Below are instructions on how to connect using a few popular code editors with built-in SSH support, as well as with [just a terminal](#just-a-terminal).
@@ -112,7 +112,7 @@ Host bio-ospikachu03.srv.aau.dk
[SSH public key authentication](https://www.ssh.com/academy/ssh/public-key-authentication) offers a more secure way to connect to a server, and is also more convenient, since you don't have to type in your password every single time you log in or transfer a file. An SSH private key is essentially just a very long password that is used to authenticate with a server holding the cryptographically linked public key for your user (think of it as the lock for the private key). Any SSH client that you choose to use will use a standalone SSH program on your computer under the hood, so this applies to all of them.

#### Generating SSH Key Pairs
This must be done locally for security reasons, so that the private key never leaves your computer. If you use a password manager (please do) like 1Password or Bitwarden, you can usually generate, safely store, and use SSH keys directly from the vault without them lying around in a file. It's important that the key is not generated using the (usually default) RSA algorithm, because it's outdated and can be brute-forced easily with modern hardware; use the `ed25519` algorithm instead.

##### On Linux or macOS:
1. Open your terminal.
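The key generation step can be sketched as follows (the comment string and file name here are placeholders; in practice you would typically accept the default path and, unlike in this non-interactive sketch, set a passphrase):

```shell
# Make sure the .ssh folder exists with the usual permissions
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"

# Generate an ed25519 key pair; -C adds a comment, -f sets the output file,
# and -N "" creates it without a passphrase (do set one in practice)
ssh-keygen -t ed25519 -C "[email protected]" -f "$HOME/.ssh/id_ed25519_biocloud" -N ""

# Print the public key so it can be copied to the server's ~/.ssh/authorized_keys
cat "$HOME/.ssh/id_ed25519_biocloud.pub"
```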
12 changes: 0 additions & 12 deletions docs/guides.md

This file was deleted.

4 changes: 4 additions & 0 deletions docs/guides/codepractices.md
@@ -0,0 +1,4 @@
# Good code practices

- Portability: note software dependencies, distribute environments or containers
- Software and scripts should be like a lab-book. Can't reproduce, can't use.
5 changes: 5 additions & 0 deletions docs/guides/git.md
@@ -0,0 +1,5 @@
# Git and GitHub
- git
- github
- How to get git to work on servers (SSH key)
- building containers (maybe through github actions too)
3 changes: 3 additions & 0 deletions docs/guides/other.md
@@ -0,0 +1,3 @@
# Other
- RSlurm
- How to use common tools
2 changes: 2 additions & 0 deletions docs/guides/snakemake.md
@@ -0,0 +1,2 @@
# Snakemake
Declarative, nice with checkpoints, integrates directly with slurm!
4 changes: 4 additions & 0 deletions docs/ood.md
@@ -0,0 +1,4 @@
# Web portal access

## OpenOndemand
Documentation will be added once OpenOndemand is set up. In the meantime, see https://curc.readthedocs.io/en/latest/gateways/OnDemand.html
28 changes: 28 additions & 0 deletions docs/slurm/intro.md
@@ -0,0 +1,28 @@
# Introduction to SLURM
SLURM (Simple Linux Utility for Resource Management) is a highly flexible and powerful job scheduler for managing and scheduling computational workloads on high-performance computing (HPC) clusters. SLURM is designed to efficiently allocate resources and manage job execution on clusters of any size, from a single server to tens of thousands. SLURM manages resources on an HPC cluster by dividing them into partitions. Users submit jobs to these partitions from a login-node, and the SLURM controller then schedules and allocates resources to those jobs based on available resources and user-defined constraints. SLURM also stores detailed usage information of all users' jobs in a usage accounting database, which allows enforcement of fair-share policies and priorities for job scheduling for each partition. The BioCloud servers are currently divided into two partitions with the same usage policies (currently no limits, first-in-first-out (FIFO)): `biocloud-cpu` for CPU-intensive jobs and `biocloud-gpu` for jobs that benefit from a GPU.

**Overview figure here**
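A job is described to SLURM with a batch script of `#SBATCH` directives followed by the commands to run. The directives below are standard `sbatch` options, but the concrete values (job name, CPU/memory/time numbers, log file name) are made-up examples, not site defaults:

```shell
#!/usr/bin/env bash
#SBATCH --job-name=example        # name shown in squeue
#SBATCH --partition=biocloud-cpu  # one of the two partitions described above
#SBATCH --cpus-per-task=4         # CPUs to allocate
#SBATCH --mem=8G                  # memory to allocate
#SBATCH --time=01:00:00           # max walltime (here 1 hour)
#SBATCH --output=job_%j.log       # stdout+stderr file, %j expands to the job ID

# The actual work goes here
echo "Running on $(hostname) with $SLURM_CPUS_PER_TASK CPUs"
```

Such a script is submitted from a login-node with `sbatch myscript.sh`; submission is covered in more detail on the following pages.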

## Getting an overview
To start with, it's always nice to get an overview of the cluster, its partitions, and how many resources are currently allocated. This is achieved with the `sinfo` command, example output:

```
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
biocloud-cpu* up 14-00:00:0 1 idle bio-oscloud04
```
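For a per-node view with CPU, memory, and state details, `sinfo` can also be run node-oriented (standard `sinfo` flags; the exact columns and output depend on the cluster):

```shell
# -N prints one line per node, -l (--long) adds CPU, memory, and state columns
sinfo -N -l
```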

To get an overview of running jobs use `squeue`, example output:
```
# everything
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
24 biocloud- interact ksa@bio. R 2:15 1 bio-oscloud04
# specific user (usually yourself)
$ squeue -u $(whoami)
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
35 biocloud- interact ksa@bio. R 8:28 1 bio-oscloud04
```

To get live information about the whole cluster, the resource utilization of individual nodes, the number of running SLURM jobs, etc., visit the [Grafana dashboard](http://bio-ospikachu04.srv.aau.dk:3000/).
110 changes: 110 additions & 0 deletions docs/slurm/jobcontrol.md
@@ -0,0 +1,110 @@
# Job control and usage accounting
Below are some nice-to-know commands for controlling and checking up on jobs, both running and past.

## Get job status info
Use [`squeue`](https://slurm.schedmd.com/squeue.html), for example:
```
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
24 biocloud- interact ksa@bio. R 2:15 1 bio-oscloud04
```

??? "Job state codes (ST)"
| Status Code | Explaination |
| --- | --- |
| COMPLETED | CD | The job has completed successfully. |
| COMPLETING | CG | The job is finishing but some processes are still active. |
| FAILED | F | The job terminated with a non-zero exit code and failed to execute. |
| PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
| PREEMPTED | PR | The job was terminated because of preemption by another job. |
| RUNNING | R | The job currently is allocated to a node and is running. |
| SUSPENDED | S | A running job has been stopped with its cores released to other jobs. |
| STOPPED | ST | A running job has been stopped with its cores retained. |

A complete list can be found in SLURM's [documentation](https://slurm.schedmd.com/squeue.html#lbAG)

??? "Job reason codes (REASON )"
| Reason Code | Explaination |
| --- | --- |
| Priority | One or more higher priority jobs is in queue for running. Your job will eventually run. |
| Dependency | This job is waiting for a dependent job to complete and will run afterwards. |
| Resources | The job is waiting for resources to become available and will eventually run. |
| InvalidAccount | The job’s account is invalid. Cancel the job and rerun with correct account. |
| InvaldQoS | The job’s QoS is invalid. Cancel the job and rerun with correct account. |
| QOSGrpCpuLimit | All CPUs assigned to your job’s specified QoS are in use; job will run eventually. |
| QOSGrpMaxJobsLimit | Maximum number of jobs for your job’s QoS have been met; job will run eventually. |
| QOSGrpNodeLimit | All nodes assigned to your job’s specified QoS are in use; job will run eventually. |
| PartitionCpuLimit | All CPUs assigned to your job’s specified partition are in use; job will run eventually. |
| PartitionMaxJobsLimit | Maximum number of jobs for your job’s partition have been met; job will run eventually. |
| PartitionNodeLimit | All nodes assigned to your job’s specified partition are in use; job will run eventually. |
| AssociationCpuLimit | All CPUs assigned to your job’s specified association are in use; job will run eventually. |
| AssociationMaxJobsLimit | Maximum number of jobs for your job’s association have been met; job will run eventually. |
| AssociationNodeLimit | All nodes assigned to your job’s specified association are in use; job will run eventually. |

A complete list can be found in SLURM's [documentation](https://slurm.schedmd.com/squeue.html#lbAF)

## Cancel a job
With `sbatch` you can't just hit CTRL+C to stop what's running like you're used to in a terminal. Instead you must use `scancel`. Get the job ID from `squeue -u $(whoami)`, then use [`scancel`](https://slurm.schedmd.com/scancel.html) to cancel the running job, for example:
```
$ scancel 24
```

If the particular job doesn't stop and doesn't respond, consider using [`skill`](https://slurm.schedmd.com/skill.html) instead.
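A couple of other `scancel` forms can be handy as well (standard `scancel` flags):

```shell
# Cancel all of your own jobs at once
scancel -u "$(whoami)"

# Cancel only your pending jobs, leaving running ones untouched
scancel -u "$(whoami)" --state=PENDING
```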

## Pause or resume a job
Use [`scontrol`](https://slurm.schedmd.com/scontrol.html) to control your own jobs, for example suspend a running job:
```
$ scontrol suspend 24
```

Resume again with
```
$ scontrol resume 24
```
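For jobs that are still pending, the related `hold` and `release` subcommands keep a job from starting at all (standard `scontrol` subcommands, using job ID 24 as in the examples above):

```shell
# Prevent a pending job from being scheduled
scontrol hold 24

# Allow it to be considered for scheduling again
scontrol release 24
```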

## Adjust allocated resources
It's also possible to adjust a running job's allocated resources to free them up for others to use without having to stop anything, for example:
```
$ scontrol update JobId=24 NumNodes=1 NumTasks=1 CPUsPerTask=1
```

## Job status information
Use [`sstat`](https://slurm.schedmd.com/sstat.html) to show the status and usage accounting information of running jobs, for example:
```
$ sstat --jobs=24
```

With additional details:
```
$ sstat --jobs=24 --format=jobid,avecpu,maxrss,ntasks
```

??? "Useful format variables"
| Variable | Description |
| --- | --- |
| avecpu | Average CPU time of all tasks in job. |
| averss | Average resident set size of all tasks. |
| avevmsize | Average virtual memory of all tasks in a job. |
| jobid | The id of the Job. |
| maxrss | Maximum number of bytes read by all tasks in the job. |
| maxvsize | Maximum number of bytes written by all tasks in the job. |
| ntasks | Number of tasks in a job. |

For all variables see the [SLURM documentation](https://slurm.schedmd.com/sstat.html#SECTION_Job-Status-Fields)

## Job usage accounting
To view the status of past jobs and their usage accounting information use [`sacct`](https://slurm.schedmd.com/sacct.html).
```
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
7 biobank_d+ biocloud-+ compute-a+ 180 COMPLETED 0:0
7.batch batch compute-a+ 180 COMPLETED 0:0
7.extern extern compute-a+ 180 COMPLETED 0:0
8 interacti+ biocloud-+ compute-a+ 1 FAILED 2:0
8.extern extern compute-a+ 1 COMPLETED 0:0
9 interacti+ biocloud-+ compute-a+ 1 COMPLETED 0:0
9.extern extern compute-a+ 1 COMPLETED 0:0
```

There is a huge number of other fields to show, see the [SLURM docs](https://slurm.schedmd.com/sacct.html#SECTION_Job-Accounting-Fields). If you really want to see everything, use `sacct --long > file.txt` to dump it into a file, as the output is far too wide for the terminal.
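As with `sstat`, the columns can be tailored with `--format`. A sketch using fields from SLURM's job accounting fields, with job ID 24 as in the earlier examples:

```shell
# Elapsed walltime, peak memory, and total CPU time for a specific job
sacct -j 24 --format=JobID,JobName,Elapsed,MaxRSS,TotalCPU,State
```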
44 changes: 44 additions & 0 deletions docs/slurm/other.md
@@ -0,0 +1,44 @@
# Other commands / FAQ
## How can I get a more detailed overview of the job queue and the requested resources of each job?
```
$ squeue -o "%.18i %Q %.8j %.8u %.2t %.10M %.10L %.6C %m %R"
JOBID PRIORITY NAME USER ST TIME TIME_LEFT CPUS MIN_MEMORY NODELIST(REASON)
9 4294901751 test jm12em@b PD 0:00 14-00:00:00 40 300G (Resources)
10 4294901750 minimap2 ksa@bio. PD 0:00 1:00:00 4 40G (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
11 4294901749 test jm12em@b PD 0:00 14-00:00:00 4 10G (Priority)
8 4294901752 test jm12em@b R 20:35 13-23:39:25 40 0 bio-oscloud04
```

## Show details about a particular job
```
$ scontrol show job 24
JobId=24 JobName=interactive
[email protected](101632) [email protected](101632) MCS_label=N/A
Priority=4294901738 Nice=0 Account=compute-account QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:02:00 TimeLimit=14-00:00:00 TimeMin=N/A
SubmitTime=2023-11-01T11:20:01 EligibleTime=2023-11-01T11:20:01
AccrueTime=Unknown
StartTime=2023-11-01T11:20:01 EndTime=2023-11-15T11:20:01 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-11-01T11:20:01 Scheduler=Main
Partition=biocloud-cpu AllocNode:Sid=bio-ospikachu02:340145
ReqNodeList=(null) ExcNodeList=(null)
NodeList=bio-ospikachu05
BatchHost=bio-ospikachu05
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
ReqTRES=cpu=1,mem=10G,node=1,billing=1
AllocTRES=cpu=1,mem=10G,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/bin/bash
WorkDir=/user_data/ksa/projects/slurmtest
Power=
```

## Show details about the whole cluster configuration
```
$ scontrol show config
```