# Good code practices

- Portability: note software dependencies and distribute environments or containers (see the sketch below).
- Software and scripts should be like a lab book: if it can't be reproduced, it can't be used.
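As a minimal sketch of the portability point above (assuming conda is the environment manager in use), an environment can be captured and recreated elsewhere:

```
# export the active environment to a file others can use
$ conda env export > environment.yml

# recreate it on another machine
$ conda env create -f environment.yml
```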

# Git and GitHub

- git
- GitHub
- How to get git to work on servers (SSH key, see the sketch below)
- Building containers (maybe through GitHub Actions too)
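A minimal sketch of the usual SSH key setup for a server or GitHub (key type and paths are the common defaults, not project-specific):

```
# generate a key pair on your own machine (accept the default path)
$ ssh-keygen -t ed25519 -C "your@email"

# copy the public key to a server's authorized keys
$ ssh-copy-id user@server

# for GitHub, paste the contents of ~/.ssh/id_ed25519.pub under
# Settings -> SSH and GPG keys, then test the connection
$ ssh -T git@github.com
```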

# Other

- `rslurm`
- How to use common tools

# Snakemake
Declarative, works nicely with checkpoints, and integrates directly with SLURM (see the sketch below)!
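A hedged sketch of what the SLURM integration can look like; the exact flags depend on the Snakemake version (newer releases use a SLURM executor plugin instead of `--cluster`), so treat this as an assumption, not a recipe:

```
# submit each rule as its own SLURM job, at most 10 in flight
$ snakemake --jobs 10 \
    --cluster "sbatch --partition=biocloud-cpu --cpus-per-task={threads} --mem={resources.mem_mb}"
```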

# Web portal access

## Open OnDemand
Will be documented once set up. See https://curc.readthedocs.io/en/latest/gateways/OnDemand.html

# Introduction to SLURM
SLURM (Simple Linux Utility for Resource Management) is a highly flexible and powerful job scheduler for managing and scheduling computational workloads on high-performance computing (HPC) clusters. SLURM is designed to efficiently allocate resources and manage job execution on clusters of any size, from a single server to tens of thousands of nodes. SLURM manages resources on an HPC cluster by dividing them into partitions. Users submit jobs to these partitions from a login node, and the SLURM controller then schedules and allocates resources to those jobs based on available resources and user-defined constraints. SLURM also stores detailed usage information about all users' jobs in an accounting database, which allows fair-share policies and scheduling priorities to be enforced per partition.

The BioCloud servers are currently divided into two partitions with the same usage policies (currently no limits; FIFO, first-in-first-out): `biocloud-cpu` for CPU-intensive jobs and `biocloud-gpu` for jobs that benefit from a GPU.
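For a first impression, a minimal job script might look like the sketch below (the job name and resource values are arbitrary examples); it is submitted to a partition from a login node with `sbatch`:

```
#!/usr/bin/env bash
#SBATCH --job-name=example
#SBATCH --partition=biocloud-cpu
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=0-01:00:00

# the actual work goes here
echo "Running on $(hostname)"
```

Submit it with `sbatch example.sh` and follow it in the queue with `squeue`, as shown below.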

**Overview figure here**

## Getting an overview
To start with, it's always nice to get an overview of the cluster, its partitions, and how many resources are currently allocated. This is achieved with the `sinfo` command, example output:

```
$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
biocloud-cpu*    up 14-00:00:0      1   idle bio-oscloud04
```

To get an overview of running jobs use `squeue`, example output:
```
# everything
$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   24 biocloud- interact ksa@bio.  R  2:15     1 bio-oscloud04

# specific user (usually yourself)
$ squeue -u $(whoami)
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   35 biocloud- interact ksa@bio.  R  8:28     1 bio-oscloud04
```

To get live information about the whole cluster, the resource utilization of individual nodes, the number of running SLURM jobs, etc., visit the [Grafana dashboard](http://bio-ospikachu04.srv.aau.dk:3000/).

# Job control and usage accounting
Below are some nice-to-know commands for controlling and checking up on jobs, both current and past.

## Get job status info
Use [`squeue`](https://slurm.schedmd.com/squeue.html), for example:
```
$ squeue
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
   24 biocloud- interact ksa@bio.  R  2:15     1 bio-oscloud04
```

??? "Job state codes (ST)"
    | Status | Code | Explanation |
    | --- | --- | --- |
    | COMPLETED | CD | The job has completed successfully. |
    | COMPLETING | CG | The job is finishing, but some processes are still active. |
    | FAILED | F | The job terminated with a non-zero exit code and failed to execute. |
    | PENDING | PD | The job is waiting for resource allocation. It will eventually run. |
    | PREEMPTED | PR | The job was terminated because of preemption by another job. |
    | RUNNING | R | The job is currently allocated to a node and running. |
    | SUSPENDED | S | A running job has been stopped, with its cores released to other jobs. |
    | STOPPED | ST | A running job has been stopped, with its cores retained. |

    A complete list can be found in SLURM's [documentation](https://slurm.schedmd.com/squeue.html#lbAG).

??? "Job reason codes (REASON)"
    | Reason Code | Explanation |
    | --- | --- |
    | Priority | One or more higher-priority jobs are queued ahead of yours. Your job will eventually run. |
    | Dependency | This job is waiting for a dependent job to complete and will run afterwards. |
    | Resources | The job is waiting for resources to become available and will eventually run. |
    | InvalidAccount | The job's account is invalid. Cancel the job and rerun with a correct account. |
    | InvalidQoS | The job's QoS is invalid. Cancel the job and rerun with a correct QoS. |
    | QOSGrpCpuLimit | All CPUs assigned to your job's specified QoS are in use; the job will run eventually. |
    | QOSGrpMaxJobsLimit | The maximum number of jobs for your job's QoS has been met; the job will run eventually. |
    | QOSGrpNodeLimit | All nodes assigned to your job's specified QoS are in use; the job will run eventually. |
    | PartitionCpuLimit | All CPUs assigned to your job's specified partition are in use; the job will run eventually. |
    | PartitionMaxJobsLimit | The maximum number of jobs for your job's partition has been met; the job will run eventually. |
    | PartitionNodeLimit | All nodes assigned to your job's specified partition are in use; the job will run eventually. |
    | AssociationCpuLimit | All CPUs assigned to your job's specified association are in use; the job will run eventually. |
    | AssociationMaxJobsLimit | The maximum number of jobs for your job's association has been met; the job will run eventually. |
    | AssociationNodeLimit | All nodes assigned to your job's specified association are in use; the job will run eventually. |

    A complete list can be found in SLURM's [documentation](https://slurm.schedmd.com/squeue.html#lbAF).

## Cancel a job
With `sbatch` you can't just hit CTRL+C to stop what's running like you're used to in a terminal. Instead, you must use [`scancel`](https://slurm.schedmd.com/scancel.html). Get the job ID from `squeue -u $(whoami)`, then cancel the job, for example:
```
$ scancel 24
```
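
To cancel all of your own jobs at once (a small sketch; `-u` filters by user):
```
$ scancel -u $(whoami)
```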

If the job doesn't stop or respond, consider using [`skill`](https://slurm.schedmd.com/skill.html) instead.

## Pause or resume a job
Use [`scontrol`](https://slurm.schedmd.com/scontrol.html) to control your own jobs, for example to suspend a running job:
```
$ scontrol suspend 24
```

Resume it again with:
```
$ scontrol resume 24
```

## Adjust allocated resources
It's also possible to adjust the resources allocated to a running job, freeing them up for others to use without having to stop anything, for example:
```
$ scontrol update JobId=24 NumNodes=1 NumTasks=1 CPUsPerTask=1
```
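
You can then verify that the change took effect with standard shell tools, for example:
```
$ scontrol show job 24 | grep -E "NumCPUs|TRES"
```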

## Job status information
Use [`sstat`](https://slurm.schedmd.com/sstat.html) to show status and usage accounting information for your running jobs:
```
$ sstat
```

With additional details:
```
$ sstat --jobs=your_job-id --format=jobid,avecpu,maxrss,ntasks
```

??? "Useful format variables"
    | Variable | Description |
    | --- | --- |
    | avecpu | Average CPU time of all tasks in the job. |
    | averss | Average resident set size of all tasks in the job. |
    | avevmsize | Average virtual memory size of all tasks in the job. |
    | jobid | The ID of the job. |
    | maxrss | Maximum resident set size of all tasks in the job. |
    | maxvmsize | Maximum virtual memory size of all tasks in the job. |
    | ntasks | Number of tasks in the job. |

    For all variables see the [SLURM documentation](https://slurm.schedmd.com/sstat.html#SECTION_Job-Status-Fields).

## Job usage accounting
To view the status of past jobs and their usage accounting information, use [`sacct`](https://slurm.schedmd.com/sacct.html):
```
$ sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
7            biobank_d+ biocloud-+ compute-a+        180  COMPLETED      0:0
7.batch           batch            compute-a+        180  COMPLETED      0:0
7.extern         extern            compute-a+        180  COMPLETED      0:0
8            interacti+ biocloud-+ compute-a+          1     FAILED      2:0
8.extern         extern            compute-a+          1  COMPLETED      0:0
9            interacti+ biocloud-+ compute-a+          1  COMPLETED      0:0
9.extern         extern            compute-a+          1  COMPLETED      0:0
```

There is a huge number of other fields you can show; see the [SLURM docs](https://slurm.schedmd.com/sacct.html#SECTION_Job-Accounting-Fields). If you really want to see everything, dump it to a file with `sacct --long > file.txt`, or else it's too much for the terminal.
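
For a more targeted query, pick the output fields explicitly (a sketch; the job ID is from the example above and the field names come from the SLURM docs):
```
$ sacct -j 7 --format=JobID,JobName,Elapsed,AllocCPUS,MaxRSS,State
```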

# Other commands / FAQ
## How can I get a more detailed overview of the job queue and the requested resources?
```
$ squeue -o "%.18i %Q %.8j %.8u %.2t %.10M %.10L %.6C %m %R"
JOBID PRIORITY NAME USER ST TIME TIME_LEFT CPUS MIN_MEMORY NODELIST(REASON)
    9 4294901751 test jm12em@b PD 0:00 14-00:00:00 40 300G (Resources)
   10 4294901750 minimap2 ksa@bio. PD 0:00 1:00:00 4 40G (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
   11 4294901749 test jm12em@b PD 0:00 14-00:00:00 4 10G (Priority)
    8 4294901752 test jm12em@b R 20:35 13-23:39:25 40 0 bio-oscloud04
```

## Show details about a particular job
```
$ scontrol show job 24
JobId=24 JobName=interactive
   UserId=ksa@bio.aau.dk(101632) GroupId=ksa@bio.aau.dk(101632) MCS_label=N/A
   Priority=4294901738 Nice=0 Account=compute-account QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:02:00 TimeLimit=14-00:00:00 TimeMin=N/A
   SubmitTime=2023-11-01T11:20:01 EligibleTime=2023-11-01T11:20:01
   AccrueTime=Unknown
   StartTime=2023-11-01T11:20:01 EndTime=2023-11-15T11:20:01 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-11-01T11:20:01 Scheduler=Main
   Partition=biocloud-cpu AllocNode:Sid=bio-ospikachu02:340145
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=bio-ospikachu05
   BatchHost=bio-ospikachu05
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=10G,node=1,billing=1
   AllocTRES=cpu=1,mem=10G,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/bin/bash
   WorkDir=/user_data/ksa/projects/slurmtest
   Power=
```

## Show details about the whole cluster configuration
```
$ scontrol show config
```
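
Details about a single node can be shown the same way (node name taken from the `sinfo` example):
```
$ scontrol show node bio-oscloud04
```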