fair-share+misc

KasperSkytte committed Dec 11, 2024
1 parent 0a0e5ad commit c7f6176
Showing 4 changed files with 85 additions and 25 deletions.
65 changes: 50 additions & 15 deletions docs/slurm/accounting.md
All users belong to an account (usually their PI) where all usage is tracked on a per-user basis, but limits and priorities can be set at several levels: cluster, partition, account, user, or QOS. User associations with accounts rarely change, so to temporarily request additional resources or obtain higher priority for certain projects, users can instead submit to a different SLURM "Quality of Service" (QOS). By default, all users can only submit jobs to the `normal` QOS, with equal resource limits and base priority for everyone. Occasionally users may instead submit to the `highprio` QOS, which has extra resources and higher priority (usage is therefore also billed 2x); however, this must first be discussed among the owners of the hardware (PIs), and an administrator must then grant your user permission to submit jobs to it.

## Job priority
When a job is submitted, a priority value is calculated based on several factors, where a higher number indicates a higher priority in the queue. This does not impact running jobs, and the effect of prioritization is only noticeable when the cluster is operating near peak capacity, or when the hardware partition to which the job has been submitted is nearly fully allocated. Otherwise jobs will usually start immediately as long as resources are available and you haven't reached the maximum CPUs per user limit.

Different weights are given to the individual priority factors, where the most significant ones are the account fair-share factor (described in more detail below) and the QOS, as mentioned above. All factors are normalized to a value between 0 and 1, then weighted by an adjustable scalar, which may be tuned occasionally depending on the overall cluster usage. Users can also be nice to other users and reduce the priority of their own jobs by setting a "nice" value with `--nice`, for example when submitting less time-critical jobs. Job priorities are then calculated according to the following formula:

```
Job_priority =
(PriorityWeightQOS) * (QOS_factor) +
(PriorityWeightAssoc) * (assoc_factor) +
(PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) -
	nice_factor
```

The weight assigned to each priority factor can be shown with `sprio -w`:

```
$ sprio -w
Weights 1 5 10 3 10
```
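
If you want to lower the priority of your own less time-critical jobs, as mentioned above, you can for example pass a nice value when submitting (the script name is just a placeholder):

```
$ sbatch --nice=1000 my_jobscript.sh
```

The nice value is subtracted from the calculated job priority, so a higher value means a lower priority.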

The priority of pending jobs can be obtained using `sprio`, and is also shown in the job queue when running `squeue`.

The age and job size factors are important to avoid large jobs getting stuck in the queue for a long time simply because smaller jobs fit in everywhere much more easily. The age factor maxes out at `1.0` once a job has accrued 3 days of queue time. The job size factor is directly proportional to both the requested amount of resources and the time limit.

### The fair-share factor
As the name implies, the fair-share factor is used to ensure that users within each account get their fair share of computing resources over time. Because the individual research groups have contributed different amounts of hardware to the cluster, the overall share of computing resources made available to each of them should match accordingly. Secondly, the resource usage of individual users within each account must also be considered, so that users who have vastly overused their share within an account don't retain the highest priority. The goal of the fair-share factor is to balance the usage of all users by adjusting job priorities, so that everyone can use their fair share of computing resources over time. The fair-share factor is calculated according to the [fair-tree algorithm](https://slurm.schedmd.com/archive/slurm-23.02.6/fair_tree.html), which is an integrated part of the SLURM scheduler. It has been configured with a usage decay half-life of 1 week, and usage is completely reset on the first day of each month.

To see the current fair-share factor for your user and the shares assigned to each account, run `sshare`:

```
$ sshare
Account User RawShares NormShares RawUsage EffectvUsage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root 0.000000 611505779 1.000000
root [email protected]+ 1 0.000549 471482 0.000771 0.036649
ao 25 0.013736 0 0.000000
jln 256 0.140659 110806504 0.181204
kln 25 0.013736 66477364 0.108712
kt 25 0.013736 6432056 0.010518
ma 608 0.334066 270531074 0.442397
md 243 0.133516 49291395 0.080607
mms 96 0.052747 35128287 0.057446
mto 25 0.013736 0 0.000000
ndj 25 0.013736 666427 0.001090
phn 365 0.200549 42044352 0.068756
pk 25 0.013736 2 0.000000
rw 25 0.013736 29130316 0.047637
sss 25 0.013736 0 0.000000
students 25 0.013736 264480 0.000433
ts 25 0.013736 262027 0.000428
```

- `RawShares`: the amount of "shares" assigned to each account (in our setup simply the number of CPUs each account has contributed to the cluster).
- `NormShares`: the fraction of shares given to each account, normalized to the total shares available across all accounts, e.g. a value of 0.33 means an account has been assigned 33% of all the resources available in the cluster.
- `RawUsage`: the usage of all jobs charged to the account or user. The value decays over time according to the configured usage decay half-life. The `RawUsage` for an account is the sum of the `RawUsage` of each user within the account, and thus indicates which users have contributed the most to the account's overall score.
- `EffectvUsage`: `RawUsage` divided by the **total** `RawUsage` for the cluster, hence the column always sums to `1.0`. `EffectvUsage` is therefore the percentage of the total cluster usage the account has actually used. In the example above, the `ma` account has used `44.23%` of the cluster since the last usage reset.
- `FairShare`: the fair-share score, calculated using the formula `FS = 2^(-EffectvUsage/NormShares)` (see the sketch below this list). The `FairShare` score can be interpreted using the following intervals:
    - 1.0: **Unused**. The account has not run any jobs recently.
    - 0.5 - 1.0: **Under-utilization**. The account is underutilizing its granted share. For example, a value of about 0.71 means the account has underutilized its share 1:2.
    - 0.5: **Average utilization**. On average, the account is using exactly its granted share.
    - 0.0 - 0.5: **Over-utilization**. The account is overusing its granted share. For example, a value of 0.25 means the account has recently overutilized its share 2:1.
    - 0.0: **No share left**. The account is vastly overusing its granted share, and its users will get the lowest possible priority.
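
As a minimal sketch of the formula above, the fair-share score for e.g. the `ma` account can be computed from its `NormShares` and `EffectvUsage` columns (the numbers are copied from the `sshare` output above):

```
$ awk 'BEGIN {
    norm_shares   = 0.334066   # NormShares for the "ma" account
    effectv_usage = 0.442397   # EffectvUsage for the "ma" account
    # FS = 2^(-EffectvUsage/NormShares), roughly 0.40 here
    printf "FairShare = %f\n", 2^(-(effectv_usage / norm_shares))
}'
```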

???+ tip "The fair-share factor and CPU efficiency"
The value of the fair-share factor is calculated based on CPU usage in units of **allocation seconds** and **not** CPU seconds, which is normally the unit used for CPU usage reported by the `sreport` and `sacct` commands. Therefore, this also means that the CPU efficiency of past jobs directly impacts how much actual work can be done by the allocated CPUs for each user within each account before their fair share of resources is consumed for the period.
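
Since fair-share usage is charged in allocation seconds, it can be worth comparing the allocated time with the actually consumed CPU time for past jobs, for example with `sacct` (the job ID is just a placeholder):

```
$ sacct -j 1275180 --format=JobID,AllocCPUS,Elapsed,TotalCPU
```

If `TotalCPU` is much lower than `AllocCPUS` multiplied by `Elapsed`, the job consumed far more of your fair share than the work it actually performed required.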

For more details about job prioritization see the [SLURM documentation](https://slurm.schedmd.com/archive/slurm-23.02.6/priority_multifactor.html) and this [presentation](https://slurm.schedmd.com/SLUG19/Priority_and_Fair_Trees.pdf).

## QOS info and limitations
See all available QOS levels and their limitations:
```
$ sacctmgr show qos format="name,priority,usagefactor,mintres%20,maxtrespu,maxjobspu"
Name Priority UsageFactor MinTRES MaxTRESPU MaxJobsPU
---------- ---------- ----------- -------------------- ------------- ---------
normal 1 1.000000 cpu=1,mem=512M cpu=192 500
highprio 10 2.000000 cpu=1,mem=512M cpu=512 2000
```

See all account associations for your user and the QOSs you are allowed to use:
```
$ sacctmgr list association user=$USER format=account%10s,user%20s,qos%20s
Account User QOS
---------- -------------------- --------------------
root ksa@bio.aau.dk highprio,normal
root abc@bio.aau.dk highprio,normal
```
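
Once your user has been granted access to a QOS, you can submit jobs to it with the `--qos` flag (the script name is just a placeholder):

```
$ sbatch --qos=highprio my_jobscript.sh
```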

### Undergraduate students
29 changes: 27 additions & 2 deletions docs/slurm/jobcontrol.md
# Job control and useful commands
Below are some useful commands for controlling and checking up on jobs, both current and past.

## Overall cluster status
This will normally show colored bars for each partition, which unfortunately don't render here.
```
Cluster allocation summary per partition or individual nodes (-n).
(Numbers are reported in free/allocated/total(OS factor)).
Partition | CPUs | Memory (GB) | GPUs |
========================================================================================================
shared | 1436 196 /1632 (3x) | 2091 268 /2359 |
general | 395 373 /768 | 2970 731 /3701 |
high-mem | 233 199 /432 | 1803 1936 /3739 |
gpu | 24 40 /64 | 29 180 /209 | 1 1 /2
--------------------------------------------------------------------------------------------------------
Total: | 2088 808 /2896 | 6894 3115 /10009 | 1 1 /2
Jobs running/pending/total:
26 / 1 / 27
Use sinfo or squeue to obtain more details.
```

## Get job status info
Use [`squeue`](https://slurm.schedmd.com/archive/slurm-23.02.6/squeue.html), for example:
```
$ squeue
JOBID NAME USER ACCOUNT TIME TIME_LEFT CPU MIN_ME ST PRIO PARTITION NODELIST(REASON)
1275175 RStudioServer user01@bio acc1 0:00 3-00:00:00 32 5G PD 4 general (QOSMaxCpuPerUserLimit)
1275180 sshdbridge user02@bio acc2 7:14 7:52:46 8 40G R 6 general bio-oscloud03
1275170 VirtualDesktop user03@bio acc2 35:54 5:24:06 2 10G R 6 general bio-oscloud05
```

To show only your own jobs, use `squeue --me`. This is used quite often, so `sq` has been made an alias for `squeue --me`. You can for example also append `--partition`, `--nodelist`, `--reservation`, and more to only show the queue for select partitions, nodes, or reservations.
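
For example, to show only your own jobs in a specific partition:

```
$ squeue --me --partition=general
```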
14 changes: 7 additions & 7 deletions docs/slurm/partitions.md
Before submitting a job you must carefully choose the correct hardware partition.
!!! warning "Always detect available CPUs dynamically in scripts and workflows, never hard-code it!"
    It's very important to note that all partitions have a **max memory per CPU** configured, which may result in the scheduler allocating more CPUs for a job than requested until this ratio is satisfied. This ensures that no CPUs end up idle when a compute node is fully allocated in terms of memory, when they could instead have been at work finishing jobs faster. Therefore **NEVER** hardcode the number of threads/cores to use for individual software tools; instead make it a habit to detect it dynamically, e.g. using `$(nproc)`, in scripts and workflows.
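
A minimal example of a job script following this advice (the tool name and its `--threads` option are hypothetical):

```
#!/usr/bin/env bash
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1G

# use the CPU count actually allocated by SLURM, falling back to nproc
threads="${SLURM_CPUS_PER_TASK:-$(nproc)}"
some_tool --threads "$threads" input.fastq
```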

## The `shared` partition
## Partitions
### The `shared` partition
The `shared` partition is overprovisioned by a factor of 3, meaning each CPU can run up to 3 jobs at once. This is ideal for low- to medium-efficiency (<75%), low- to medium-memory, and interactive jobs, as well as I/O-bound jobs that don't fully utilize the allocated CPUs 100% at all times. It's **highly recommended** to use this partition for most CPU-intensive jobs unless you need lots of memory, or have taken extra care [optimizing the CPU efficiency](efficiency.md) of the job(s), in which case you can use the `general` or `high-mem` partitions instead and finish the jobs faster. Due to over-subscription it's actually possible to allocate a total of 1056 CPUs for jobs on this partition, however with less memory. The job time limit for this partition is 7 days.

**Max memory per CPU: 1.5GB**

| Hostname | vCPUs | Memory | Scratch space |
| :--- | :---: | :---: | :---: |
| `bio-oscloud01` | 96 | 0.5 TB | - |
| `bio-oscloud02` | 192 | 1 TB | - |
| `axomamma` | 256 | 1 TB | 3.5 TB |

### The `general` partition
The `general` partition is for high-efficiency jobs only, since the CPUs are not shared among multiple jobs but dedicated to each individual job. You must therefore ensure that they are fully utilized at all times, preferably 75-100%; otherwise please use the `shared` partition instead if the memory (per CPU) is sufficient. The job time limit for this partition is 14 days.

**Max memory per CPU: 5GB**

| Hostname | vCPUs | Memory | Scratch space |
| :--- | :---: | :---: | :---: |
| `bio-oscloud[02-05]` | 192 | 1 TB | - |
| `bio-oscloud06` | 192 | 1 TB | 18TB |

### The `high-mem` partition
The `high-mem` partition is only for high-efficiency jobs that also require lots of memory. Please do not submit anything here that doesn't require at least 5GB per CPU; otherwise use the `general` partition. Like the `general` partition, the CPUs are dedicated to each individual job, so you must ensure that they are fully utilized at all times, preferably 75-100%; otherwise use the `shared` partition instead if the memory (per CPU) is sufficient. The job time limit for this partition is 28 days.

**Max memory per CPU: 10GB**
| `bio-oscloud07` | 240 | 2 TB | - |
| `bio-oscloud08` | 192 | 2 TB | - |

### The `gpu` partition
This partition is ONLY for jobs that require a GPU; please do not submit jobs to it otherwise. There is no max memory per CPU or over-provisioning configured for this partition. The job time limit for this partition is 14 days.

| Hostname | vCPUs | Memory | Scratch space | GPU |
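
To request a GPU you must also explicitly ask for one using `--gres`, for example (whether a specific GPU type must be named depends on the cluster configuration):

```
$ sbatch --partition=gpu --gres=gpu:1 my_jobscript.sh
```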
2 changes: 1 addition & 1 deletion mkdocs.yml
nav:
- slurm/jobsubmission.md
- slurm/partitions.md
- slurm/efficiency.md
- slurm/accounting.md
- slurm/jobcontrol.md
- slurm/other.md
- Software:
- software/conda.md
Expand Down
