remove more outdated SLURM access restrictions (#124)
sellth authored Mar 15, 2024
1 parent bfbdf0f commit fd7add1
Showing 3 changed files with 20 additions and 25 deletions.
28 changes: 10 additions & 18 deletions bih-cluster/docs/how-to/connect/gpu-nodes.md
@@ -6,34 +6,24 @@ Connecting to a node with GPUs is easy.
You simply request a GPU using the `--gres=gpu:$CARD:COUNT` (for `CARD=tesla` or `CARD=a40`) argument to `srun` or `sbatch`.
This will automatically place your job in the `gpu` partition (which is where the GPU nodes live) and allocate `COUNT` GPUs to your job.
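
For illustration, a minimal batch script requesting two A40 GPUs might look like the sketch below (the job name, time limit, and final command are placeholders; adjust them to your workload):

```bash
#!/bin/bash
#SBATCH --job-name=gpu-example     # placeholder job name
#SBATCH --partition=gpu            # naming the partition explicitly is optional but can help (see the note below)
#SBATCH --gres=gpu:a40:2           # request two A40 GPUs (use gpu:tesla:2 for Tesla cards)
#SBATCH --time=04:00:00            # adjust to your workload

# nvidia-smi lists the GPUs that were actually allocated to this job
nvidia-smi
```

Submit it with `sbatch` and check the job log to confirm the GPUs were granted.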

!!! note

Recently, `--gres=gpu:tesla:COUNT` was often not able to allocate the right partition on its own.
If scheduling a GPU fails, consider additionally indicating the GPU partition explicitly with `--partition gpu` (or `#SBATCH --partition gpu` in your batch file).

!!! hint
!!! info
Fair use rules apply.
As GPU nodes are a limited resource, excessive use by single users is prohibited and can lead to mitigating actions.
Be nice and cooperative with other users.
Tip: `getent passwd USER_NAME` will give you a user's contact details.

Make sure to read the FAQ entry "[I have problems connecting to the GPU node! What's wrong?](../../help/faq.md#i-have-problems-connecting-to-the-gpu-node-whats-wrong)".

!!! important "Interactive Use of GPU Nodes is Discouraged"
!!! warning "Interactive Use of GPU Nodes is Discouraged"

While interactive computation on the GPU nodes is convenient, it makes it very easy to forget about a job once your computation is complete and leave it running idle.
While your job is allocated, it blocks the **allocated** GPUs so that other users cannot use them, even though you might not actually be using them.
Please prefer batch jobs over interactive jobs for your GPU work.

Further, interactive GPU jobs are currently limited to 24 hours.
Furthermore, interactive GPU jobs are currently limited to 24 hours.
We will monitor the situation and adjust that limit to optimize GPU usage and usability.

!!! important "Allocation of GPUs through Slurm is mandatory"

In other words: using GPUs from SSH sessions is prohibited.
Please also note that allocation of GPUs through Slurm is mandatory, in other words: Using GPUs via SSH sessions is prohibited.
The scheduler is not aware of manually allocated GPUs and this interferes with other users' jobs.

## Prerequisites

You have to register with [[email protected]](mailto:[email protected]) to request access.
Afterwards, you can connect to the GPU nodes as shown below.

## Preparation

We will set up a Miniconda installation with `pytorch` to test the GPU.
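
As a quick sanity check once the (collapsed) setup below is done, you can verify from an interactive GPU job that PyTorch sees the CUDA device (a sketch, assuming the conda environment with `pytorch` is activated):

```bash
# allocate a GPU node interactively (Tesla V100 here; use gpu:a40:1 for A40 cards)
srun --mem=10g --partition=gpu --gres=gpu:tesla:1 --pty bash -i

# with the pytorch environment activated, this should print "True" on a GPU node
python -c 'import torch; print(torch.cuda.is_available())'
```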
@@ -96,6 +86,8 @@ True
Recently, `--gres=gpu:tesla:COUNT` was often not able to allocate the right partition on its own.
If scheduling a GPU fails, consider additionally indicating the GPU partition explicitly with `--partition gpu` (or `#SBATCH --partition gpu` in your batch file).

Also make sure to read the FAQ entry "[I have problems connecting to the GPU node! What's wrong?](../../help/faq.md#i-have-problems-connecting-to-the-gpu-node-whats-wrong)" if you encounter problems.

## Bonus #1: Who is using the GPUs?

Use `squeue` to find out about currently queued jobs (the `egrep` only keeps the header and entries in the `gpu` partition).
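
The collapsed block below shows the exact call used in the docs; as a rough sketch, something like the following keeps the header plus the `gpu` entries:

```bash
# the squeue header line contains "PARTITION", so this keeps the header
# and all jobs currently in the gpu partition
squeue | egrep 'PARTITION|gpu'
```
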
15 changes: 9 additions & 6 deletions bih-cluster/docs/how-to/connect/high-memory.md
@@ -1,11 +1,14 @@
# How-To: Connect to High-Memory Nodes

## Prerequisites

You have to register with [[email protected]](mailto:[email protected]) to request access.

Afterwards, you can connect to the high-memory nodes using the `highmem` SLURM partition (see below).
Jobs allocating more than 200 GB of RAM should be routed automatically to the `highmem` nodes.
The cluster has 4 high-memory nodes with 1.5 TB of RAM.
You can connect to these nodes using the `highmem` SLURM partition (see below).
Jobs allocating more than 200 GB of RAM are automatically routed to the `highmem` nodes.
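
For example, an interactive session on a high-memory node could be requested like the sketch below (memory amount and time limit are placeholders; adjust them to your workload):

```bash
# requesting more than 200 GB routes the job to the highmem nodes;
# naming the partition explicitly does no harm
srun --partition=highmem --mem=400G --time=08:00:00 --pty bash -i
```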

!!! info
Fair use rules apply.
As high-memory nodes are a limited resource, excessive use by single users is prohibited and can lead to mitigating actions.
Be nice and cooperative with other users.
Tip: `getent passwd USER_NAME` will give you a user's contact details.

## How-To

2 changes: 1 addition & 1 deletion bih-cluster/docs/how-to/software/tensorflow.md
@@ -6,7 +6,7 @@ You can find the original TensorFlow installation instructions [here](https://ww
This article describes how to set up TensorFlow with GPU support using Conda.
This how-to assumes that you have just connected to a GPU node via `srun --mem=10g --partition=gpu --gres=gpu:tesla:1 --pty bash -i` (for Tesla V100 GPUs; for A40 GPUs use `--gres=gpu:a40:1`).
Note that you will need to allocate "enough" memory, otherwise your Python session will be `Killed` because of too little memory.
You should read the [How-To: Connect to GPU Nodes](../../how-to/connect/gpu-nodes/) tutorial for an explanation of how to do this and to learn how to register for GPU usage.
You should read the [How-To: Connect to GPU Nodes](../../how-to/connect/gpu-nodes/) tutorial for an explanation of how to do this.

This tutorial assumes that conda has been set up as described in [Software Management](../../best-practice/software-installation-with-conda.md).
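
Once TensorFlow is installed in a Conda environment as described in the (collapsed) steps below, a quick way to confirm GPU support is the following check (a sketch, assuming the environment is activated inside the allocated GPU job):

```bash
# this should list at least one PhysicalDevice of type GPU on a GPU node
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```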

