From fd7add18e7e24170e1a1ed42e4380fef6e344dad Mon Sep 17 00:00:00 2001
From: sellth
Date: Fri, 15 Mar 2024 17:05:30 +0100
Subject: [PATCH] remove more outdated SLURM access restrictions (#124)

---
 bih-cluster/docs/how-to/connect/gpu-nodes.md  | 28 +++++++------------
 .../docs/how-to/connect/high-memory.md        | 15 ++++++----
 .../docs/how-to/software/tensorflow.md        |  2 +-
 3 files changed, 20 insertions(+), 25 deletions(-)

diff --git a/bih-cluster/docs/how-to/connect/gpu-nodes.md b/bih-cluster/docs/how-to/connect/gpu-nodes.md
index 4b0466e72..a105a51b9 100644
--- a/bih-cluster/docs/how-to/connect/gpu-nodes.md
+++ b/bih-cluster/docs/how-to/connect/gpu-nodes.md
@@ -6,34 +6,24 @@ Connecting to a node with GPUs is easy.
 You simply request a GPU using the `--gres=gpu:$CARD:COUNT` (for `CARD=tesla` or `CARD=a40`) argument to `srun` and `batch`.
 This will automatically place your job in the `gpu` partition (which is where the GPU nodes live) and allocate a number of `COUNT` GPUs to your job.
 
-!!! note
-
-    Recently, `--gres=gpu:tesla:COUNT` was often not able to allocate the right partion on it's own.
-    If scheduling a GPU fails, consider additionally indicating the GPU partion explicitely with `--partition gpu` (or `#SBATCH --partition gpu` in batch file).
-
-!!! hint
+!!! info
+    Fair use rules apply.
+    As GPU nodes are a limited resource, excessive use by single users is prohibited and can lead to mitigating actions.
+    Be nice and cooperative with other users.
+    Tip: `getent passwd USER_NAME` will give you a user's contact details.
 
-    Make sure to read the FAQ entry "[I have problems connecting to the GPU node! What's wrong?](../../help/faq.md#i-have-problems-connecting-to-the-gpu-node-whats-wrong)".
-
-!!! important "Interactive Use of GPU Nodes is Discouraged"
+!!! warning "Interactive Use of GPU Nodes is Discouraged"
 
     While interactive computation on the GPU nodes is convenient, it makes it very easy to forget a job after your computation is complete and let it run idle.
     While your job is allocated, it blocks the **allocated** GPUs and other users cannot use them although you might not be actually using them.
     Please prefer batch jobs for your GPU jobs over interactive jobs.
 
-    Further, interactive GPU jobs are currently limited to 24 hours.
+    Furthermore, interactive GPU jobs are currently limited to 24 hours.
     We will monitor the situation and adjust that limit to optimize GPU usage and usability.
-
-!!! important "Allocation of GPUs through Slurm is mandatory"
-    In other word: using GPUs from SSH sessions is prohibited.
+    Please also note that allocation of GPUs through Slurm is mandatory, in other words: Using GPUs via SSH sessions is prohibited.
     The scheduler is not aware of manually allocated GPUs and this interferes with other users' jobs.
 
-## Prequisites
-
-You have to register with [hpc-helpdesk@bih-charite.de](mailto:hpc-helpdesk@bih-charite.de) for requesting access.
-Afterwards, you can connect to the GPU nodes as shown below.
-
 ## Preparation
 
 We will setup a miniconda installation with `pytorch` testing the GPU.
 
@@ -96,6 +86,8 @@ True
     Recently, `--gres=gpu:tesla:COUNT` was often not able to allocate the right partion on it's own.
     If scheduling a GPU fails, consider additionally indicating the GPU partion explicitely with `--partition gpu` (or `#SBATCH --partition gpu` in batch file).
 
+    Also make sure to read the FAQ entry "[I have problems connecting to the GPU node! What's wrong?](../../help/faq.md#i-have-problems-connecting-to-the-gpu-node-whats-wrong)" if you encounter problems.
+
 ## Bonus #1: Who is using the GPUs?
 
 Use `squeue` to find out about currently queued jobs (the `egrep` only keeps the header and entries in the `gpu` partition).
diff --git a/bih-cluster/docs/how-to/connect/high-memory.md b/bih-cluster/docs/how-to/connect/high-memory.md
index 81926c097..9fa3c3df8 100644
--- a/bih-cluster/docs/how-to/connect/high-memory.md
+++ b/bih-cluster/docs/how-to/connect/high-memory.md
@@ -1,11 +1,14 @@
 # How-To: Connect to High-Memory Nodes
 
-## Prequisites
-
-You have to register with [hpc-helpdesk@bih-charite.de](mailto:hpc-helpdesk@bih-charite.de) for requesting access.
-
-Afterwards, you can connect to the High-Memory using the `highmem` SLURM partition (see below).
-Jobs allocating more than 200GB of RAM should be routed automatically to the `highmem` nodes.
+The cluster has 4 high-memory nodes with 1.5 TB of RAM.
+You can connect to these nodes using the `highmem` SLURM partition (see below).
+Jobs allocating more than 200 GB of RAM are automatically routed to the `highmem` nodes.
+
+!!! info
+    Fair use rules apply.
+    As high-memory nodes are a limited resource, excessive use by single users is prohibited and can lead to mitigating actions.
+    Be nice and cooperative with other users.
+    Tip: `getent passwd USER_NAME` will give you a user's contact details.
 
 ## How-To
 
diff --git a/bih-cluster/docs/how-to/software/tensorflow.md b/bih-cluster/docs/how-to/software/tensorflow.md
index d2f94b73c..9aa9e6b9b 100644
--- a/bih-cluster/docs/how-to/software/tensorflow.md
+++ b/bih-cluster/docs/how-to/software/tensorflow.md
@@ -6,7 +6,7 @@ You can find the original TensorFlow installation instructions [here](https://ww
 This article describes how to set up TensorFlow with GPU support using Conda.
 This how-to assumes that you have just connected to a GPU node via `srun --mem=10g --partition=gpu --gres=gpu:tesla:1 --pty bash -i` (for Tesla V100 GPUs, for A400 GPUs use `--gres=gpu:a40:1`).
 Note that you will need to allocate "enough" memory, otherwise your python session will be `Killed` because of too little memory.
-You should read the [How-To: Connect to GPU Nodes](../../how-to/connect/gpu-nodes/) tutorial on an explanation of how to do this and to learn how to register for GPU usage.
+You should read the [How-To: Connect to GPU Nodes](../../how-to/connect/gpu-nodes/) tutorial on an explanation of how to do this.
 This tutorial assumes, that conda has been set up as described in [Software Management]((../../best-practice/software-installation-with-conda.md).