gpulist update
Small update to the GPU list.
Morten-EN committed Mar 4, 2024
1 parent 1cdad3d commit 7fb5b95
Showing 2 changed files with 22 additions and 21 deletions.
10 changes: 6 additions & 4 deletions docs/slurm-admin.md
@@ -1,4 +1,4 @@
# Slurm admin

Table of Contents
=================
@@ -42,7 +42,7 @@ After a crashed node got rebooted, Slurm will not trust it anymore, querying sta
$ sudo scontrol show node a00562
NodeName=a00562 Arch=x86_64 CoresPerSocket=10
...
State=DOWN
...
Reason=Node unexpectedly rebooted
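The continuation of this passage is collapsed in the diff. The usual way to clear this state, assuming the node is actually healthy again, is to mark it as resumed:

    # tell the slurm controller to trust the rebooted node again
    sudo scontrol update NodeName=a00562 State=RESUME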

@@ -66,7 +66,7 @@ When the machines get rebooted, the slurm daemons will also come up automaticall

## Setup

How to build (Ctrl+F "rpm"): https://slurm.schedmd.com/quickstart_admin.html
Use the rpmbuild option!
Install mariadb-devel so we can build slurmdbd with database accounting
On a compute node we need
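For the rpmbuild route mentioned above, a rough sketch (the release version in the tarball name and the exact prerequisite packages are placeholders, not taken from this page):

    # typical build prerequisites on an RPM-based system; mariadb-devel enables slurmdbd accounting
    sudo dnf install -y rpm-build munge-devel pam-devel readline-devel mariadb-devel
    # fetch a release tarball and build all RPMs directly from it
    wget https://download.schedmd.com/slurm/slurm-23.02.7.tar.bz2
    rpmbuild -ta slurm-23.02.7.tar.bz2
    # the finished packages land under ~/rpmbuild/RPMS/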
@@ -85,6 +85,8 @@ There are scripts for installing and upgrading slurm in the github repo.
2. Set paths in `install_slurm.gpu.sh`
3. `./install.slurm.gpu.sh <server-name>`

This has changed. TODO: update this section accordingly.


## Install slurm on cpu node
TODO: make necessary adjustments to `install.slurm.gpu.sh` (maybe just get rid of the gres part)
@@ -107,7 +109,7 @@ Build the new slurm packages. There are upgrade scripts in the github repo.
2. `./upgrade_slurm.sh`

## Usage stats
Use `sreport` and `sacct`.

### Calculate median wait time
Use `sacct` to get stats for 2020
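The actual commands are collapsed below this point in the diff. As a rough sketch, assuming GNU awk (for mktime) and parseable sacct output, wait times for 2020 and their median could be computed like this:

    # Submit and Start timestamps for all 2020 jobs (allocations only, parseable, no header)
    sacct -a -X -n -P -S 2020-01-01 -E 2021-01-01 --format=Submit,Start \
        | awk -F'|' '$2 != "Unknown" && $2 != "None" {
              # wait time in seconds = start - submit
              gsub(/[-T:]/, " ", $1); gsub(/[-T:]/, " ", $2)
              print mktime($2) - mktime($1)
          }' \
        | sort -n \
        | awk '{ a[NR] = $1 } END { if (NR) print "median wait (s):", (NR % 2 ? a[(NR+1)/2] : (a[NR/2] + a[NR/2+1]) / 2) }'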
33 changes: 16 additions & 17 deletions docs/slurm-cluster.md
@@ -61,8 +61,8 @@ in `C:/Users/YOUR_WINDOWS_USER/.ssh/config` (Windows, a simple text file with no
With this in place, you can open a terminal (cmd or PowerShell in Windows) and run

ssh hendrix
This will connect you to a (random) gateway server. Gateway servers are small, relatively weak virtual machines, and each time you log in you can be connected to a different server. As a normal user, you are not able to connect to the compute servers directly. Gateway servers allow you to compile programs or run small evaluation scripts, but anything that requires real compute power must be run on the compute servers via slurm.
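The actual config stanza is in the collapsed part of this hunk. Purely for orientation, a minimal `~/.ssh/config` entry of that kind could look as follows (hostname and username are placeholders, not taken from this page):

    Host hendrix
        # placeholders: substitute the real gateway address and your username
        HostName <gateway-address>
        User <your-username>
        # optional: keep the session alive over VPN
        ServerAliveInterval 60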

## General Information

@@ -73,13 +73,12 @@ The cluster currently hosts one main partition with the following GPU cards (TODO

| Resource-Name | Model | Count | Memory(GB) |
|-----------------|-----------------------------|-------|----------- |
| A100 | Nvidia A100 | 14 | 40 |
| A40 | Nvidia A40 | 10 | 40 |
| titanrtx | Titan RTX + Quadro RTX 6000 | 48 | ?? |
| titanx | Titan X/Xp/V | 24 | ?? |
| testlak40 | Tesla K40 | 2 | ?? |
| testlak20 | Tesla K20 | 1 | ?? |
| gtx1080 | GTX 1080 | 4 | ?? |
| H100 | Nvidia H100 | 4 | 80 |
| A100 | Nvidia A100 | 26 | 80/40 |
| A40 | Nvidia A40 | 14 | 40 |
| titanrtx | Titan RTX + Quadro RTX 6000 | 55 | ?? |
| titanx | Titan X/Xp/V | 15 | ?? |



### Software Modules
@@ -91,11 +90,11 @@ the module package. A package can be loaded via the command
module load python/3.9.9
python3 --version
# prints 3.9.9

The list of all available software modules can be seen via

module avail

The current list of modules includes modern compilers, python versions, anaconda, but also cuda and cudnn.
Modules need to be loaded every time you log in to a server, so it makes sense to store the commands in your `~/.bashrc`.
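For example, using the module from above (adjust to whatever `module avail` lists for you):

    # load the module automatically on every login (append once to ~/.bashrc)
    echo 'module load python/3.9.9' >> ~/.bashrc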

@@ -116,7 +115,7 @@ Note that you need to mount ERDA directories on the machines that the job is sub
if [ -f "$key" ]
then
mkdir -p ${mnt}
sshfs ${user}@io.erda.dk:${erdadir} ${mnt} -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3 -o IdentityFile=${key}
else
echo "'${key}' is not an ssh key"
fi
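Unmounting is not covered in this snippet; a small companion sketch, assuming fusermount is available on the node:

    # release the ERDA mount again once the job no longer needs it
    fusermount -u ${mnt}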
@@ -144,7 +143,7 @@ Once you are in the university network (cable or VPN, see [Getting Access](#gett

scp -r my_file1 my_file2 my_folder/ hendrix:~/Dir

Or, you can use any sftp client.

### ssh-tunnelling-and-port-forwarding
(TODO: this is not updated for hendrix; some of the details likely won't work.)
@@ -194,7 +193,7 @@ This can also be used to run interactive jupyter notebooks. We can launch an int
[I 12:27:30.597 NotebookApp] or http://127.0.0.1:15000/?token=d305ab86adaf9c96bf4e44611c2253a1c7da6ec9e61557c4
[I 12:27:30.597 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 12:27:30.614 NotebookApp]

To access the notebook, open this file in a browser:
file:///home/xyz123/.local/share/jupyter/runtime/nbserver-5918-open.html
Or copy and paste one of these URLs:
@@ -204,7 +203,7 @@ This can also be used to run interactive jupyter notebooks. We can launch an int
The jupyter server is now running. To connect to it using the browser on your local machine, you need to use local port forwarding and connect to the correct compute node (e.g. gpu02-diku-image in our example):

localuser@localmachine> ssh -N -L 15000:127.0.0.1:15000 gpu02-diku-image
xyz123@gpu02-diku-image's password:

While this connection persists in the background we can access the jupyter server using the URL from above:

@@ -225,7 +224,7 @@ Remember to shut down the jupyter server once you are done and exit your login s
[I 12:44:25.233 NotebookApp] Shutting down 0 terminals
(my_tf_env) [xyz123@gpu02-diku-image ~]$ exit
exit
[xyz123@a00552 ~]$
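If the interactive session was started through Slurm (srun/salloc), the allocation can also be released explicitly; a hedged sketch, with the job id as a placeholder:

    # find the job id of the interactive session and release the allocation
    squeue -u $USER
    scancel <jobid>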

A few words of caution:

@@ -302,7 +301,7 @@ In the script the number of cores is restricted to 4 for each task in the array,
#SBATCH --cpus-per-task=4
# max run time is 24 hours
#SBATCH --time=24:00:00

python experiment.py ${SLURM_ARRAY_TASK_ID}
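Assuming the script above is saved as, say, `array_job.sh` and the array range is given at submission time (both assumptions, not stated in this excerpt), submission could look like:

    # submit tasks 0..9; each task sees its own SLURM_ARRAY_TASK_ID
    sbatch --array=0-9 array_job.sh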


