gpulist update
Small update to the GPU list.
Morten-EN committed Mar 4, 2024
1 parent 1cdad3d commit 7fb5b95
Showing 2 changed files with 22 additions and 21 deletions.
10 changes: 6 additions & 4 deletions docs/slurm-admin.md
@@ -1,4 +1,4 @@
# Slurm admin

Table of Contents
=================
@@ -42,7 +42,7 @@ After a crashed node got rebooted, Slurm will not trust it anymore, querying sta
$ sudo scontrol show node a00562
NodeName=a00562 Arch=x86_64 CoresPerSocket=10
...
State=DOWN
...
Reason=Node unexpectedly rebooted
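The continuation of this passage is collapsed in the diff. The usual way to clear this state, assuming the node is actually healthy again, is to mark it as resumed:

    # tell the slurm controller to trust the rebooted node again
    sudo scontrol update NodeName=a00562 State=RESUME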

@@ -66,7 +66,7 @@ When the machines get rebooted, the slurm daemons will also come up automaticall

## Setup

How to build (Ctrl+F "rpm"): https://slurm.schedmd.com/quickstart_admin.html
Use the rpmbuild option!
Install mariadb-devel so we can build slurmdbd with database accounting
On a compute node we need
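For the rpmbuild route mentioned above, a rough sketch (the release version in the tarball name and the exact prerequisite packages are placeholders, not taken from this page):

    # typical build prerequisites on an RPM-based system; mariadb-devel enables slurmdbd accounting
    sudo dnf install -y rpm-build munge-devel pam-devel readline-devel mariadb-devel
    # fetch a release tarball and build all RPMs directly from it
    wget https://download.schedmd.com/slurm/slurm-23.02.7.tar.bz2
    rpmbuild -ta slurm-23.02.7.tar.bz2
    # the finished packages land under ~/rpmbuild/RPMS/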
@@ -85,6 +85,8 @@ There are scripts for installing and upgrading slurm in the github repo.
2. Set paths in `install_slurm.gpu.sh`
3. `./install.slurm.gpu.sh <server-name>`

This has changed. TODO: update this section accordingly.


## Install slurm on cpu node
TODO: make necessary adjustments to `install.slurm.gpu.sh` (maybe just get rid of the gres part)
@@ -107,7 +109,7 @@ Build the new slurm packages. There are upgrade scripts in the github repo.
2. `./upgrade_slurm.sh`

## Usage stats
Use `sreport` and `sacct`.

### Calculate median wait time
Use `sacct` to get stats for 2020
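The actual commands are collapsed below this point in the diff. As a rough sketch, assuming GNU awk (for mktime) and parseable sacct output, wait times for 2020 and their median could be computed like this:

    # Submit and Start timestamps for all 2020 jobs (allocations only, parseable, no header)
    sacct -a -X -n -P -S 2020-01-01 -E 2021-01-01 --format=Submit,Start \
        | awk -F'|' '$2 != "Unknown" && $2 != "None" {
              # wait time in seconds = start - submit
              gsub(/[-T:]/, " ", $1); gsub(/[-T:]/, " ", $2)
              print mktime($2) - mktime($1)
          }' \
        | sort -n \
        | awk '{ a[NR] = $1 } END { if (NR) print "median wait (s):", (NR % 2 ? a[(NR+1)/2] : (a[NR/2] + a[NR/2+1]) / 2) }'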
33 changes: 16 additions & 17 deletions docs/slurm-cluster.md
@@ -61,8 +61,8 @@ in `C:/Users/YOUR_WINDOWS_USER/.ssh/config` (Windows, a simple text file with no
With this in place, you can open a terminal (cmd or PowerShell in Windows) and run

ssh hendrix
This will connect you to a (random) gateway server. Gateway servers are small, relatively weak virtual machines, and each time you log in you can be connected to a different server. As a normal user, you are not able to connect to the compute servers directly. Gateway servers allow you to compile programs or run small evaluation scripts, but anything that requires real compute power must be run on the compute servers via slurm.
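The actual config stanza is in the collapsed part of this hunk. Purely for orientation, a minimal `~/.ssh/config` entry of that kind could look as follows (hostname and username are placeholders, not taken from this page):

    Host hendrix
        # placeholders: substitute the real gateway address and your username
        HostName <gateway-address>
        User <your-username>
        # optional: keep the session alive over VPN
        ServerAliveInterval 60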

## General Information

@@ -73,13 +73,12 @@ The cluster currently hosts one main partition with the following GPU cards (TODO

| Resource-Name | Model | Count | Memory(GB) |
|-----------------|-----------------------------|-------|----------- |
| A100 | Nvidia A100 | 14 | 40 |
| A40 | Nvidia A40 | 10 | 40 |
| titanrtx | Titan RTX + Quadro RTX 6000 | 48 | ?? |
| titanx | Titan X/Xp/V | 24 | ?? |
| testlak40 | Tesla K40 | 2 | ?? |
| testlak20 | Tesla K20 | 1 | ?? |
| gtx1080 | GTX 1080 | 4 | ?? |
| H100 | Nvidia H100 | 4 | 80 |
| A100 | Nvidia A100 | 26 | 80/40 |
| A40 | Nvidia A40 | 14 | 40 |
| titanrtx | Titan RTX + Quadro RTX 6000 | 55 | ?? |
| titanx | Titan X/Xp/V | 15 | ?? |



### Software Modules
@@ -91,11 +90,11 @@ the module package. A package can be loaded via the command
module load python/3.9.9
python3 --version
# prints 3.9.9

The list of all available software modules can be seen via

module avail

The current list of modules includes modern compilers, python versions, anaconda, but also cuda and cudnn.
Modules need to be loaded every time you log in to a server, so it makes sense to store the commands in your `~/.bashrc`.
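For example, using the module from above (adjust to whatever `module avail` lists for you):

    # load the module automatically on every login (append once to ~/.bashrc)
    echo 'module load python/3.9.9' >> ~/.bashrc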

@@ -116,7 +115,7 @@ Note that you need to mount ERDA directories on the machines that the job is sub
if [ -f "$key" ]
then
mkdir -p ${mnt}
sshfs ${user}@io.erda.dk:${erdadir} ${mnt} -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3 -o IdentityFile=${key}
else
echo "'${key}' is not an ssh key"
fi
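Unmounting is not covered in this snippet; a small companion sketch, assuming fusermount is available on the node:

    # release the ERDA mount again once the job no longer needs it
    fusermount -u ${mnt}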
@@ -144,7 +143,7 @@ Once you are in the university network (cable or VPN, see [Getting Access](#gett

scp -r my_file1 my_file2 my_folder/ hendrix:~/Dir

Or, you can use any sftp client.

### ssh-tunnelling-and-port-forwarding
(TODO: this is not updated for hendrix; some of the details likely won't work.)
@@ -194,7 +193,7 @@ This can also be used to run interactive jupyter notebooks. We can launch an int
[I 12:27:30.597 NotebookApp] or http://127.0.0.1:15000/?token=d305ab86adaf9c96bf4e44611c2253a1c7da6ec9e61557c4
[I 12:27:30.597 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 12:27:30.614 NotebookApp]

To access the notebook, open this file in a browser:
file:///home/xyz123/.local/share/jupyter/runtime/nbserver-5918-open.html
Or copy and paste one of these URLs:
@@ -204,7 +203,7 @@ This can also be used to run interactive jupyter notebooks. We can launch an int
The jupyter server is now running. To connect to it using the browser on your local machine, you need to use local port forwarding and connect to the correct compute node (e.g. gpu02-diku-image in our example):

localuser@localmachine> ssh -N -L 15000:127.0.0.1:15000 gpu02-diku-image
xyz123@gpu02-diku-image's password:

While this connection persists in the background we can access the jupyter server using the URL from above:

@@ -225,7 +224,7 @@ Remember to shut down the jupyter server once you are done and exit your login s
[I 12:44:25.233 NotebookApp] Shutting down 0 terminals
(my_tf_env) [xyz123@gpu02-diku-image ~]$ exit
exit
[xyz123@a00552 ~]$
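If the interactive session was started through Slurm (srun/salloc), the allocation can also be released explicitly; a hedged sketch, with the job id as a placeholder:

    # find the job id of the interactive session and release the allocation
    squeue -u $USER
    scancel <jobid>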

A few words of caution:

@@ -302,7 +301,7 @@ In the script the number of cores is restricted to 4 for each task in the array,
#SBATCH --cpus-per-task=4
# max run time is 24 hours
#SBATCH --time=24:00:00

python experiment.py ${SLURM_ARRAY_TASK_ID}
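Assuming the script above is saved as, say, `array_job.sh` and the array range is given at submission time (both assumptions, not stated in this excerpt), submission could look like:

    # submit tasks 0..9; each task sees its own SLURM_ARRAY_TASK_ID
    sbatch --array=0-9 array_job.sh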


