typo
KasperSkytte committed Nov 12, 2024
1 parent 18da12a commit ed904bf
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions docs/storage.md
@@ -16,7 +16,7 @@ All nodes are connected to a Ceph network storage cluster through a few differen
There is a max file size of 4TB on the Ceph network cluster, which cannot be increased, [details here](https://docs.ceph.com/en/latest/cephfs/administration/#maximum-file-sizes-and-performance).

## Local scratch space
For jobs requiring heavy I/O, where large amounts of data or thousands of temporary files need to be written, it's important to avoid using the network storage (if possible) and write to a local hard drive instead, which both avoids overloading the network and the storage cluster itself and is also much faster in some cases. When deciding whether you need local scratch space, both the size of the data and the number of files matter; however, the latter is by far the most important, because it has the biggest impact on the overall performance of the storage cluster, [more info below](#avoid-small-files-at-all-costs).
For jobs requiring heavy I/O, where large amounts of data or thousands of temporary files need to be written, it's important to avoid using the network storage (if possible) and write to a local hard drive instead, which both avoids overloading the network and the storage cluster itself and is also much faster in some cases. When deciding whether you need local scratch space, both the size of the data and the number of files matter; however, the latter is by far the most important, because it has the biggest impact on the overall performance of the storage cluster, [more info below](#avoid-numerous-small-files-at-all-costs).

Local scratch space is always mounted at the same mount point on all machines, which is `/scratch`. It's not available on all compute nodes, however (refer to [hardware partitions](slurm/partitions.md)), so if you need it you must ensure that you submit jobs to specific compute nodes using the `--nodelist` option to the `srun`, `sbatch`, and `salloc` commands.
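
As a minimal sketch of how this can look in a batch script (the node name, resource values, and `results` folder are placeholders, not actual cluster names):

```bash
#!/usr/bin/env bash
#SBATCH --job-name=scratch_job
#SBATCH --nodelist=node42   # placeholder: pick a node that actually has /scratch
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# Work in a job-specific folder on the local scratch disk
workdir="/scratch/${USER}/${SLURM_JOB_ID}"
mkdir -p "$workdir"
cd "$workdir"

# ... run the I/O-heavy commands here, writing all temporary output to $workdir ...

# Copy only the final results back to network storage, then clean up
cp -r results "${HOME}/results_${SLURM_JOB_ID}"
cd "$HOME"
rm -rf "$workdir"
```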

@@ -25,7 +25,7 @@ All data owned by your user anywhere under `/scratch` on a particular compute no
## Temporary space
Each job gets a separate and entirely private mount point for temporary data (default location is usually `/tmp` or `/var/tmp`), which is automatically deleted when the job ends. On compute nodes with extra local scratch space it is automatically mounted somewhere under the scratch space instead, so that a lot more space is available for temporary files on those nodes (refer to [hardware partitions](slurm/partitions.md)).

If you need more space for temporary data on compute nodes that have no extra local scratch space, or you need even more temporary space than is available on the local scratch space, it's possible to place it on the Ceph network storage as well. However, if you choose to do so, please see the [best practices](#avoid-small-files-at-all-costs) below. This can be done simply by setting the environment variable `TMPDIR` early in the batch script, e.g. `export TMPDIR=${HOME}/tmp`. Ensure no conflicts can occur within the folder(s) if you run multiple jobs on multiple different compute nodes at once.
If you need more space for temporary data on compute nodes that have no extra local scratch space, or you need even more temporary space than is available on the local scratch space, it's possible to place it on the Ceph network storage as well. However, if you choose to do so, please see the [best practices](#avoid-numerous-small-files-at-all-costs) below. This can be done simply by setting the environment variable `TMPDIR` early in the batch script, e.g. `export TMPDIR=${HOME}/tmp`. Ensure no conflicts can occur within the folder(s) if you run multiple jobs on multiple different compute nodes at once.
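
A minimal sketch of this, reusing the `${HOME}/tmp` location from above and adding the job ID so concurrent jobs don't collide:

```bash
# Job-specific temporary folder on network storage (only if local scratch isn't an option)
export TMPDIR="${HOME}/tmp/${SLURM_JOB_ID}"
mkdir -p "$TMPDIR"

# ... run the tools that need temporary space here ...

# Remove the temporary files again so they don't pile up on the Ceph cluster
rm -rf "$TMPDIR"
```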

## Storage policy
**Your data - your responsibility!**
@@ -38,7 +38,7 @@ Furthermore, if you are no longer employed or studying at AAU, your home folder
If you need to move large amounts of data (or numerous files at once, regardless of size) it will happen in an instant as long as you do it within the same filesystem/mount, for example from `/projects/xxx` to `/projects/yyy`. However, if you need to move things across mount points, for example from `/home/xxx` to `/projects/yyy`, please ask an administrator to do it for you, since moving between mount points will happen over the network and cause unnecessary traffic and load on the storage cluster, as well as being very slow. An administrator can move anything instantly anywhere on the storage cluster, but if you need to transfer external files there is no way around it; the transfer will be limited by the network speed.
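
As an illustration (the paths are the examples from above, the `dataset` folder name is made up):

```bash
# Instant: source and destination are on the same filesystem/mount, only metadata changes
mv /projects/xxx/dataset /projects/yyy/

# Slow and network-heavy: crosses mount points, ask an administrator instead
# mv /home/xxx/dataset /projects/yyy/
```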

## Best practices with Ceph storage
### Avoid small files at all costs
### Avoid numerous small files at all costs
**Wherever possible, always try to write fewer but larger files instead of many small ones**. The Ceph storage cluster uses a file metadata cache server that holds an entry for every single open file (meaning a file being accessed in any way), so the cache memory usage is directly proportional to the number of open files. If there are too many open files (across **ALL** connected clients/nodes at once) the metadata server will ask clients to release some pressure on the cache, and if they fail to do so in time they will simply be [evicted](https://docs.ceph.com/en/reef/cephfs/eviction/) without notice (meaning the Ceph cluster will force an unmount). This can happen on whichever node tips the scale and is thus not a result of the exact storage actions of individual clients, but rather of all nodes connected to the Ceph cluster across the whole university at any one time. See [Ceph best usage practices](https://docs.ceph.com/en/reef/cephfs/app-best-practices/) for additional details.
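
One possible way to do this in practice (file and folder names are purely illustrative) is to bundle many small intermediate files into a single archive before writing them to the Ceph storage:

```bash
# Bundle thousands of small files into one compressed archive on the network storage
tar -czf "${HOME}/results_${SLURM_JOB_ID}.tar.gz" -C "/scratch/${USER}/${SLURM_JOB_ID}" results

# Individual files can later be listed or extracted without unpacking everything
tar -tzf "${HOME}/results_${SLURM_JOB_ID}.tar.gz" | head
tar -xzf "${HOME}/results_${SLURM_JOB_ID}.tar.gz" results/some_file.txt  # hypothetical member name
```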

### Avoid using `ls -l`
