Update readme + fix EBU_USER_PREFIX
Vmjkom committed Aug 7, 2024
1 parent 8e75151 commit df35d54
Showing 3 changed files with 28 additions and 12 deletions.
30 changes: 23 additions & 7 deletions README.md
@@ -3,26 +3,42 @@

# Lumi documentation

## Setup
All of the dependencies are readily available in the module PyTorch/2.2.2-rocm-5.6.1-python-3.10-singularity-20240404
The module is located in `/projappl/project_462000319/villekom/modules`, so to get access just
## Module
All of the dependencies are available in the module PyTorch/2.2.2-rocm-5.6.1-python-3.10-singularity-20240617.
The EasyBuild config is from https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/PyTorch-2.2.2-rocm-5.6.1-python-3.10-singularity-20240617/
The dependencies missing from the module were pip-installed from [`requirements.txt`](./requirements/lumi_requirements.txt) into the venv inside the module.
The module is located in `/projappl/project_462000353/Easybuild`, so to get access, run:
```
module use /projappl/project_462000319/villekom/modules
module load PyTorch/2.2.2-rocm-5.6.1-python-3.10-singularity-20240404
module --force purge # The LUMI module cannot be loaded when changing the EBU_USER_PREFIX environment variable
export EBU_USER_PREFIX=/projappl/project_462000353/Easybuild # Won't work without export
module load LUMI/23.09 # CrayEnv also works here
module load PyTorch/2.2.2-rocm-5.6.1-python-3.10-singularity-20240617
```
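
After loading, a quick sanity check along these lines should confirm the environment is usable (this assumes the module exposes a `python` wrapper for the container; on a login node there are no GPUs, so only check the import there):
```
module list  # the PyTorch module and LUMI/23.09 should show up
python -c "import torch; print(torch.__version__)"  # torch should import from the module's environment
```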

## Prepare data

To download and tokenize the enwiki8 dataset follow the steps in the original README under [`Datasets`](#datasets)
To download and prepare the enwiki8 dataset with the default GPT2Tokenizer:
```
module --force purge # This must be done before changing EBU_USER_PREFIX
export EBU_USER_PREFIX=/projappl/project_462000353/Easybuild # Get access to the correct module
python prepare_data.py -d ./data
```

For more examples, see [`Datasets`](#datasets).
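
As a sketch of the non-default options: upstream gpt-neox's `prepare_data.py` accepts a tokenizer type and vocab file on the command line, but the exact flags may differ in this fork, so verify with `python prepare_data.py -h` before relying on them:
```
# Assumed flags (from upstream gpt-neox): -t selects the tokenizer, -v points at its vocab/tokenizer file
python prepare_data.py -d ./data -t HFTokenizer -v /path/to/tokenizer.json
```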

## Launching a job
Both of these steps are done for you in lumi_train.sh, so to start you can just `sbatch lumi_train.sh`
You can launch a run simply with `sbatch lumi_train.sh`. Module loading is handled inside the `lumi_train.sh` script.
For debugging/testing it might be wiser to first get an `salloc` allocation and run `lumi_train.sh` iteratively through it. There is an ease-of-use script for `salloc`, [`interactive.sh`](./interactive.sh), which takes the parameters NNODES TIME JOB_NAME. For example, to get a 1-node allocation with a 1-hour runtime:
```
./interactive.sh 1 01:00:00 test-neox
```
and then:
```
./lumi_train.sh
```
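
For reference, [`interactive.sh`](./interactive.sh) is presumably a thin wrapper around `salloc`; the allocation above should correspond roughly to the following (the partition is an assumption here, adjust it to whatever the script actually uses):
```
# Rough salloc equivalent of ./interactive.sh 1 01:00:00 test-neox (a sketch, not the script's exact contents)
salloc --nodes=1 --time=01:00:00 --job-name=test-neox \
       --partition=dev-g --gpus-per-node=8 --account=project_462000353
```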
Read through the rest of gpt-neox's official README to get a better idea of how to configure your own jobs.
### NOTE

There was trouble using `deepy.py` with any of the supported launchers on Lumi, so we use `run.py` instead of `deepy.py`.

# End of Lumi-specific README
File renamed without changes.
10 changes: 5 additions & 5 deletions lumi_train.sh
@@ -1,12 +1,12 @@
#!/bin/bash
#SBATCH --job-name=test_neox
#SBATCH --nodes=1
#SBATCH --job-name=neox_1.8B_europa
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --mem=480G
#SBATCH --exclusive
#SBATCH --partition=dev-g
#SBATCH --time=00:10:00
#SBATCH --partition=standard-g
#SBATCH --time=02-00:00:00
#SBATCH --gpus-per-node=8
#SBATCH --account=project_462000353
#SBATCH --output=logs/%x-%j.out
@@ -17,7 +17,7 @@ ln -f -s $SLURM_JOB_NAME-$SLURM_JOB_ID.out logs/latest.out
ln -f -s $SLURM_JOB_NAME-$SLURM_JOB_ID.err logs/latest.err

module --force purge
EBU_USER_PREFIX=/projappl/project_462000353/Easybuild
export EBU_USER_PREFIX=/projappl/project_462000353/Easybuild
module load LUMI/23.09
module load PyTorch/2.2.2-rocm-5.6.1-python-3.10-singularity-20240617

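The `#SBATCH` values above are sized for a full 32-node (256-GPU) production run. For a quick smoke test you can override them at submission time, since options passed to `sbatch` on the command line take precedence over the directives inside the script:
```
# Hypothetical smaller test run; these flags override the #SBATCH lines inside lumi_train.sh
sbatch --nodes=1 --partition=dev-g --time=00:30:00 lumi_train.sh
```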
