Lumi documentation

This is a fork of gpt-neox specific to the LUMI HPC system.

UPDATES

  • 8.11.2024 - The current version of the upstream repo has an issue with the Llama MLP layer that leads to faulty layer weights; an issue has been filed upstream. For now the default branch has been changed to "revert", which rolls back to an earlier commit that predates the Llama MLP changes (see the clone check below).
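A minimal way to make sure a fresh clone of the fork is on that branch (the clone URL is inferred from the fork's GitHub path; adjust it and the branch name if they change):

git clone -b revert https://github.com/Vmjkom/gpt-neox.git #Clone the fork directly on the revert branch
cd gpt-neox
git branch --show-current #Should print: revert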

Module

All of the dependencies are available in the module PyTorch/2.2.2-rocm-5.6.1-python-3.10-singularity-20240617. The EasyBuild config comes from https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/PyTorch-2.2.2-rocm-5.6.1-python-3.10-singularity-20240617/. The dependencies missing from requirements.txt were pip-installed into the venv inside the module. The module is installed under /projappl/project_462000353/Easybuild, so to get access run:

module --force purge #The LUMI module cannot be loaded when changing the EBU_USER_PREFIX environment variable
export EBU_USER_PREFIX=/projappl/project_462000353/Easybuild #Won't work without export
module load LUMI/23.09 #CrayEnv also works here
module load PyTorch/2.2.2-rocm-5.6.1-python-3.10-singularity-20240617
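As a quick sanity check (a sketch, assuming the module puts a python wrapper on the PATH, as the LUMI singularity-based PyTorch modules normally do), you can confirm on an allocated GPU node that the ROCm build of PyTorch sees the GPUs:

python -c 'import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())' #torch.cuda.* maps to ROCm/HIP devices on AMD GPUs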

Prepare data

To download and prepare the enwik8 dataset with the default GPT2Tokenizer:

module --force purge #This must be done before changing EBU_USER_PREFIX
export EBU_USER_PREFIX=/projappl/project_462000353/Easybuild #Get access to the correct module
module load LUMI/23.09 #CrayEnv also works here
module load PyTorch/2.2.2-rocm-5.6.1-python-3.10-singularity-20240617 #Load the module as in the Module section above
python prepare_data.py -d ./data
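prepare_data.py writes Megatron-style .bin/.idx files under the given data directory; the shared filename prefix (without the extension) is what the training config's "data_path" should point at. A sketch of what to expect for the default enwik8 run (the exact filenames are an assumption, so check the directory after the script finishes):

ls ./data/enwik8/ #Expect something like enwik8_text_document.bin and enwik8_text_document.idx
#The prefix data/enwik8/enwik8_text_document is then used as "data_path" in the training config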

For more examples, see the Datasets section of the upstream gpt-neox README.

Launching a job

You can launch a run simply with sbatch lumi_train.sh; module loading is handled inside the lumi_train.sh script. For debugging/testing it may be wiser to first get an salloc allocation and run lumi_train.sh iteratively inside it. There is an ease-of-use wrapper around salloc, interactive.sh NNODES TIME JOB_NAME. For example, to get a 1-node allocation with a 1-hour time limit for a job named test-neox:

./interactive.sh 1 01:00:00 test-neox

and then:

./lumi_train.sh
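For reference, interactive.sh is essentially a thin wrapper around salloc. A minimal sketch of such a wrapper is shown below; the partition, GPU count and account are assumptions based on this project's LUMI setup, not the contents of the actual script, so read interactive.sh for the real flags:

#!/bin/bash
#Hypothetical sketch of an salloc wrapper, usage: ./interactive.sh NNODES TIME JOB_NAME
NNODES=$1
TIME=$2
JOB_NAME=$3
salloc --nodes="$NNODES" --time="$TIME" --job-name="$JOB_NAME" \
  --partition=standard-g --gpus-per-node=8 \
  --account=project_462000353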

Read through the rest of gpt-neox's official README to get a better idea of how to configure your own jobs.

NOTE

There was trouble using deepy.py with any of the supported launchers on LUMI, so run.py is used instead of deepy.py.
