Lumi documentation

This is a fork of gpt-neox specific to the LUMI HPC system.

UPDATES

  • 8.11.2024 - The current version of the upstream repo has an issue with the Llama MLP layer that leads to faulty layer weights; an issue has been filed upstream. For now the default branch has been changed to "revert", which rolls back to an earlier commit that predates the Llama MLP changes (see the clone check below).
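A minimal way to make sure a fresh clone of the fork is on that branch (the clone URL is inferred from the fork's GitHub path; adjust it and the branch name if they change):

git clone -b revert https://github.com/Vmjkom/gpt-neox.git #Clone the fork directly on the revert branch
cd gpt-neox
git branch --show-current #Should print: revert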

Module

All of the dependencies are available in the module PyTorch/2.2.2-rocm-5.6.1-python-3.10-singularity-20240617. The EasyBuild config comes from https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/PyTorch-2.2.2-rocm-5.6.1-python-3.10-singularity-20240617/. The dependencies missing from requirements.txt were pip-installed into the venv inside the module. The module is installed under /projappl/project_462000353/Easybuild, so to get access run:

module --force purge #The LUMI module cannot be loaded when changing the EBU_USER_PREFIX environment variable
export EBU_USER_PREFIX=/projappl/project_462000353/Easybuild #Won't work without export
module load LUMI/23.09 #CrayEnv also works here
module load PyTorch/2.2.2-rocm-5.6.1-python-3.10-singularity-20240617
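As a quick sanity check (a sketch, assuming the module puts a python wrapper on the PATH, as the LUMI singularity-based PyTorch modules normally do), you can confirm on an allocated GPU node that the ROCm build of PyTorch sees the GPUs:

python -c 'import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())' #torch.cuda.* maps to ROCm/HIP devices on AMD GPUs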

Prepare data

To download and prepare the enwik8 dataset with the default GPT2Tokenizer:

module --force purge #This must be done before changing EBU_USER_PREFIX
export EBU_USER_PREFIX=/projappl/project_462000353/Easybuild #Get access to the correct module
module load LUMI/23.09 #CrayEnv also works here
module load PyTorch/2.2.2-rocm-5.6.1-python-3.10-singularity-20240617 #Load the module as in the Module section above
python prepare_data.py -d ./data
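prepare_data.py writes Megatron-style .bin/.idx files under the given data directory; the shared filename prefix (without the extension) is what the training config's "data_path" should point at. A sketch of what to expect for the default enwik8 run (the exact filenames are an assumption, so check the directory after the script finishes):

ls ./data/enwik8/ #Expect something like enwik8_text_document.bin and enwik8_text_document.idx
#The prefix data/enwik8/enwik8_text_document is then used as "data_path" in the training config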

For more examples, see the Datasets section of the upstream gpt-neox README.

Launching a job

You can launch a run simply with sbatch lumi_train.sh; module loading is handled inside the lumi_train.sh script. For debugging/testing it may be wiser to first get an salloc allocation and run lumi_train.sh iteratively inside it. There is an ease-of-use wrapper around salloc, interactive.sh NNODES TIME JOB_NAME. For example, to get a 1-node allocation with a 1-hour time limit for a job named test-neox:

./interactive.sh 1 01:00:00 test-neox

and then:

./lumi_train.sh
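For reference, interactive.sh is essentially a thin wrapper around salloc. A minimal sketch of such a wrapper is shown below; the partition, GPU count and account are assumptions based on this project's LUMI setup, not the contents of the actual script, so read interactive.sh for the real flags:

#!/bin/bash
#Hypothetical sketch of an salloc wrapper, usage: ./interactive.sh NNODES TIME JOB_NAME
NNODES=$1
TIME=$2
JOB_NAME=$3
salloc --nodes="$NNODES" --time="$TIME" --job-name="$JOB_NAME" \
  --partition=standard-g --gpus-per-node=8 \
  --account=project_462000353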

Read through the rest of gpt-neox's official README to get a better idea of how to configure your own jobs.

NOTE

There was trouble using deepy.py with any of the supported launchers on LUMI, so run.py is used instead of deepy.py.
