Small-scale training code for LLaMA, using open-source inference code (LLaMa Inference) and the open-source Pile dataset (Pile data).
- Implement basic training code for LLaMA on a small scale
- Use a small subset of the Pile dataset to train small LLaMA models to perform ablation studies on different model types
- torch
- sentencepiece
- torchvision
- tqdm
- xformers
- A copy of the tokenizer model from Meta to tokenize data (see below)
- A small subset of a Pile train subset (see below)
- Pile validation set (see below)
Note: the following instructions are for using LLaMA-Train on a computer with a GPU. To use the Google Colab notebook supplied under `notebooks`, which provides the same functionality, see the accompanying document.
First, install the requirements with `pip install -r requirements.txt`.
To train and evaluate the model, download data from the Pile. We used a subset of Pile train subset 07 to train our models, and the first 10k sequences from `val` to evaluate each model. Decompress these files to JSONL, store them under the root directory in a folder titled `data`, and rename the train file to `train.jsonl` and the validation file to `val.jsonl`.
Next, we will also need a copy of the tokenizer model from Meta to tokenize data, which can be requested via the Google Form. Store the `tokenizer.model` file under `data` as well.
Finally, set up the local structure by adding the root directory of the project to your Python path. For example, the following command does this in Bash: `export PYTHONPATH="$PWD"`
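The export works because entries on `PYTHONPATH` are added to Python's module search path, making top-level packages at the project root importable from anywhere. The sketch below demonstrates the same effect from inside Python, using a throwaway directory with an empty `main` package standing in for the real repository:

```python
import os
import sys
import tempfile

# Build a fake project root containing an empty `main` package.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "main"))
open(os.path.join(root, "main", "__init__.py"), "w").close()

# Prepending to sys.path has the same effect as the PYTHONPATH export:
# the interpreter now resolves `import main` against the project root.
sys.path.insert(0, root)
import main

print(main.__name__)  # → main
```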
For training, evaluation, and generation, run the corresponding script in the `main/scripts` directory. Each script parses command-line arguments and shows a detailed explanation of each parameter when provided with the `--help` argument.
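The flag names below are illustrative assumptions, not the project's actual parameters, but they sketch the `argparse` pattern each script uses: every flag carries a `help` string, so invoking a script with `--help` prints a description of each parameter.

```python
import argparse

# Hypothetical flags, for illustration only.
parser = argparse.ArgumentParser(
    description="Train a small LLaMA model on a Pile subset."
)
parser.add_argument("--data-dir", default="data",
                    help="folder containing train.jsonl and val.jsonl")
parser.add_argument("--batch-size", type=int, default=8,
                    help="sequences per optimization step")

# Equivalent to running:  python main/scripts/<script>.py --batch-size 4
args = parser.parse_args(["--batch-size", "4"])
print(args.batch_size)  # → 4
```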
- `main/llama` contains the model, tokenizer, and generation code, which is based on LLaMa Inference, heavily modified to fit the goals of this project
- `main/util` contains data loading and processing, metric computation (loss calculation), and checkpointing code
- `main/scripts` contains scripts to run training, evaluation, and inference for various model parameters
- `notebooks` contains notebooks used to train models on Google Colab, using the same code as available in `src`
This project utilizes large portions of the LLaMa Inference code. See the License file.