Wrap long command line arguments in redpajama docs (#655)
rasbt authored Oct 19, 2023
1 parent b1e4bac commit cf91ad2
Showing 1 changed file with 20 additions and 6 deletions.
tutorials/pretrain_redpajama.md
@@ -40,7 +40,8 @@ git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T data/RedPajama-Data-1T

```bash
# The 1 billion token subset
-git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample data/RedPajama-Data-1T-Sample
+git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample \
+data/RedPajama-Data-1T-Sample
```
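Either clone can consume substantial disk space (the full 1T dataset especially), so it may be worth checking free space before starting (a generic precaution, not part of the original tutorial):

```bash
# Show free space on the filesystem that will hold the dataset.
df -h data/
```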

## Prepare RedPajama for training
@@ -52,19 +53,28 @@ streaming dataset that comes with lit-gpt. You will need to have the tokenizer c
```bash
pip install huggingface_hub sentencepiece

-python scripts/download.py --repo_id meta-llama/Llama-2-7b-chat-hf --access_token your_hf_token
+python scripts/download.py \
+--repo_id meta-llama/Llama-2-7b-chat-hf \
+--access_token your_hf_token
```

Then, run

```bash
-python scripts/prepare_redpajama.py --source_path data/RedPajama-Data-1T --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf/ --destination_path data/lit-redpajama
+python scripts/prepare_redpajama.py \
+--source_path data/RedPajama-Data-1T \
+--checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf/ \
+--destination_path data/lit-redpajama
```

or

```bash
-python scripts/prepare_redpajama.py --source_path data/RedPajama-Data-1T-Sample --checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf/ --destination_path data/lit-redpajama-sample --sample True
+python scripts/prepare_redpajama.py \
+--source_path data/RedPajama-Data-1T-Sample \
+--checkpoint_dir checkpoints/meta-llama/Llama-2-7b-hf/ \
+--destination_path data/lit-redpajama-sample \
+--sample True
```

for the sample dataset.
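Before launching pretraining, a quick listing of the destination folder confirms that the preparation step wrote output there (a minimal sketch; the exact file names produced by `prepare_redpajama.py` are not shown in this diff and are left unspecified):

```bash
# The tokenized shards should appear under the --destination_path given above.
ls -lh data/lit-redpajama-sample
```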
@@ -78,13 +88,17 @@ The script will take a while to run, so time for :tea: (The 1B sample script tak
Running the pretraining script with its default settings requires at least 4 GPUs with 40GB+ each (A100).

```bash
-python pretrain/redpajama.py --devices 4 --train_data_dir data/lit-redpajama
+python pretrain/redpajama.py \
+--devices 4 \
+--train_data_dir data/lit-redpajama
```

For running on the sample dataset:

```bash
-python pretrain/redpajama.py --devices 4 --train_data_dir data/lit-redpajama-sample
+python pretrain/redpajama.py \
+--devices 4 \
+--train_data_dir data/lit-redpajama-sample
```

The script will save checkpoints periodically to the folder `out/`.
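Since checkpoints are written periodically, a simple way to follow progress from a second terminal is to re-list the output folder (a generic sketch; the exact layout under `out/` depends on the pretraining script):

```bash
# Refresh a listing of out/ every 60 seconds as training runs.
watch -n 60 ls -lh out/
```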
