Dealing with out-of-memory (OOM) errors

If you get this error while running a script:

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.22 GiB. GPU 0 has a total capacity of 79.15 GiB of which 228.38 MiB is free. Including non-PyTorch memory, this process
has 78.93 GiB memory in use. Of the allocated memory 76.28 GiB is allocated by PyTorch, and 2.14 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory
is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

it means that your GPU memory size wasn't big enough for the model and script configuration.

Here are a few things you can try:

Reduce the micro batch size

Adjust the --train.micro_batch_size argument in the fine-tuning and pretraining scripts. This variable determines the number of samples loaded per iteration.

A smaller value will simply load fewer samples simultaneously. The minimum value is 1.

Experiment with different micro batch sizes to find a balance between memory consumption and computational efficiency. Smaller micro batch sizes consume less memory but may slow down training, because more iterations are needed to process the same number of samples. Conversely, larger micro batch sizes require more memory but can speed up training.
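
The experiment described above can be automated with a simple halving search. The following is a minimal sketch and not part of the scripts: run_step is a hypothetical callable standing in for one training iteration, Python's built-in MemoryError stands in for the CUDA out-of-memory error a real run would raise, and fake_step uses a made-up memory model purely for illustration.

```python
from typing import Callable


def largest_fitting_micro_batch_size(run_step: Callable[[int], None], start: int) -> int:
    """Halve the micro batch size until one iteration fits in memory.

    `run_step` is a hypothetical stand-in for a single training iteration;
    it must raise MemoryError when the given micro batch size does not fit.
    """
    size = start
    while size >= 1:
        try:
            run_step(size)
            return size
        except MemoryError:
            size //= 2  # try half as many samples per iteration
    raise MemoryError("even a micro batch size of 1 does not fit")


def fake_step(micro_batch_size: int, budget_gb: float = 20.0) -> None:
    # Toy memory model (an assumption for illustration only):
    # 12 GB fixed cost plus 3 GB per sample in the micro batch.
    needed = 12.0 + 3.0 * micro_batch_size
    if needed > budget_gb:
        raise MemoryError(f"needs {needed} GB, only {budget_gb} GB available")


print(largest_fitting_micro_batch_size(fake_step, start=16))
```

In a real run you would pass the value found this way to --train.micro_batch_size. If even a value of 1 does not fit, move on to the other suggestions in this document.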

Reduce the model's context length

The context length (block_size in the code) plays a significant role in the memory usage of models with attention, since longer sequences produce larger activations.

The pretraining scripts are configured to use the full context length of the model to train.
The finetuning scripts are configured to use the length of the longest sample in the training data (--train.max_seq_length argument) to avoid allocating unnecessary memory. An error is raised if that length exceeds the model's context length, or if you try to run a batch that is longer than it.

However, your hardware may not support such large context lengths. Here's what you can do:

For the pretraining scripts, you can simply reduce the Config(block_size=...) value.
For the finetuning scripts, you can trim the length of the samples in your dataset. Most of the scripts/prepare_*.py scripts expose a --max_seq_length=... argument. This might also be useful in cases where sample lengths are highly unbalanced, as the presence of a single very long sample would incur a larger memory usage for all other shorter samples. For example, the median length of the samples in Alpaca is 110 tokens. Truncating the Alpaca dataset to 256 max tokens reduces the memory requirements of a Falcon 7B model from 23.52 GB to 15.73 GB. For more information about the dataset truncation, please see the Truncating datasets section in the prepare_datasets.md tutorial.

Keep in mind that reducing the context length will affect the modelling performance on text sequences longer than the limit.
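
To see why a single long sample matters, the sketch below counts how many tokens a batch holds when every sample is padded to the longest one, before and after truncation. It is an illustration under stated assumptions, not the code used by the prepare scripts: truncate and padded_batch_tokens are hypothetical helpers, and the sample lengths are invented.

```python
def truncate(samples: list[list[int]], max_seq_length: int) -> list[list[int]]:
    """Cut every tokenized sample down to at most `max_seq_length` tokens."""
    return [sample[:max_seq_length] for sample in samples]


def padded_batch_tokens(samples: list[list[int]]) -> int:
    """Tokens held in memory when every sample is padded to the longest one."""
    longest = max(len(sample) for sample in samples)
    return longest * len(samples)


# Hypothetical token-id lists: three short samples and one very long outlier.
samples = [[1] * 100, [1] * 110, [1] * 120, [1] * 2000]

print(padded_batch_tokens(samples))                 # 4 * 2000 = 8000
print(padded_batch_tokens(truncate(samples, 256)))  # 4 * 256 = 1024
```

Only the outlier loses content here. The three short samples are unchanged, yet the padded batch shrinks to about an eighth of its original size.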

Use lower precision

Our scripts expose the --precision argument, which directly impacts the memory usage.

Using true lower precision (16-true, bf16-true) halves the memory usage compared to 32-true. However, the model might start producing NaNs due to the limited range of representable values.

Mixed precision training (16-mixed, bf16-mixed) provides better stability but offers limited memory reduction.
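
The difference between these settings can be made concrete with some back-of-the-envelope arithmetic for the weights alone. This is a rough sketch: the 7B parameter count is illustrative, and it assumes that mixed precision keeps the weights in 32-bit and only runs the computation in 16-bit, which would explain the limited saving.

```python
# Bytes used to store one weight under each --precision setting.
# Assumption: with mixed precision the weights stay in 32-bit and only the
# computation runs in 16-bit, which is why the memory saving is limited.
BYTES_PER_WEIGHT = {
    "32-true": 4,
    "16-true": 2,
    "bf16-true": 2,
    "16-mixed": 4,
    "bf16-mixed": 4,
}


def weight_memory_gb(num_parameters: int, precision: str) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_parameters * BYTES_PER_WEIGHT[precision] / 1e9


num_parameters = 7_000_000_000  # illustrative 7B-parameter model
for precision in BYTES_PER_WEIGHT:
    print(f"{precision:>10}: {weight_memory_gb(num_parameters, precision):.1f} GB")
```

Gradients, optimizer states and activations come on top of these numbers, so treat them as a lower bound rather than a prediction of total usage.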

Do sharding across multiple GPUs

For exceptionally large models, the techniques above might still not suffice. If you have multiple GPUs available, you can reduce per-GPU memory at some cost in speed by changing the --devices 1 argument in the scripts to a larger value. Using more than one device enables a parallelism technique (FSDP) that shards the memory across the GPUs.

The default configuration already uses activation checkpointing. You can additionally enable CPU offloading by changing the cpu_offload=False argument in the scripts to cpu_offload=True.
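
As a rough illustration of what sharding buys, the sketch below estimates the per-GPU memory for weights, gradients and optimizer states as the number of devices grows. It is a simplification under stated assumptions, not a description of how FSDP allocates memory: per_gpu_state_gb is a hypothetical helper, the 7B parameter count is illustrative, everything is assumed to split evenly, and activations and temporary buffers are ignored.

```python
def per_gpu_state_gb(num_parameters: int, devices: int,
                     bytes_per_weight: int = 4, optimizer_states: int = 2) -> float:
    """Rough per-GPU memory for weights, gradients and optimizer states, in GB.

    Assumes everything is split evenly across `devices` and ignores
    activations and temporary buffers, so real usage will be higher.
    """
    copies = 1 + 1 + optimizer_states  # weights + gradients + optimizer states
    total_bytes = num_parameters * bytes_per_weight * copies
    return total_bytes / devices / 1e9


num_parameters = 7_000_000_000  # illustrative 7B-parameter model
for devices in (1, 2, 4, 8):
    print(f"--devices {devices}: {per_gpu_state_gb(num_parameters, devices):.1f} GB per GPU")
```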

Try a different optimizer

Our scripts use the AdamW optimizer. It maintains 2 states for each trainable parameter of the model, meaning that the optimizer memory is double that of an optimizer that keeps a single state, such as SGD with momentum.

You can try replacing it with an optimizer of your choice that has lighter memory requirements, such as the recently published Sophia or Lion optimizers. Keep in mind that different optimizers have distinct optimization behaviors, so it's essential to assess their impact on the training process and model performance.

This suggestion is particularly relevant for pretraining. In the fine-tuning scripts, the trainable parameters are only a small subset of the total, so the optimizer states take up comparatively little memory.
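
The arithmetic behind this can be sketched as follows. The numbers are illustrative assumptions, not measurements: a 7B-parameter model with all parameters trainable for pretraining, a hypothetical 20M trainable parameters for fine-tuning, and 4 bytes per optimizer state value.

```python
# Number of extra values each optimizer keeps per trainable parameter.
STATES_PER_PARAMETER = {
    "AdamW": 2,           # first and second moment estimates
    "SGD (momentum)": 1,  # momentum buffer
    "SGD (plain)": 0,     # no optimizer state
}


def optimizer_state_gb(trainable_parameters: int, optimizer: str,
                       bytes_per_state: int = 4) -> float:
    """Memory taken by optimizer states alone, in GB (10**9 bytes)."""
    states = STATES_PER_PARAMETER[optimizer]
    return trainable_parameters * states * bytes_per_state / 1e9


pretraining = 7_000_000_000  # all parameters trainable (illustrative)
finetuning = 20_000_000      # hypothetical small trainable subset

for name in STATES_PER_PARAMETER:
    print(f"{name:>15}: pretraining {optimizer_state_gb(pretraining, name):6.2f} GB, "
          f"fine-tuning {optimizer_state_gb(finetuning, name):5.2f} GB")
```

With these assumptions, switching optimizers saves tens of GB in pretraining but only a fraction of a GB when few parameters are trainable.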
