From 8f8e1cf06c1aa5b5bd72bddfc1b9539aac91c952 Mon Sep 17 00:00:00 2001
From: bghira
Date: Tue, 3 Dec 2024 16:14:41 -0600
Subject: [PATCH] add gradient checkpointing option to docs

---
 OPTIONS.md                       | 9 +++++++++
 documentation/quickstart/FLUX.md | 5 +++++
 2 files changed, 14 insertions(+)

diff --git a/OPTIONS.md b/OPTIONS.md
index d0069deb..79c77603 100644
--- a/OPTIONS.md
+++ b/OPTIONS.md
@@ -40,6 +40,15 @@ The script `configure.py` in the project root can be used via `python configure.
 - **What**: Path to the pretrained T5 model or its identifier from https://huggingface.co/models.
 - **Why**: When training PixArt, you might want to use a specific source for your T5 weights so that you can avoid downloading them multiple times when switching the base model you train from.
 
+### `--gradient_checkpointing`
+
+- **What**: During training, intermediate activations are discarded and recomputed block by block during the backward pass, reducing peak VRAM requirements at the cost of slower training.
+
+### `--gradient_checkpointing_interval`
+
+- **What**: Checkpoint only every _n_th block, where _n_ is a value greater than zero. A value of 1 is effectively the same as leaving `--gradient_checkpointing` enabled on its own, and a value of 2 will checkpoint every other block.
+- **Note**: SDXL and Flux are currently the only models that support this option. SDXL uses a somewhat hackish implementation.
+
 ### `--refiner_training`
 
 - **What**: Enables training a custom mixture-of-experts model series. See [Mixture-of-Experts](/documentation/MIXTURE_OF_EXPERTS.md) for more information on these options.
diff --git a/documentation/quickstart/FLUX.md b/documentation/quickstart/FLUX.md
index 6b2c60b1..ee5b98cc 100644
--- a/documentation/quickstart/FLUX.md
+++ b/documentation/quickstart/FLUX.md
@@ -144,6 +144,8 @@ There, you will possibly need to modify the following variables:
 - This option causes update steps to be accumulated over several steps. This will increase the training runtime linearly, such that a value of 2 will make your training run half as quickly, and take twice as long.
 - `optimizer` - Beginners are recommended to stick with adamw_bf16, though optimi-lion and optimi-stableadamw are also good choices.
 - `mixed_precision` - Beginners should keep this in `bf16`
+- `gradient_checkpointing` - Set this to true in practically every situation, on every device.
+- `gradient_checkpointing_interval` - On larger GPUs, this can be set to 2 or higher to checkpoint only every _n_th block. A value of 2 checkpoints half of the blocks, and a value of 3 checkpoints one-third of them.
 
 Multi-GPU users can reference [this document](/OPTIONS.md#environment-configuration-variables) for information on configuring the number of GPUs to use.
 
@@ -415,6 +417,9 @@ Currently, the lowest VRAM utilisation (9090M) can be attained with:
 - PyTorch: 2.6 Nightly (Sept 29th build)
 - Using `--quantize_via=cpu` to avoid outOfMemory error during startup on <=16G cards.
 - With `--attention_mechanism=sageattention` to further reduce VRAM by 0.1GB and improve training validation image generation speed.
+- Be sure to enable `--gradient_checkpointing`, or nothing else you do will prevent out-of-memory errors.
+
+**NOTE**: Pre-caching of VAE embeds and text encoder outputs may use more memory and still OOM. If so, text encoder quantisation and VAE tiling can be enabled.
 
 Speed was approximately 1.4 iterations per second on a 4090.
 
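
As a supplement to the documentation added above, the following is a minimal PyTorch sketch of what interval-based gradient checkpointing means for a stack of transformer blocks: with an interval of 2, every other block is wrapped in `torch.utils.checkpoint.checkpoint`, so its activations are recomputed during the backward pass instead of being held in VRAM. The `Block` and `CheckpointedStack` classes, layer sizes, and the `interval` argument are illustrative assumptions for this sketch, not the trainer's actual implementation.

```python
# Minimal sketch of interval-based gradient checkpointing.
# All class names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Stand-in for a transformer block."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, x):
        return x + self.ff(x)


class CheckpointedStack(nn.Module):
    def __init__(self, num_blocks: int = 12, interval: int = 2, dim: int = 64):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(num_blocks)])
        # interval=1 checkpoints every block; interval=2 checkpoints every other block.
        self.interval = interval

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if self.training and i % self.interval == 0:
                # Activations inside this block are discarded after the forward
                # pass and recomputed during backward, trading compute for VRAM.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


if __name__ == "__main__":
    model = CheckpointedStack(num_blocks=12, interval=2).train()
    x = torch.randn(4, 16, 64)
    model(x).sum().backward()  # roughly half of the blocks are checkpointed
```

With a larger interval, fewer blocks are recomputed, so training runs faster but peak VRAM usage rises; interval checkpointing is simply a way to pick a point on that trade-off curve rather than taking the all-or-nothing behaviour of plain `--gradient_checkpointing`.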