Could somebody share an sdxl-env.sh that works with a 24GB GPU (3090/4090)? I keep getting CUDA OOM #219
-
Thanks in advance. I've been following this repo's SDXL development for a while now and am excited to finally have time to sit down and test it out. If someone has successfully finetuned on a 3090/4090, please share your config so I have a starting point to work with.

I'm just trying to get it running: I've followed all the instructions I can find here on GitHub, configured accelerate/deepspeed, and edited sdxl-env.sh, and yet I still get CUDA OOM errors on the 3090 and the 4090. The accelerate config follows the example in DEEPSPEED.MD, and in sdxl-env.sh I kept everything the same except:

The training machine is Ubuntu 22.04 with 2x 3090, 1x 4090, and 64GB RAM. The end goal is to use DeepSpeed to split the SDXL model across multiple GPUs (3090s/4090s). Any info towards that goal would be appreciated.

Error output:
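For reference, an accelerate + DeepSpeed config of the kind described in DEEPSPEED.MD has roughly this shape. This is a minimal sketch only; the zero_stage, offload, and process-count values below are illustrative assumptions, not the repo's recommended settings:

```yaml
# Sketch of ~/.cache/huggingface/accelerate/default_config.yaml (values illustrative)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2                    # ZeRO-2: shard optimizer state and gradients
  offload_optimizer_device: cpu    # push optimizer state into system RAM
  offload_param_device: none
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  zero3_init_flag: false
mixed_precision: bf16
num_machines: 1
num_processes: 2                   # one process per GPU taking part in training
machine_rank: 0
main_training_function: main
use_cpu: false
```

With ZeRO stage 2 plus optimizer offload, the optimizer state sits in the 64GB of system RAM rather than on the 24GB cards, which is what this kind of setup is meant to buy.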
-
I believe the PyTorch version has a substantial impact here. Can you share your library versions and your DeepSpeed config?
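Something like the snippet below prints the versions most likely to matter (the module list is the usual one for this stack; adjust as needed), and `accelerate env` will dump the accelerate/DeepSpeed config:

```python
# Print the versions of the libraries most relevant to SDXL + DeepSpeed OOM issues.
import accelerate
import deepspeed
import diffusers
import torch
import transformers

for mod in (torch, deepspeed, accelerate, diffusers, transformers):
    print(f"{mod.__name__}=={mod.__version__}")
print(f"CUDA runtime: {torch.version.cuda}")
```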
-
Is it still possible to train SDXL with 24GB VRAM using the latest release branch, now that the only supported optimizer is Adam? I always get CUDA OOM with DeepSpeed stage 1, and stage 2 is way too slow at 100+ s/it.
8-bit Adam actually wasn't the reason 24G was doable before; that was Adafactor.

But bf16 support can be introduced to any of the optimizers, with the exception of Bits and Bytes, which would require altering that upstream project. For any of the pure-Python optimizers, just open a pull request with a stochastic bf16 variant.
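For anyone wanting to take that on, the usual PyTorch trick for stochastic bf16 rounding looks like the sketch below. This is illustrative code, not something from this repo; the function name and where it would be called inside an optimizer step are assumptions:

```python
import torch

def stochastic_round_to_bf16(x: torch.Tensor) -> torch.Tensor:
    """Stochastically round an fp32 tensor to bf16.

    bf16 is the top 16 bits of the fp32 bit pattern. Adding uniform random
    noise to the low 16 bits before truncating rounds each value up or down
    with probability proportional to its distance from the two neighbouring
    bf16 values, so the rounding error is zero in expectation.
    """
    assert x.dtype == torch.float32
    bits = x.view(torch.int32)                    # reinterpret bits, no copy
    noise = torch.randint_like(bits, 0, 1 << 16)  # random low-order 16 bits
    rounded = (bits + noise) & -65536             # mask 0xFFFF0000: keep top 16 bits
    return rounded.view(torch.float32).to(torch.bfloat16)
```

In a pure-Python optimizer, the step would compute the update in fp32 as usual and write the bf16 parameter back through this rounding instead of a plain `.to(torch.bfloat16)`, so the quantization error averages out over many steps rather than accumulating.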