Benchmark
Vladimir Mandic edited this page Nov 13, 2023
·
17 revisions
- Hardware: nVidia RTX 4090 with i9-12900KF
- Packages: Torch 2.1.0 with CUDA 12.1 and cuDNN 8.9
- Params: model=SD15 | batch-size=4 | batch-count=4 | steps=50 | resolution=512px | sampler=Euler A
| Precision | Params | Diffusers SDP | Diffusers xFormers | Original SDP | Original xFormers | Original None |
|---|---|---|---|---|---|---|
| FP32 | Default | 33.0 | | 20.0 | | |
| BF16 | Default | 73.0 | | 45.5 | | |
| FP16 | Default | 73.0 | 75.0 | 48.0 | 48.6 | 17.3 |
| | NHWC (channels last) | 72.0 | | | | |
| | HyperTile (256) | 79.0 | | | | |
| | ToMe (0.5) | 77.0 | | | | |
| | Model no-move (medvram) | 85.0 | | | | |
| | VAE no-slicing, no-tiling | 73.8 | | | | |
| | Sequential offload (lowvram) | 27.0 | | | | |
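
As a rough illustration, the relative differences between rows of the table above can be computed directly. This is only a sketch: it assumes the two FP32 values are Diffusers SDP and Original SDP, and that the single-value rows are Diffusers SDP results in FP16. `relative_change` is a hypothetical helper, not part of the benchmarked software.

```python
def relative_change(baseline: float, value: float) -> float:
    """Return the change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Values copied from the table; column assignment is an assumption (see above).
FP16_DIFFUSERS_SDP = 73.0
FP16_ORIGINAL_SDP = 48.0

comparisons = {
    "FP32 vs FP16 (Diffusers)": (FP16_DIFFUSERS_SDP, 33.0),
    "FP32 vs FP16 (Original)": (FP16_ORIGINAL_SDP, 20.0),
    "Sequential offload (lowvram)": (FP16_DIFFUSERS_SDP, 27.0),
    "Model no-move (medvram)": (FP16_DIFFUSERS_SDP, 85.0),
    "HyperTile (256)": (FP16_DIFFUSERS_SDP, 79.0),
}

for name, (baseline, value) in comparisons.items():
    print(f"{name}: {relative_change(baseline, value):+.1f}%")
```

Negative numbers are slowdowns relative to the FP16 default, positive numbers are gains.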
- Test matrix is not full as some options can be combined (e.g. cuDNN + HyperTile) while others cannot (e.g. HyperTile + ToMe)
- Results may differ on different GPU/CPU combinations
  For example, pairing a better CPU with an older GPU may benefit from more processing being done on the CPU, leaving the GPU to do only core ML tasks, while pairing a high-end GPU with an older CPU may give lower results since the CPU cannot feed tasks to the GPU fast enough
- Diffusers perform significantly better than the original backend on modern hardware since tasks remain on the GPU for longer
  Equally, the original backend may perform better on older hardware
- Running quick tasks, such as generating a single image at low steps, may not be sufficient to fully saturate a high-end GPU, so results will be lower
- xFormers has a slight performance advantage over SDP
  However, SDP is built into Torch and "just works", while xFormers needs a manual install and is highly version-dependent
- Some extensions can add significant overhead to pre/post processing even if they are not used
- Not worth consideration: cuDNN, NHWC, inference mode, eval
  - cuDNN full bench finds the best math algorithm for a specific GPU, but the default is nearly identical
  - channels-last should better trigger utilization of tensor cores, but in practice the result is nearly identical
  - inference-mode should have more optimizations than the default no_grad, but in practice the result is nearly identical
  - eval mode should allow for removal of some params in the model, but in practice the result is nearly identical
- The benefit of BF16 over FP16 is not so much performance as its ability to represent a wider numerical range, so it can perform calculations where FP16 may result in NaN
- Running in FP32 results in a roughly 60% performance drop: if you need FP32, you're leaving a lot on the table
- The cost of using lowvram is very high as it needs to swap parts of the model in memory. Even using medvram comes at a noticeable cost
- Best: xFormers, FP16, HyperTile, no-model-move, no-slicing/tiling
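
The BF16 versus FP16 range difference noted above can be sketched with the standard library alone. This is an illustration, not how Torch implements either type: `to_fp16` and `to_bf16` are hypothetical helpers, and the BF16 conversion here simply truncates a float32 to its top 16 bits (real hardware usually rounds instead).

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range (max 65504)
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (1000.0, 65504.0, 70000.0, 1.0e6):
    print(f"{value:>10}: fp16={to_fp16(value)}  bf16={to_bf16(value)}")

# Once a value has overflowed to inf, follow-up arithmetic can produce NaN.
overflowed = to_fp16(70000.0)
print("inf - inf =", overflowed - overflowed)
```

BF16 keeps the 8-bit exponent of FP32 and gives up mantissa bits instead, so large intermediate values stay finite at the cost of precision.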
Compile type | Performance | Overhead (s) |
---|---|---|
cudnn/default | 73.5 | 4 |
inductor/default | 89.0 | 40 |
inductor/reduce-overhead | 92.0 | 40 |
inductor/max-autotune | 91.0 | 220 |
nvfuser/default | 84.0 | 5 |
cudagraphs/reduce-overhead | 85.0 | 14 |
stable-fast/sdp | 96.0 | 76 |
stable-fast/xformers | 96.0 | 101 |
stable-fast/full-graph | 94.0 | 96 |
- Overhead is the time in seconds needed to optimize a model with specific params
  Model needs a compile on initial generate, but it may also need a recompile if params such as resolution or batch size change
- Model compile may not be compatible with any method that modifies the underlying model, including loading LoRA weights on top of a model
- stable-fast compile backend requires that the package is manually installed on the system
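
Whether a compile backend pays off depends on how much work is done before the next recompile. A minimal sketch of that trade-off follows. It assumes the Performance column is in iterations per second, which the page does not state, and `break_even_iterations` is a hypothetical helper.

```python
import math


def break_even_iterations(base_rate: float, compiled_rate: float, overhead_s: float) -> float:
    """Iterations needed before the compile overhead is recovered.

    Rates are assumed to be iterations per second; overhead is in seconds.
    Returns infinity when the compiled model is not faster than the baseline.
    """
    saved_per_iteration = 1.0 / base_rate - 1.0 / compiled_rate
    if saved_per_iteration <= 0:
        return math.inf
    return overhead_s / saved_per_iteration


BASELINE = 73.5  # cudnn/default row of the table above

# (performance, overhead in seconds), copied from the table above
backends = {
    "inductor/default": (89.0, 40),
    "inductor/reduce-overhead": (92.0, 40),
    "inductor/max-autotune": (91.0, 220),
    "stable-fast/sdp": (96.0, 76),
}

for name, (rate, overhead) in backends.items():
    n = break_even_iterations(BASELINE, rate, overhead)
    print(f"{name}: about {n:,.0f} iterations to recover {overhead}s of compile time")
```

Under these assumptions, a change of resolution or batch size that triggers a recompile restarts the count, so compile is most attractive for long runs with fixed params.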