Benchmark
Disty0 edited this page Jan 11, 2024 · 17 revisions
- Hardware: NVIDIA RTX 4090 with i9-12900KF
- Packages: Torch 2.1.0 with CUDA 12.1 and cuDNN 8.9
- Params: model=SD15 | batch-size=4 | batch-count=4 | steps=50 | resolution=512px | sampler=Euler A
Diffusers | Original | ||||||
---|---|---|---|---|---|---|---|
Precision | Params | SDP | xFormers | SDP | xFormers | None | |
FP32 | Default | 33.0 | 20.0 | ||||
BF16 | Default | 73.0 | 45.5 | ||||
FP16 | Default | 73.0 | 75.0 | 48.0 | 48.6 | 17.3 | |
NHWC (channels last) | 72.0 | ||||||
HyperTile (256) | 79.0 | ||||||
ToMe (0.5) | 77.0 | ||||||
Model no-move (medvram) | 85.0 | ||||||
VAE no-slicing, no-tiling | 73.8 | ||||||
Sequential offload (lowvram) | 27.0 |
- All numbers are in it/s and higher is better
- Test matrix is not full as some options can be combined together (e.g. cuDNN + HyperTile) while others cannot (e.g. HyperTile + ToMe)
- Results may differ on different GPU/CPU combinations
  For example, pairing a newer CPU with an older GPU may benefit from more processing being done on the CPU, leaving the GPU to handle only core ML tasks, while pairing a high-end GPU with an older CPU may give lower results since the CPU cannot feed tasks to the GPU fast enough
- Diffusers backend performs significantly better than the original backend on modern hardware since tasks remain on the GPU for longer
  Equally, the original backend may perform better on older hardware
- Running quick tasks such as a single image generation at low step counts may not be sufficient to fully saturate a high-end GPU, so results will be lower
- xFormers has a slight performance advantage over SDP
  However, SDP is built into Torch and "just works", while xFormers needs a manual install and is highly version dependent
- Some extensions can add significant overhead to pre/post processing even if they are not used
- Not worth consideration: cuDNN, NHWC, inference mode, eval
  - cuDNN full bench finds the best math algorithm for a specific GPU, but the default is nearly identical
  - channels-last should better trigger utilization of tensor cores, but in practice the result is nearly identical
  - inference-mode should have more optimizations than the default no_grad, but in practice the result is nearly identical
  - eval mode should allow for removal of some params in the model, but in practice the result is nearly identical
- The benefit of BF16 over FP16 is not so much performance as its wider numerical range, so it can perform calculations where FP16 may result in NaN
- Running in FP32 results in a performance drop of close to 60% compared to FP16 - if you need FP32, you're leaving a lot on the table
- The cost of using lowvram is very high as it needs to swap parts of the model in memory. Even using medvram comes at a noticeable cost
- Best: xFormers, FP16, HyperTile, no-model-move, no-slicing/tiling
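
The BF16 versus FP16 note above can be illustrated without a GPU. The following is a minimal sketch using only the Python standard library, not code from this project: `to_fp16` and `to_bf16` are hypothetical helper names, and BF16 is emulated by truncating a float32 to its top 16 bits (real hardware typically rounds to nearest instead). It shows a value beyond the FP16 maximum of 65504 overflowing to infinity, which then turns into NaN in later arithmetic, while the BF16 emulation keeps the value at reduced precision.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision; return +/-inf on overflow."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values outside the FP16 range,
        # which is where a tensor library would produce inf instead
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of a float32.

    Assumption: plain truncation, used here for simplicity.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


value = 70000.0  # larger than the FP16 maximum of 65504
fp16_value = to_fp16(value)
bf16_value = to_bf16(value)

print("FP16:", fp16_value)                       # inf
print("BF16:", bf16_value)                       # 69632.0
print("FP16 inf - inf:", fp16_value - fp16_value)  # nan
```

The BF16 result is less precise than the input (69632 instead of 70000), which matches the trade-off described above: BF16 gives up mantissa bits in exchange for the exponent range of FP32.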
Compile type | Performance (it/s) | Overhead (s) |
---|---|---|
cudnn/default | 73.5 | 4 |
inductor/default | 89.0 | 40 |
inductor/reduce-overhead | 92.0 | 40 |
inductor/max-autotune | 91.0 | 220 |
nvfuser/default | 84.0 | 5 |
cudagraphs/reduce-overhead | 85.0 | 14 |
stable-fast/sdp | 96.0 | 76 |
stable-fast/xformers | 96.0 | 101 |
stable-fast/full-graph | 94.0 | 96 |
- Performance numbers are in it/s and higher is better
- Overhead is the time in seconds needed to optimize a model with specific params and lower is better
  Model needs a compile on the initial generate, but it may also need a recompile if params such as resolution or batch size change
- Model compile may not be compatible with any method that modifies the underlying model, including loading LoRA weights on top of a model
- stable-fast compile backend requires that the package is manually installed on the system
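
To weigh compile overhead against the gain in it/s, it can help to estimate how many iterations are needed before a compiled model has made up for its longer optimization time. The sketch below is a rough illustration under two assumptions that are not part of the benchmark: the overhead is paid exactly once (no recompiles), and the it/s rate stays constant. `break_even_iterations` is a hypothetical helper name; the numbers are copied from the compile table above, with cudnn/default as the baseline.

```python
def break_even_iterations(base_rate, base_overhead, new_rate, new_overhead):
    """Iterations after which the faster setup has caught up in total time.

    Solves: base_overhead + n / base_rate == new_overhead + n / new_rate
    """
    if new_rate <= base_rate:
        raise ValueError("new setup is not faster, so it never breaks even")
    extra_overhead = new_overhead - base_overhead
    return extra_overhead * base_rate * new_rate / (new_rate - base_rate)


# (it/s, overhead in seconds) taken from the compile table
BASELINE = (73.5, 4)  # cudnn/default
COMPILED = {
    "inductor/default": (89.0, 40),
    "inductor/reduce-overhead": (92.0, 40),
    "inductor/max-autotune": (91.0, 220),
    "stable-fast/sdp": (96.0, 76),
}

for name, (rate, overhead) in COMPILED.items():
    n = break_even_iterations(BASELINE[0], BASELINE[1], rate, overhead)
    print(f"{name}: break-even after ~{n:,.0f} iterations")
```

Under these assumptions inductor/default needs roughly 15,000 iterations to pay back its extra 36 seconds, so compilation mainly pays off for long sessions with unchanged resolution and batch size.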
- Hardware: Intel ARC 770 LE 16GB with R7 5800X3D & MSI B350M Mortar
- Packages: Torch 2.1.0a0+cxx11.abi with IPEX 2.1.10+xpu and MKL / DPCPP 2024.0.0
- Params: model=SD15 | batch-size=1 | batch-count=1 | steps=40 | resolution=512px | sampler=Euler a | CFG 6
Precision | Params | Diffusers (it/s) | Original (it/s) |
---|---|---|---|
BF16 | Default | 8.54 | 7.75 |
FP16 | Default | 6.92 | 7.23 |
FP32 | Default | 3.73 | 3.74 |
BF16 | HyperTile (256) | 10.03 | 9.32 |
BF16 | ToMe (0.5) | 9.24 | 8.61 |
BF16 | No IPEX Optimize | 8.23 | 7.82 |
BF16 | Model no-move (medvram) | 9.04 | |
BF16 | VAE no-slicing, no-tiling | 8.67 | |
BF16 | Sequential offload (lowvram) | 1.60 | 0.67 |
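
To make the effect of each option easier to compare, the it/s values can be expressed as a percentage change against the BF16 default. This is a small sketch, not part of the benchmark itself: `relative_change` is a hypothetical helper, and only the rows that report a value for both backends are used, taking the first (Diffusers) value of each.

```python
def relative_change(value, baseline):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


# Diffusers it/s for BF16 rows of the Intel ARC table
BASELINE_BF16 = 8.54  # BF16 | Default
OPTIONS = {
    "HyperTile (256)": 10.03,
    "ToMe (0.5)": 9.24,
    "No IPEX Optimize": 8.23,
    "Sequential offload (lowvram)": 1.60,
}

CHANGES = {name: relative_change(v, BASELINE_BF16) for name, v in OPTIONS.items()}

for name, change in CHANGES.items():
    print(f"{name}: {change:+.1f}%")
```

On these numbers HyperTile gives the largest gain (about +17%), while sequential offload costs roughly 81% of the throughput.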