perf: benchmarking our models against Jax (Flax) #1000

Open
wants to merge 8 commits into
base: main

avik-pal (Member) commented Nov 2, 2024

github-actions bot (Contributor) commented Nov 2, 2024

Benchmark Results (ASV)

| Benchmark | main | f041d46... | main/f041d460598b26... |
|---|---|---|---|
| basics/overhead | 0.122 ± 0.00091 μs | 0.132 ± 0.0043 μs | 0.921 |
| time_to_load | 0.951 ± 0.012 s | 0.978 ± 0.0092 s | 0.972 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).
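
The last column of the ASV results appears to be the `main` timing divided by the PR timing, as its header suggests. The following is a minimal Python sketch of that reading, assuming every cell follows the `mean ± spread unit` pattern shown in the results. `parse_cell` and `ratio` are hypothetical helper names used for illustration only; they are not part of ASV or of the benchmark workflow.

```python
import re

# Hypothetical pattern for cells such as "0.122 ± 0.00091 μs".
CELL = re.compile(r"([\d.]+)\s*±\s*([\d.]+)\s*(\S+)")


def parse_cell(cell):
    """Return (mean, spread, unit) for one 'mean ± spread unit' cell."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised cell: {cell!r}")
    mean, spread, unit = match.groups()
    return float(mean), float(spread), unit


def ratio(main_cell, pr_cell):
    """Ratio of the main timing to the PR timing; both cells must share a unit."""
    main_mean, _, main_unit = parse_cell(main_cell)
    pr_mean, _, pr_unit = parse_cell(pr_cell)
    if main_unit != pr_unit:
        raise ValueError("cells use different units")
    return main_mean / pr_mean


# Values copied from the two rows reported above.
rows = {
    "basics/overhead": ("0.122 ± 0.00091 μs", "0.132 ± 0.0043 μs"),
    "time_to_load": ("0.951 ± 0.012 s", "0.978 ± 0.0092 s"),
}
for name, (main_cell, pr_cell) in rows.items():
    print(f"{name}: {ratio(main_cell, pr_cell):.3f}")
```

Recomputing from the rounded values printed above gives 0.924 and 0.972. The reported 0.921 for `basics/overhead` presumably comes from the unrounded timings, so small differences in the third decimal place are expected.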

github-actions bot (Contributor) left a comment

Lux Benchmarks

Benchmark suite Current: f041d46 Previous: 409eda2 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4333 ns 4334 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4042 ns 4125 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5416 ns 5417 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4334 ns 4167 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 61224 ns 59978 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10584 ns 10333 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10916 ns 10167 ns 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11125 ns 10500 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10750 ns 10167 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 426403 ns 416390 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1208 ns 1166.5 ns 1.04
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3042 ns 3042 ns 1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1250 ns 1208 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1000 ns 1000 ns 1
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18376 ns 18063 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4042 ns 4084 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4083 ns 3958 ns 1.03
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4333 ns 4250 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4062.5 ns 4125 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 110978.5 ns 109325.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57479.5 ns 56041 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46084 ns 46084 ns 1
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46833 ns 46375 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82541 ns 81834 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37554.5 ns 36229 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2024292 ns 2056625 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2082458.5 ns 2082416.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2089083 ns 2056666.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2009166 ns 1995458 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 197120.5 ns 192802 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 146917 ns 172458 ns 0.85
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 145958 ns 144854.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 144958.5 ns 148125 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 148500 ns 146125 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166245.5 ns 166789 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1116625 ns 1157666 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1115271 ns 1110395.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1124042 ns 1128416.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1137562.5 ns 1120208 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 527410 ns 516061 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3750 ns 3583 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3354.5 ns 3583.5 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4541 ns 4229.5 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3917 ns 3292 ns 1.19
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 70257 ns 69748 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9542 ns 8792 ns 1.09
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9750 ns 9125 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9375 ns 9000 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8959 ns 9209 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 480558 ns 470533 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17167 ns 15083 ns 1.14
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16958 ns 14875 ns 1.14
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18041.5 ns 16583 ns 1.09
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15417 ns 14917 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55098 ns 53475 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219500 ns 222375 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213417 ns 213084 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212958 ns 213250 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220083.5 ns 213520.5 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 274263 ns 267675 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 500 ns 1.25
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 542 ns 1.15
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 750 ns 584 ns 1.28
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 583 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17758 ns 17384 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1458 ns 1500 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1417 ns 1500 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1458 ns 1750 ns 0.83
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1750 ns 1583 ns 1.11
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 103393 ns 103376 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7041 ns 7041 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5958 ns 5625 ns 1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5917 ns 5709 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10167 ns 9916 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23878 ns 23093 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 233625 ns 227583.5 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 240250 ns 230417 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229083 ns 228000 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221791 ns 215542 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 170258.5 ns 166208.5 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3833 ns 3916 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3833 ns 3834 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3834 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 24141 ns 23533 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16667 ns 16708 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 17000 ns 16750 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16667 ns 16791 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16792 ns 16625 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 164532.5 ns 160718 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 579916 ns 577333 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 572084 ns 573417 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 580875 ns 579000 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 575500 ns 574042 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 114358 ns 113474 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1423042 ns 1432312.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1416167 ns 1426250 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1424833 ns 1425917 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1428583 ns 1418000 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 214116.5 ns 211622 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1079500 ns 1046541 ns 1.03
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 959437.5 ns 965500 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1349375 ns 1347458 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1290312.5 ns 1290542 ns 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA 277710 ns 267857 ns 1.04
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5926792 ns 5895833.5 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4595334 ns 4588042 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4957000 ns 4928187 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5554125.5 ns 5737167 ns 0.97
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1102113.5 ns 1066176 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 541 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23990 ns 23460 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2167 ns 2084 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2292 ns 0.95
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 172418 ns 169490.5 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6541 ns 5458 ns 1.20
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6459 ns 4000 ns 1.61
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5834 ns 5687.5 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4333 ns 6250 ns 0.69
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 65787 ns 64594 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11542 ns 11083 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11917 ns 11333 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11791 ns 12041 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11792 ns 11083.5 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 451007 ns 444224 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8062.5 ns 6708 ns 1.20
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7709 ns 6416 ns 1.20
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8042 ns 7875 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7041 ns 6500 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 52207 ns 51136 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 18583 ns 17583 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17208 ns 16958 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17417 ns 18145.5 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17666 ns 16916 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 303772 ns 297812 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 541 ns 583 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32740 ns 31896 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9229.5 ns 8916 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8792 ns 8667 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 8834 ns 9250 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8833 ns 8645.5 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 160219.5 ns 155805 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64500 ns 64937.5 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64583 ns 62625 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64708 ns 64500 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64500 ns 64667 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112862 ns 110478.5 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 280333.5 ns 294791 ns 0.95
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 282917 ns 279125 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 273667 ns 275479.5 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 295167 ns 280854.5 ns 1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 189364.5 ns 185224.5 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3317583.5 ns 3152041.5 ns 1.05
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3019208.5 ns 3026187 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3016375 ns 3022520.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4051917 ns 3964167 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 573433 ns 573818.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7633708.5 ns 7551166.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7443875 ns 7449979 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7454416 ns 7447000 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8302917 ns 8208396 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1360917.5 ns 1327975 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18805792 ns 18867458 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19114625 ns 19142541 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19126000 ns 19088834 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15867542 ns 15711167 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23869791.5 ns 24315583.5 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33627063 ns 33983500 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37312875 ns 37046583.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35492042 ns 34841833 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1852173 ns 2130242 ns 0.87
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 189567292 ns 192387270.5 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 164695625 ns 163943875 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 152795917 ns 152577625 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 448899584 ns 437847333 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13897300 ns 14119852 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 288986708.5 ns 294725229.5 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 262891542 ns 338344395.5 ns 0.78
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 299964833 ns 300590083.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 400223041.5 ns 396800708.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24354.5 ns 23687.5 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24667 ns 23083 ns 1.07
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25291 ns 24791 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22500 ns 23708 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 95040 ns 95862 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103104.5 ns 103250 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104541 ns 103458 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104291.5 ns 103667 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103083 ns 102750 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 493349.5 ns 494978 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6937.5 ns 7083 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7042 ns 5750 ns 1.22
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7459 ns 6875 ns 1.08
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6958.5 ns 7000 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 66551.5 ns 67128 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15416 ns 15375 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15979.5 ns 15395.5 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15917 ns 16000 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15459 ns 14791.5 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 466767.5 ns 467877 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 2917521 ns 3009166.5 ns 0.97
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2054000 ns 2067250 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2294625 ns 2279667 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4822250 ns 4832667 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 580968 ns 581800.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23583437.5 ns 23921708.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18092083 ns 18037292 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16977875 ns 16963187.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36087625 ns 34623770.5 ns 1.04
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3231520 ns 3105602 ns 1.04
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33370542 ns 33780291 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27613625.5 ns 27715666.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27378771 ns 27451041 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42452646 ns 41640208 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 73875 ns 80479 ns 0.92
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76083 ns 72416 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 84042 ns 78354 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73208 ns 74645.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 102315.5 ns 100885 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 297250 ns 311542 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 313813 ns 224520.5 ns 1.40
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 307833 ns 209667 ns 1.47
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 288708 ns 257021 ns 1.12
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 539612.5 ns 539235 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12708 ns 12500 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12625 ns 11708 ns 1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13500 ns 12542 ns 1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12000 ns 12833.5 ns 0.94
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 69202.5 ns 70648 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27083 ns 26667 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27270.5 ns 26958.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27416 ns 27333.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27417 ns 26625 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 467808.5 ns 470896 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 13042 ns 12791 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 13208 ns 12333 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13375 ns 13500 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 13042 ns 12875 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 51727.5 ns 52214 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26312 ns 25959 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26000 ns 25750 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27438 ns 26500 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26125 ns 26500 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 298006 ns 300818.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 179854.5 ns 180750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 183021 ns 179583 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183834 ns 183146 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 180208 ns 179250 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55691.5 ns 56380 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 584666.5 ns 593542 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 583687.5 ns 582459 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 594563 ns 585042 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 592166 ns 594562 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 281460.5 ns 284588 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6958 ns 6770.5 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6875 ns 5958 ns 1.15
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7375 ns 7084 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6083 ns 7125 ns 0.85
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 68775 ns 70103 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14500 ns 14709 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14500 ns 14500 ns 1
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15292 ns 15291.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14833 ns 13958 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 454520 ns 460969.5 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1192833.5 ns 1217750 ns 0.98
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1266334 ns 1209125 ns 1.05
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1252229.5 ns 1249750 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1308521 ns 1326625 ns 0.99
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301869 ns 302841 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4121500 ns 4351270.5 ns 0.95
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4378666.5 ns 4353042 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4521604 ns 4630333 ns 0.98
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4629374.5 ns 4466479 ns 1.04
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1047678 ns 1039570 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1792 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1916 ns 1833 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23265 ns 23644 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4917 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4917 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5000 ns 5042 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4917 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 186658 ns 189061.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6625 ns 6021 ns 1.10
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6875 ns 5708 ns 1.20
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7521 ns 7042 ns 1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6667 ns 7416 ns 0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 54037.5 ns 54998.5 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11666 ns 11437.5 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11458 ns 11084 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11042 ns 11666 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11792 ns 12333 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 325516.5 ns 332242 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22987.5 ns 22998 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2667 ns 2667 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3000 ns 2750 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2709 ns 2750 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2917 ns 2709 ns 1.08
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 156969 ns 158762.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 13708 ns 13687.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 14083 ns 11208 ns 1.26
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 14000 ns 13958 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 12125 ns 14125 ns 0.86
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 55221 ns 57325 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25000 ns 24625 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25167 ns 24250 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25500 ns 25500 ns 1
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25167 ns 24875 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 287940 ns 295945 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4166 ns 4167 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4167 ns 4166 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4166 ns 4167 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4166 ns 4125 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24849 ns 24912 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16250 ns 16084 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16291 ns 16209 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16375 ns 16333.5 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 15958 ns 16208 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 193777.5 ns 199034.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5708 ns 5708 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5667 ns 5584 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5792 ns 5708 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5667 ns 5708 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 32729 ns 33099 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20916 ns 21166 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20875 ns 20458 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 20791 ns 21333.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21333 ns 20875 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 172416.5 ns 174613 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 397062.5 ns 383042 ns 1.04
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 374667 ns 373541 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 485708 ns 485896 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 509458 ns 532854.5 ns 0.96
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66848 ns 66578.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 994333 ns 938166 ns 1.06
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 892041 ns 847083 ns 1.05
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1239979 ns 1235042 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1415854 ns 1418833 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 189858 ns 191164 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80604 ns 81020.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 83208 ns 80354.5 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82875 ns 82250 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83458.5 ns 132458 ns 0.63
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193036 ns 192525 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1914708 ns 1945166 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1920583 ns 1909584 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1917958 ns 1920333 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1922834 ns 1914354.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 390299 ns 402795 ns 0.97
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22056 ns 21790 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1834 ns 1791 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1916 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 167168.5 ns 172681 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6833 ns 8000 ns 0.85
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7459 ns 6833 ns 1.09
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7834 ns 8334 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8250 ns 7999.5 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 56962 ns 62227.5 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9292 ns 9375 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9334 ns 8875 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9708 ns 9625 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9666 ns 9250 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 298435.5 ns 315550.5 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121682687 ns 159022167 ns 0.77
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174477334 ns 174256125 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147749499.5 ns 147914021 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106196334 ns 102407958 ns 1.04
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5464130 ns 5468366 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 617964916.5 ns 678096083 ns 0.91
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 554980792 ns 555598625 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 450462938 ns 453528479 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 774770271 ns 754205958.5 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38232798 ns 34940005 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 650752208 ns 703546875 ns 0.92
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 665180625 ns 666832020.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 588795479 ns 585927312.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 746732334 ns 742692916 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57792 ns 57542 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47375 ns 47583 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47542 ns 47291 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83625 ns 82208 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37861 ns 37135 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1805209 ns 1947333 ns 0.93
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1972833.5 ns 1971042 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1982166.5 ns 1976458 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1895625 ns 1893520.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 175561.5 ns 171380.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 267979 ns 272291 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 269854 ns 265834 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 272271 ns 289417 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 269270.5 ns 267167 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 126577.5 ns 135867.5 ns 0.93
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 687417 ns 671917 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 587708 ns 596708 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 677146 ns 696292 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 621875 ns 692687.5 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 660721 ns 737698 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2217167 ns 2231188 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2243687.5 ns 2215042 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2207166.5 ns 2207229 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2210229 ns 2243770.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 134472.5 ns 133226 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5500312.5 ns 5572500 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5481541.5 ns 5486875 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5533437.5 ns 5511083 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5557958.5 ns 5495666.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 709361 ns 759202.5 ns 0.93
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 646625 ns 652833.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 640708 ns 657229 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 644875 ns 639500 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 649667 ns 639791 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47317 ns 46976 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1821292 ns 1799583 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1717167 ns 1724792 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1720208 ns 1722792 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2099917 ns 2103895.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 224949 ns 221178.5 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58167 ns 56541 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46208 ns 46833 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47125 ns 46041 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 85042 ns 83792 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28856 ns 28073 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2030667 ns 2058250 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2068666.5 ns 2078709 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2091708.5 ns 2093000 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1995624.5 ns 1996646 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 190710.5 ns 187152 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13403541 ns 13406125 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12452354.5 ns 12455458 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12584874.5 ns 12584792 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15131208.5 ns 14882959 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 515943 ns 517201.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47288708 ns 47687000 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41797750 ns 41754625 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41166292 ns 40922625 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 59000458 ns 58112708 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3214571.5 ns 3212087 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74057125 ns 74213479 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91281417 ns 68010000 ns 1.34
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90658500 ns 90988625 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 99567250 ns 76809750 ns 1.30
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58917 ns 56917 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46917 ns 47042 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47292 ns 47041 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84583 ns 83375 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47001.5 ns 46301 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1909417 ns 1939854 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1970125 ns 1973333 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1977500 ns 1974729.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1893666.5 ns 1884375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 193241.5 ns 189579 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 334 ns 291 ns 1.15
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32060 ns 31617 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6958 ns 6229.5 ns 1.12
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6292 ns 6167 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6500 ns 6458 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6500 ns 6167 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 167892 ns 171396 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 31793 ns 31328 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2750 ns 2583 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 2625 ns 1.10
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2625 ns 2792 ns 0.94
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2792 ns 2625 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 158050.5 ns 161410 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 286392208.5 ns 324182500 ns 0.88
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339629083 ns 339536042 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 314686979 ns 314625854 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 270977208 ns 273060250 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7039044 ns 7093070 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 987040583 ns 1051455583 ns 0.94
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 937877000 ns 941830875 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 860491458.5 ns 858538271 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1177023417 ns 1153691292 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33913611 ns 34020243.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1312562792 ns 1359481562.5 ns 0.97
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1685161292 ns 1360673729 ns 1.24
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1644146958 ns 1640965792 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1659747291 ns 1309802292 ns 1.27
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1406521 ns 1414416.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1407084 ns 1409541 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1411250 ns 1408500 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1418562.5 ns 1453875 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 128151.5 ns 127358 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5012500 ns 5056229 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5022521 ns 5013583 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5040375 ns 4954291 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5037729 ns 5017021 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 509438.5 ns 601067 ns 0.85
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 168947750 ns 170719208 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 130945895.5 ns 132607979.5 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 131658333 ns 124493437.5 ns 1.06
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 157330500 ns 162230500 ns 0.97
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4922111 ns 4886055.5 ns 1.01
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 672521958 ns 854987208 ns 0.79
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 643855750 ns 644456708 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 500540709 ns 532057834 ns 0.94
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 857337417 ns 687805708 ns 1.25
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16232508 ns 16138006 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8936646 ns 9114041.5 ns 0.98
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8743917 ns 8770313 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7871374.5 ns 7860292 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10349042 ns 10147292 ns 1.02
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1606235 ns 1612586 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36162708 ns 37546375 ns 0.96
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 36945042 ns 36886146 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33490646 ns 33451021 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 39951625 ns 38875771 ns 1.03
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 8915488 ns 6459090.5 ns 1.38
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47500 ns 47458.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47459 ns 49333 ns 0.96
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47667 ns 49583 ns 0.96
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47416 ns 47250 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18767 ns 18585 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50458 ns 50584 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50542 ns 50416 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50750 ns 50708.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50500 ns 50500 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 162985 ns 216293 ns 0.75
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7708 ns 7979.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8042 ns 6791 ns 1.18
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8333 ns 8875 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7750 ns 8583 ns 0.90
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 75765 ns 106035 ns 0.71
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9958 ns 10333 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10437.5 ns 9958 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10791 ns 10500 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10375 ns 10167 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 456610.5 ns 612658 ns 0.75
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8209 ns 8750 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8667 ns 6438 ns 1.35
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8875 ns 8667 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6750 ns 5875 ns 1.15
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 86529.5 ns 119844.5 ns 0.72
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13166 ns 13375 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13458.5 ns 13000 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13542 ns 13416 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13333 ns 12791 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 414607 ns 517417.5 ns 0.80
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1042 ns 958 ns 1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32103 ns 31817 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8250 ns 8041 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8250 ns 7750 ns 1.06
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 8333 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8167 ns 8292 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 187429 ns 203048 ns 0.92
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23334 ns 23145.5 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23042 ns 24541 ns 0.94
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23750 ns 24167 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23292 ns 23334 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18388 ns 18371 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52458 ns 52542 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 53000 ns 52416 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 54667 ns 52500 ns 1.04
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52687.5 ns 52334 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 232789 ns 295739.5 ns 0.79
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1399583 ns 1440625 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1403500 ns 1400291 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1404584 ns 1400875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1409833 ns 1406313 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195872 ns 194620 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4993083 ns 5047479.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4929125 ns 5003458.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5032375 ns 4836292 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5036542 ns 4996708 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 540199 ns 628014 ns 0.86
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3042896 ns 3062438 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2068771 ns 2084417 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2296625 ns 2227208.5 ns 1.03
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4859000 ns 4812250 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 582789 ns 579246 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24291292 ns 24741125 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18912083.5 ns 18811521 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18946541.5 ns 18691437 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37157750 ns 36587416 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3180435 ns 3196070 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34008000 ns 34435312 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28284166.5 ns 28306583.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28170875 ns 28069750 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42294291.5 ns 41958375 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144251875 ns 145325041 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 142297416 ns 141848041.5 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 125012271 ns 123758375 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 174017792 ns 173196604 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22792818 ns 22560824 ns 1.01
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1719710313 ns 942531917 ns 1.82
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1131712375 ns 871530625 ns 1.30
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 737685000 ns 1498315250 ns 0.49
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 687694667 ns 674150833 ns 1.02
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118898101 ns 118289465 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76084 ns 76208 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74354.5 ns 75041 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 78812.5 ns 77875 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 78083 ns 75417 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 190339.5 ns 273038.5 ns 0.70
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 290500 ns 299708 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 288500 ns 284646 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 207354 ns 191687.5 ns 1.08
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 292084 ns 202979.5 ns 1.44
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1065883 ns 1439967 ns 0.74
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35467167 ns 36345458 ns 0.98
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35107250 ns 35416645.5 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32188375 ns 32239562.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 41524750 ns 40930312.5 ns 1.01
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5849338 ns 5849412 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 146402291 ns 151966416 ns 0.96
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 151223500 ns 152232437.5 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 134376625 ns 136165208.5 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 227426875 ns 287396625 ns 0.79
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34896547 ns 34914778 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121785604 ns 158627833 ns 0.77
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173641625 ns 174511667 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147809020.5 ns 148215771.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 104816458 ns 108212479 ns 0.97
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5458335 ns 5459784 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 470304750 ns 524328229.5 ns 0.90
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 466493959 ns 467038291 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 437765479 ns 441190000 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 760795396 ns 741818542 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35146337 ns 32279915 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 713005604 ns 692549750 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 653623645.5 ns 656203708.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 573970979.5 ns 573625208 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 865330958 ns 853537834 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1307604 ns 1226937.5 ns 1.07
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 972645.5 ns 992979 ns 0.98
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 954916.5 ns 904625 ns 1.06
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2066083 ns 2085917 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 571599 ns 566912.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2322750 ns 2909667 ns 0.80
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2616416 ns 2628208 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2622375 ns 2006333.5 ns 1.31
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3780709 ns 3693750.5 ns 1.02
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1633470.5 ns 1796011.5 ns 0.91
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6650459 ns 6757875 ns 0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6509729 ns 6503250 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6513417 ns 6239125 ns 1.04
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4521042 ns 4454771 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7250 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5958 ns 6167 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6167 ns 6208 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10291 ns 10250 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25452.5 ns 24809.5 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212833.5 ns 213666 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220708 ns 220313 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220916 ns 220125 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206042 ns 209542 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 235444 ns 276995.5 ns 0.85
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 302269625 ns 315354292 ns 0.96
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 221254250 ns 221860750 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 198336250 ns 197740833.5 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 309762604 ns 312004542 ns 0.99
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7684423 ns 7676221 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1084217792 ns 1085627020.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 893335249.5 ns 891084375.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 869473667 ns 865730125 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1182096250 ns 1163266979.5 ns 1.02
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26585457 ns 26544800.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5584 ns 6083 ns 0.92
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6959 ns 5583 ns 1.25
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 8792 ns 7375 ns 1.19
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6541 ns 5270.5 ns 1.24
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 141680.5 ns 178949 ns 0.79
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 7708 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 7292 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7375 ns 7500 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7958 ns 6792 ns 1.17
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 523009 ns 667282.5 ns 0.78
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 541 ns 542 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 542 ns 459 ns 1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 542 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24058 ns 23245 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9375 ns 9583.5 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9604 ns 9167 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9833 ns 9458.5 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9729 ns 8792 ns 1.11
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 191258 ns 227149 ns 0.84
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 353291 ns 352521.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351041 ns 352709 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 351479.5 ns 352958.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 351542 ns 352708 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21523 ns 21007 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 775875 ns 828104 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 776000 ns 820292 ns 0.95
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 773667 ns 773500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 831708 ns 828312 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 254955.5 ns 289596 ns 0.88
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 332666 ns 312083.5 ns 1.07
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 343000 ns 340166.5 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 451416 ns 445354 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 311792 ns 333520.5 ns 0.93
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18160 ns 17918 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 689103.5 ns 691583 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 738792 ns 732334 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1027062.5 ns 1026459 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 692499.5 ns 691042 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 225496 ns 273557 ns 0.82
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 348333 ns 332396 ns 1.05
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 356062.5 ns 348875 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 408667 ns 409541 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 349542 ns 375250 ns 0.93
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22957 ns 22378 ns 1.03
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 753250 ns 755875 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 751666.5 ns 743000 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1074208 ns 1068417 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 817958 ns 822124.5 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 215397 ns 239682 ns 0.90
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3500 ns 3625 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3625 ns 3417 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3750 ns 3583 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3542 ns 3583 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 18313 ns 17823 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4167 ns 4208 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4416 ns 4167 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4333 ns 4375 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4208 ns 4292 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 220540.5 ns 271995 ns 0.81
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4416 ns 4792 ns 0.92
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3833 ns 3834 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5229.5 ns 5250 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3625 ns 3625 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 166773 ns 214003.5 ns 0.78
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 8354.5 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8667 ns 8334 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8500 ns 8667 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8625 ns 8417 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1013865.5 ns 1200425 ns 0.84
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204458 ns 204209 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 211917 ns 210000 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 212625 ns 211875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200542 ns 199417 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34728 ns 34086 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 649167 ns 608520.5 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 625833 ns 620750 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 620583 ns 620416 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630375 ns 628625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 287189.5 ns 347622 ns 0.83
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 972479.5 ns 980000 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 935500 ns 929916.5 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 952291.5 ns 954250 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1300521 ns 1278542 ns 1.02
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207194 ns 206777 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4514354.5 ns 4651729 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4465500 ns 4500083 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4301791.5 ns 4296645.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6497562.5 ns 6216979.5 ns 1.05
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 973204.5 ns 942518 ns 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4916.5 ns 3916 ns 1.26
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3292 ns 3375 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5167 ns 4667 ns 1.11
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4166 ns 3354.5 ns 1.24
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 167779.5 ns 231395.5 ns 0.73
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7875 ns 7375 ns 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7562.5 ns 7292 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7792 ns 7667 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7416 ns 7000 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 854999 ns 1002762 ns 0.85
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1611729.5 ns 1644583 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1152542 ns 1174458 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1367062.5 ns 1323125 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2452792 ns 2461333.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 216363 ns 213304.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12329750 ns 12444729.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9540958 ns 9564709 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9263625 ns 9234833 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18057917 ns 18020417 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1943297 ns 1940786 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17406250 ns 17431792 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14327042 ns 14392958.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14326417 ns 14240000 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21194250 ns 21049562.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 89187.5 ns 90625 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 86854 ns 88041 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 91709 ns 92333 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 91270.5 ns 136917 ns 0.67
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126235 ns 125618 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2037042 ns 2061125 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1748458 ns 2018458 ns 0.87
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1941042 ns 1720042 ns 1.13
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2051500 ns 2024104 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 855504 ns 1024038 ns 0.84
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 343916.5 ns 331312 ns 1.04
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 346500 ns 343500 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 394208.5 ns 395083 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 290375 ns 310458.5 ns 0.94
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15675 ns 15733 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 700250 ns 699959 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 731792 ns 722062.5 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1023542 ns 1018209 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 649292 ns 646375 ns 1.00
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 154345 ns 189475.5 ns 0.81
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7167 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5875 ns 5958 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 5875 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10250 ns 10000 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33224 ns 33239 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215125.5 ns 221625 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220459 ns 219959 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220917 ns 219750 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 240125 ns 218375 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 265024.5 ns 314279 ns 0.84
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3750 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3667 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22498 ns 22722 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14417 ns 14167 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14417 ns 14334 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14375 ns 14291 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14292 ns 14375 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 375019.5 ns 475447 ns 0.79
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 92042 ns 95166.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 93291 ns 91833 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 96333 ns 96125 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 94604.5 ns 139167 ns 0.68
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125576 ns 125450 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1923750 ns 1948250 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1650625 ns 1921104.5 ns 0.86
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1929083 ns 1669729.5 ns 1.16
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1934583 ns 1920708.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 790898.5 ns 954893.5 ns 0.83
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 870500 ns 854375 ns 1.02
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 827875 ns 817542 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1213792 ns 1213833.5 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 935542 ns 958895.5 ns 0.98
lenet(28, 28, 1, 32)/forward/GPU/CUDA 272131.5 ns 276078 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2796417 ns 2843334 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2438417 ns 2456145.5 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3334749.5 ns 3332000 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3398250 ns 3419792 ns 0.99
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1412395 ns 1629171 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16937 ns 15333 ns 1.10
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17187.5 ns 14709 ns 1.17
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19208.5 ns 17041 ns 1.13
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16500 ns 14333 ns 1.15
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 112153 ns 142609.5 ns 0.79
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 262584 ns 262125 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 217708 ns 215416.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216167 ns 215250 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 257875 ns 221958 ns 1.16
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 531938.5 ns 641081.5 ns 0.83
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222209 ns 221583.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 221417 ns 218625 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222958 ns 222833 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 221625 ns 221750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 211097.5 ns 271537.5 ns 0.78
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 499479 ns 497750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 498250 ns 494833 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 497500 ns 497084 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 506791.5 ns 509000 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1218166.5 ns 1365399 ns 0.89
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 330729 ns 315729 ns 1.05
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 336917 ns 333917 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 363625 ns 375125 ns 0.97
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 303145.5 ns 322083 ns 0.94
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16466 ns 16846 ns 0.98
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 710584 ns 710041 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 729625 ns 725063 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1023687.5 ns 1022417 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 661708 ns 663021 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 188015 ns 196884 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21250 ns 17625 ns 1.21
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17625 ns 16708 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20125 ns 18792 ns 1.07
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18521 ns 17625 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 143789 ns 144721 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213541 ns 220104.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218604 ns 212792 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213062.5 ns 212750 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 239750 ns 217250 ns 1.10
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 905449 ns 955774 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6833 ns 6042 ns 1.13
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6375 ns 4250 ns 1.50
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6229 ns 6958 ns 0.90
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6917 ns 6541 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 186975 ns 245177 ns 0.76
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10625 ns 10583.5 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10667 ns 10250 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11500 ns 10708 ns 1.07
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10979 ns 10084 ns 1.09
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1010754 ns 1099715 ns 0.92
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3791.5 ns 4542 ns 0.83
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3208 ns 3208 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4895.5 ns 4834 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4354 ns 2875 ns 1.51
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 207943 ns 250616.5 ns 0.83
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7750 ns 7125 ns 1.09
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 7375 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8208 ns 7750 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7792 ns 7375 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1022804 ns 1110249 ns 0.92
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23591167 ns 24293729.5 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34749750 ns 34647499.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37965979 ns 38065167 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35331667 ns 34799687.5 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1845478 ns 1834951 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184329000 ns 187799375 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159275792 ns 159175458 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146325020.5 ns 146555271 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 421919833.5 ns 415008291 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16511002 ns 16504056.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 426977958 ns 437855250 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 253895666.5 ns 254443000 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 232491541.5 ns 231693624.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 496597729.5 ns 485497958 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 182708 ns 184229.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 183500 ns 181916 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 186750 ns 184084 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 183625.5 ns 182167 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 180986.5 ns 230730 ns 0.78
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 599646 ns 637084 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 599167 ns 586270.5 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 586312.5 ns 586583 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 631583 ns 631542 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1006207 ns 1097701 ns 0.92
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3875729 ns 3894562.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3691625.5 ns 3827292 ns 0.96
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3494000 ns 3469958 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5492270.5 ns 5353020.5 ns 1.03
batchedmm(128, Bsize=512)/forward/GPU/CUDA 537118 ns 535365 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17349416 ns 18146250 ns 0.96
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17179917 ns 17166041.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16549228.5 ns 16601417 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 23178583 ns 22202083 ns 1.04
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2620708.5 ns 2616593 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 458 ns 1.18
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 31700 ns 32123 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9584 ns 9458 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9417 ns 8667 ns 1.09
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9458 ns 9167 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9937.5 ns 9208 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 257065.5 ns 267754 ns 0.96
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 495483250 ns 580762562.5 ns 0.85
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 429872875 ns 427173312.5 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 433975833 ns 376948624.5 ns 1.15
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 676305021 ns 671986666.5 ns 1.01
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12479376 ns 12479261 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2043995812.5 ns 2061821458.5 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1631415458 ns 1626836125 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1493405541.5 ns 1500724875 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2222310229 ns 2217147562.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49049243.5 ns 48947892 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1658542 ns 1651250 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1177291 ns 1196959 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1381521 ns 1346187.5 ns 1.03
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2411375 ns 2356042 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215956.5 ns 218070 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12702500 ns 12822417 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9925417 ns 9953541.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9668750 ns 9605000 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18553500 ns 18408062.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2022588 ns 2047696.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17684438 ns 17771104.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14677583 ns 14762729 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14539333.5 ns 14473917 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21490417 ns 21336042 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26208 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26208 ns 26209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26208 ns 26583 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24528 ns 24922 ns 0.98
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66750 ns 66792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67084 ns 67000 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67750 ns 66791 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66875 ns 66916 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 381706 ns 410676.5 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204209 ns 203542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 208750 ns 210583 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209667 ns 210500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199916 ns 199958 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26151 ns 26405 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 649958 ns 602333 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 624646 ns 621292 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 666084 ns 621250 ns 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 587687.5 ns 630584 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 308606 ns 355627 ns 0.87
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 650875 ns 657646 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 634875 ns 638729 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 648000 ns 544125 ns 1.19
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 654646 ns 677396 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131873 ns 132242 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2262708.5 ns 2305542 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1990917 ns 2254292 ns 0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2254312.5 ns 1426250 ns 1.58
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2295625 ns 2248542 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1112430.5 ns 1182706 ns 0.94
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18688 ns 17937.5 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18063 ns 17042 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22542 ns 19500 ns 1.16
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19375 ns 16895.5 ns 1.15
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 143353.5 ns 144900 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 261250 ns 220000 ns 1.19
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219104.5 ns 218416.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 263062.5 ns 219458 ns 1.20
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 231875 ns 261708 ns 0.89
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 943050 ns 1051792 ns 0.90
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 542 ns 459 ns 1.18
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 459 ns 1.27
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 583 ns 458 ns 1.27
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23123 ns 23475 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10125 ns 9520.5 ns 1.06
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10292 ns 9541 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9750 ns 10166 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10250 ns 9375 ns 1.09
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 253508 ns 261505 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6292 ns 6542 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7229.5 ns 5292 ns 1.37
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6791 ns 6625 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6334 ns 7416 ns 0.85
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 187408.5 ns 235631 ns 0.80
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7417 ns 7000 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7583 ns 7291 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7000 ns 7250 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7500 ns 7208 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 734791.5 ns 803793 ns 0.91
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2333 ns 2334 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2042 ns 2041 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2500 ns 2292 ns 1.09
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2291.5 ns 2333 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 17938 ns 18245.5 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6625 ns 6750 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6750 ns 6459 ns 1.05
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6583 ns 6667 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6750 ns 6625 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 306716.5 ns 333087.5 ns 0.92
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 749833 ns 748458 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746916 ns 746645.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 748791 ns 746833 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 749417 ns 749417 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21741 ns 21817 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 777958 ns 789125.5 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 795500 ns 772625 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 799667 ns 775145.5 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 784479 ns 787875 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 294410 ns 298327 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7416 ns 7291 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5916 ns 5959 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6125 ns 5750 ns 1.07
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10458 ns 10792 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32276 ns 32858 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 262167 ns 221541 ns 1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230083 ns 226958 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 248167 ns 226625 ns 1.10
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226459 ns 220292 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 314403 ns 360131.5 ns 0.87
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 13000 ns 10250 ns 1.27
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12291 ns 9917 ns 1.24
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12500 ns 12459 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10875 ns 10583.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 198124.5 ns 243730.5 ns 0.81
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25083.5 ns 24834 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25166 ns 24833.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25250 ns 24750 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25104.5 ns 24666 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1042743 ns 1133764 ns 0.92
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106137000 ns 107061375 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117694792 ns 116928479.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120933167 ns 121136000 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117852479.5 ns 117635875 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2636050.5 ns 2659433 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 390604709 ns 396814083.5 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 366692333 ns 366591458 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 424499104 ns 425794499.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 486163875 ns 482285959 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15155612.5 ns 15258375 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 762084521 ns 769963270.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 754403541 ns 576371708 ns 1.31
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 747877125 ns 745582312 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 959384083 ns 765495854.5 ns 1.25
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8208 ns 7333 ns 1.12
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7667 ns 6334 ns 1.21
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7583 ns 7750 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7625 ns 8333 ns 0.92
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 226253.5 ns 237972 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13958 ns 14125 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14000 ns 13209 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14542 ns 13417 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14500 ns 13459 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1064461 ns 1080162 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8791.5 ns 7667 ns 1.15
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8333 ns 5583 ns 1.49
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7541 ns 8167 ns 0.92
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 7250 ns 8291 ns 0.87
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 236495 ns 233794.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12584 ns 12542 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12583 ns 11875 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12000 ns 12645.5 ns 0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12750 ns 11875 ns 1.07
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 795022.5 ns 787815 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 343417 ns 332667 ns 1.03
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 344208 ns 344396 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 391084 ns 395770.5 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 294084 ns 312500 ns 0.94
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16891 ns 16497 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 706521 ns 706958.5 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 725708.5 ns 725208 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1022083 ns 1019750 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 656916 ns 658292 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 200720.5 ns 198046.5 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23487 ns 22951 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6875 ns 6542 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6667 ns 6208 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6792 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6750 ns 6208 ns 1.09
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 240615.5 ns 237567.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5709 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5750 ns 5667 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5708 ns 5875 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5792 ns 5667 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24487 ns 24038 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21375 ns 21958 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21542 ns 20875 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21542 ns 21625 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21917 ns 21125 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 263364.5 ns 260574.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 155396 ns 146812.5 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 145875 ns 143875 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 150208 ns 145917 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 146709 ns 178146 ns 0.82
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167437.5 ns 166659.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1336584 ns 1355917 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1275583 ns 1329374.5 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1335084 ns 861416.5 ns 1.55
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1360584 ns 1325916 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1355756.5 ns 1338261 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24917 ns 23084 ns 1.08
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24417 ns 21458 ns 1.14
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25375 ns 24042 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24437.5 ns 23958 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 291074 ns 350919.5 ns 0.83
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 118167 ns 179500 ns 0.66
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 131959 ns 120541 ns 1.09
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 177208 ns 118167 ns 1.50
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 137250 ns 151208 ns 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1471741 ns 1454020.5 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 334 ns 292 ns 1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23087.5 ns 22580 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6833 ns 6291 ns 1.09
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6750 ns 6334 ns 1.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6500 ns 6791 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6750 ns 6208 ns 1.09
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 257954.5 ns 253799.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6792 ns 5042 ns 1.35
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4708 ns 4250 ns 1.11
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6000 ns 5833.5 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5895.5 ns 4666 ns 1.26
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 256219 ns 254794.5 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10125 ns 10042 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10292 ns 10042 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10250 ns 10417 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10250 ns 10125 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1358618 ns 1352736 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1583 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1583 ns 1583 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1584 ns 1584 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1542 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23481 ns 23495 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5584 ns 5708 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5958 ns 5667 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5709 ns 5750 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5959 ns 5625 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 277247.5 ns 273637.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6821583 ns 6842458 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6380125 ns 6343020.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6552937.5 ns 6507417 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7533458 ns 7623042 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214794 ns 213659 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24101916.5 ns 24131500 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21276833.5 ns 21298104 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 20988834 ns 21004749.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29842250 ns 29792896 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2114879 ns 2117701 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37563979 ns 37668083 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45438542 ns 34323688 ns 1.32
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45648292 ns 45641000 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49527416.5 ns 38230313 ns 1.30
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7083 ns 6459 ns 1.10
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7209 ns 5250 ns 1.37
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7416.5 ns 7500 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7000 ns 7458 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 236381.5 ns 235380.5 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8417 ns 8541 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8834 ns 7792 ns 1.13
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8500 ns 8292 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8584 ns 9208 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1061241 ns 1057995 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1558709 ns 1525083 ns 1.02
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1267000 ns 1258604.5 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1619916 ns 1613917 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2135916 ns 2159167 ns 0.99
lenet(28, 28, 1, 128)/forward/GPU/CUDA 279335.5 ns 273469.5 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7913875 ns 7971979 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6585896 ns 6561833.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7119833 ns 7004875 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10553854 ns 10476458 ns 1.01
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1875832 ns 1860749 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 334271 ns 326083.5 ns 1.03
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 347750 ns 347292 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 403041 ns 379020.5 ns 1.06
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 325792 ns 343562.5 ns 0.95
batchedmm(128, Bsize=4)/forward/GPU/CUDA 47242 ns 46613.5 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 747250 ns 745458 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 790812.5 ns 781417 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1075687.5 ns 1067437.5 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 772458 ns 751125 ns 1.03
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 311274.5 ns 306721.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397166 ns 396333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288125 ns 287916 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288125 ns 288062.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 750041 ns 751542 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44790 ns 43483 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 663125 ns 646375 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 527542 ns 531834 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 532084 ns 530042 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 975042 ns 973417 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 192626.5 ns 188389 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 651958 ns 653542 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 645104.5 ns 639041.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 641500 ns 545542 ns 1.18
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 647583.5 ns 655584 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132300 ns 131455.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2470625 ns 2529917 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2451792 ns 2399708 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2459958 ns 2436833 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2537146 ns 2460520.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1356668.5 ns 1513461 ns 0.90
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 342875 ns 323146 ns 1.06
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 342666 ns 343771 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 394791 ns 394750 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 290500 ns 310562 ns 0.94
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16338 ns 15996 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 699604.5 ns 699000 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 725000 ns 717792 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1021833 ns 1016334 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 650708 ns 649937 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 200343 ns 196510 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1466042 ns 1458958 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1504208 ns 1506167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1504292 ns 1503458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1443375 ns 1442834 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 41211 ns 39862 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5127354.5 ns 5157334 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5278875 ns 5010437.5 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5250459 ns 4993104 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5013583 ns 4988542 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 198454 ns 197580.5 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3666 ns 3709 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33443 ns 32748 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15208 ns 14833 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15375 ns 15125 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15083 ns 15292 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15166 ns 15041 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 380259.5 ns 374855 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 70979.5 ns 71625 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71084 ns 71333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71041 ns 71333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71125 ns 71333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113289 ns 113422 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 317167 ns 326208 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 319541 ns 318250 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 320167 ns 319375 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 326750 ns 317917 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 196353 ns 192316 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1000 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1042 ns 959 ns 1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 958 ns 1083 ns 0.88
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1000 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23589 ns 23450 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8375 ns 8042 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8417 ns 7895.5 ns 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 8333 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 7792 ns 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 262304.5 ns 258455 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 472833 ns 465250 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 480833 ns 472750 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 551167 ns 547875 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 543125 ns 554667 ns 0.98
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129864.5 ns 130091 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1386000 ns 1420208 ns 0.98
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1384250 ns 1378895.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1602583 ns 1600250 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1624229.5 ns 1587791 ns 1.02
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 274560 ns 274988 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32065 ns 31336 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6542 ns 6625 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6584 ns 5959 ns 1.10
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6209 ns 6354.5 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6584 ns 6166 ns 1.07
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 268387 ns 261129.5 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1724500 ns 1730708 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1724917 ns 1721229.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1729625 ns 1723750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1771291 ns 1730229 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169401 ns 168441.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4377917 ns 4400167 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4335812.5 ns 4366354 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4372021 ns 3903958 ns 1.12
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4418583.5 ns 4358458 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1249618 ns 1240708 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7041 ns 6792 ns 1.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6729.5 ns 6584 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7166 ns 6833 ns 1.05
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6833 ns 14542 ns 0.47
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20772 ns 20531 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 52792 ns 32708 ns 1.61
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 47770.5 ns 67708 ns 0.71
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 50875 ns 32833 ns 1.55
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 52417 ns 51667 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 210358.5 ns 291979.5 ns 0.72
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 356500 ns 336292 ns 1.06
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 349062 ns 347187.5 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 421583 ns 415021 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 300208.5 ns 324666.5 ns 0.92
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18809 ns 18102.5 ns 1.04
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 721458.5 ns 718416.5 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 732250 ns 727250 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1033791 ns 1030292 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 673417 ns 672709 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 344380.5 ns 346719.5 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75583 ns 75667 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75291 ns 75208 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75209 ns 75375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75145.5 ns 75000 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47560 ns 46739 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 324250 ns 333209 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 332667 ns 331291 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 327750 ns 332729.5 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 333166 ns 324292 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 213112 ns 208913 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1490584 ns 1483875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1530958 ns 1531875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1530417 ns 1529458 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1466709 ns 1467834 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 52219 ns 51266 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5129458.5 ns 5149875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5247062.5 ns 5290166.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5269917 ns 5287000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5022250 ns 4982583 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 201172.5 ns 202737.5 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28125 ns 28291 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28167 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28250 ns 28291 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28250 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24829 ns 24497 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66292 ns 66625 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66500 ns 66542 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66833 ns 66500 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66500 ns 66500 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 533991 ns 532969 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1498834 ns 1260875 ns 1.19
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1142333 ns 1118417 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1135875 ns 1056541 ns 1.08
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2243250 ns 2256375 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 568748 ns 573252 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3097084 ns 3028208 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2582250 ns 2726937.5 ns 0.95
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2755667 ns 2733875 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3877542 ns 3818500 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2062479 ns 1997088 ns 1.03
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8842146 ns 8958062.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8794833 ns 8813834 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8782250 ns 8742917 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6445917 ns 6350021 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 82917 ns 82895.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81166 ns 80270.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 85750 ns 82875 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 93479.5 ns 80167 ns 1.17
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192413.5 ns 192999 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020417 ns 2045708.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2018916.5 ns 2026499.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2022875 ns 2015875 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2038583 ns 2005042 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 792691 ns 797613 ns 0.99

This comment was automatically generated by workflow using github-action-benchmark.
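
For readers scanning the table above: the Ratio column matches Current divided by Previous, rounded to two decimals, so values above 1.00 mean the current commit was slower. The sketch below recomputes the ratio for three rows copied from the table and flags the slower ones. The helper names and the 1.10 cutoff are assumptions made for illustration; they are not part of github-action-benchmark or of this PR.

```python
# Minimal sketch: recompute the Ratio column from the Current/Previous timings.
# `ratio`, `slower_than`, and the 1.10 cutoff are hypothetical, chosen for illustration.

def ratio(current_ns: float, previous_ns: float) -> float:
    """Current / Previous, rounded to two decimals as in the table."""
    return round(current_ns / previous_ns, 2)

def slower_than(rows: dict, cutoff: float = 1.10) -> dict:
    """Return {benchmark name: ratio} for rows whose ratio exceeds the cutoff."""
    flagged = {}
    for name, (current_ns, previous_ns) in rows.items():
        r = ratio(current_ns, previous_ns)
        if r > cutoff:
            flagged[name] = r
    return flagged

# Three rows copied from the table above: (Current ns, Previous ns).
rows = {
    "batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)": (300208.5, 324666.5),
    "mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s)": (1498834, 1260875),
    "groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)": (93479.5, 80167),
}

for name, r in slower_than(rows).items():
    print(f"{r:.2f}  {name}")
```

With these three rows, only the `mlp7layer_bn` and `groupnorm` entries exceed the illustrative cutoff; single-run differences of this size can also be measurement noise.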

avik-pal added the xla label on Nov 2, 2024
avik-pal force-pushed the ap/initial_jax_bench branch 4 times, most recently from c933d28 to 4a02032, on November 3, 2024 19:42
avik-pal (Member, Author) commented Nov 4, 2024

Use #1021 and remove the tracing part from the Lux extension.

avik-pal force-pushed the ap/initial_jax_bench branch 4 times, most recently from f041d46 to 863de31, on November 5, 2024 13:51