perf: benchmarking our models against Jax (Flax) #1000
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Lux Benchmarks
Benchmark suite | Current: f041d46 | Previous: 409eda2 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4333 ns | 4334 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4042 ns | 4125 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5416 ns | 5417 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4334 ns | 4167 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 61224 ns | 59978 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10584 ns | 10333 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10916 ns | 10167 ns | 1.07 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 11125 ns | 10500 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10750 ns | 10167 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 426403 ns | 416390 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1208 ns | 1166.5 ns | 1.04 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3042 ns | 3042 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1250 ns | 1208 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1000 ns | 1000 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18376 ns | 18063 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4042 ns | 4084 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4083 ns | 3958 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4333 ns | 4250 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4062.5 ns | 4125 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 110978.5 ns | 109325.5 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57479.5 ns | 56041 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46084 ns | 46084 ns | 1 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46833 ns | 46375 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82541 ns | 81834 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37554.5 ns | 36229 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2024292 ns | 2056625 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2082458.5 ns | 2082416.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2089083 ns | 2056666.5 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2009166 ns | 1995458 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 197120.5 ns | 192802 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 146917 ns | 172458 ns | 0.85 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 145958 ns | 144854.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 144958.5 ns | 148125 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 148500 ns | 146125 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166245.5 ns | 166789 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1116625 ns | 1157666 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1115271 ns | 1110395.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1124042 ns | 1128416.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1137562.5 ns | 1120208 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 527410 ns | 516061 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3750 ns | 3583 ns | 1.05 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3354.5 ns | 3583.5 ns | 0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4541 ns | 4229.5 ns | 1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3917 ns | 3292 ns | 1.19 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 70257 ns | 69748 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9542 ns | 8792 ns | 1.09 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9750 ns | 9125 ns | 1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9375 ns | 9000 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8959 ns | 9209 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 480558 ns | 470533 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17167 ns | 15083 ns | 1.14 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16958 ns | 14875 ns | 1.14 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18041.5 ns | 16583 ns | 1.09 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15417 ns | 14917 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 55098 ns | 53475 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219500 ns | 222375 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213417 ns | 213084 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 212958 ns | 213250 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 220083.5 ns | 213520.5 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 274263 ns | 267675 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 500 ns | 1.25 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 542 ns | 1.15 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 750 ns | 584 ns | 1.28 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 625 ns | 583 ns | 1.07 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17758 ns | 17384 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1458 ns | 1500 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1417 ns | 1500 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1458 ns | 1750 ns | 0.83 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1750 ns | 1583 ns | 1.11 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 103393 ns | 103376 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7041 ns | 7041 ns | 1 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5958 ns | 5625 ns | 1.06 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5917 ns | 5709 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10167 ns | 9916 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23878 ns | 23093 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 233625 ns | 227583.5 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 240250 ns | 230417 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229083 ns | 228000 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 221791 ns | 215542 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 170258.5 ns | 166208.5 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3833 ns | 3916 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3875 ns | 3875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3833 ns | 3834 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3875 ns | 3834 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 24141 ns | 23533 ns | 1.03 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16667 ns | 16708 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 17000 ns | 16750 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16667 ns | 16791 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16792 ns | 16625 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 164532.5 ns | 160718 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 579916 ns | 577333 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 572084 ns | 573417 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 580875 ns | 579000 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 575500 ns | 574042 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 114358 ns | 113474 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1423042 ns | 1432312.5 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1416167 ns | 1426250 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1424833 ns | 1425917 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1428583 ns | 1418000 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 214116.5 ns | 211622 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1079500 ns | 1046541 ns | 1.03 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 959437.5 ns | 965500 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1349375 ns | 1347458 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1290312.5 ns | 1290542 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 277710 ns | 267857 ns | 1.04 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5926792 ns | 5895833.5 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4595334 ns | 4588042 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4957000 ns | 4928187 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5554125.5 ns | 5737167 ns | 0.97 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1102113.5 ns | 1066176 ns | 1.03 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 541 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 541 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23990 ns | 23460 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2167 ns | 2084 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2167 ns | 2125 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2292 ns | 0.95 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2167 ns | 2125 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 172418 ns | 169490.5 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6541 ns | 5458 ns | 1.20 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6459 ns | 4000 ns | 1.61 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5834 ns | 5687.5 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4333 ns | 6250 ns | 0.69 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 65787 ns | 64594 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11542 ns | 11083 ns | 1.04 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11917 ns | 11333 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11791 ns | 12041 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11792 ns | 11083.5 ns | 1.06 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 451007 ns | 444224 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8062.5 ns | 6708 ns | 1.20 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7709 ns | 6416 ns | 1.20 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8042 ns | 7875 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7041 ns | 6500 ns | 1.08 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 52207 ns | 51136 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 18583 ns | 17583 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17208 ns | 16958 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17417 ns | 18145.5 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17666 ns | 16916 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 303772 ns | 297812 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 500 ns | 1.17 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 500 ns | 1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 541 ns | 583 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 500 ns | 1.08 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 32740 ns | 31896 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9229.5 ns | 8916 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8792 ns | 8667 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 8834 ns | 9250 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8833 ns | 8645.5 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 160219.5 ns | 155805 ns | 1.03 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64500 ns | 64937.5 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64583 ns | 62625 ns | 1.03 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64708 ns | 64500 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64500 ns | 64667 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 112862 ns | 110478.5 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 280333.5 ns | 294791 ns | 0.95 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 282917 ns | 279125 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 273667 ns | 275479.5 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 295167 ns | 280854.5 ns | 1.05 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 189364.5 ns | 185224.5 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3317583.5 ns | 3152041.5 ns | 1.05 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3019208.5 ns | 3026187 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3016375 ns | 3022520.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 4051917 ns | 3964167 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 573433 ns | 573818.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7633708.5 ns | 7551166.5 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7443875 ns | 7449979 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7454416 ns | 7447000 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8302917 ns | 8208396 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1360917.5 ns | 1327975 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 18805792 ns | 18867458 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 19114625 ns | 19142541 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 19126000 ns | 19088834 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 15867542 ns | 15711167 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23869791.5 ns | 24315583.5 ns | 0.98 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33627063 ns | 33983500 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37312875 ns | 37046583.5 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 35492042 ns | 34841833 ns | 1.02 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1852173 ns | 2130242 ns | 0.87 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 189567292 ns | 192387270.5 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 164695625 ns | 163943875 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 152795917 ns | 152577625 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 448899584 ns | 437847333 ns | 1.03 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13897300 ns | 14119852 ns | 0.98 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 288986708.5 ns | 294725229.5 ns | 0.98 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 262891542 ns | 338344395.5 ns | 0.78 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 299964833 ns | 300590083.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 400223041.5 ns | 396800708.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24354.5 ns | 23687.5 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24667 ns | 23083 ns | 1.07 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25291 ns | 24791 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22500 ns | 23708 ns | 0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 95040 ns | 95862 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 103104.5 ns | 103250 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 104541 ns | 103458 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 104291.5 ns | 103667 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 103083 ns | 102750 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 493349.5 ns | 494978 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6937.5 ns | 7083 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7042 ns | 5750 ns | 1.22 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7459 ns | 6875 ns | 1.08 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6958.5 ns | 7000 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 66551.5 ns | 67128 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15416 ns | 15375 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15979.5 ns | 15395.5 ns | 1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15917 ns | 16000 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15459 ns | 14791.5 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 466767.5 ns | 467877 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 2917521 ns | 3009166.5 ns | 0.97 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2054000 ns | 2067250 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2294625 ns | 2279667 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4822250 ns | 4832667 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 580968 ns | 581800.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23583437.5 ns | 23921708.5 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18092083 ns | 18037292 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 16977875 ns | 16963187.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36087625 ns | 34623770.5 ns | 1.04 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3231520 ns | 3105602 ns | 1.04 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33370542 ns | 33780291 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27613625.5 ns | 27715666.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27378771 ns | 27451041 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 42452646 ns | 41640208 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 73875 ns | 80479 ns | 0.92 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76083 ns | 72416 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 84042 ns | 78354 ns | 1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 73208 ns | 74645.5 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 102315.5 ns | 100885 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 297250 ns | 311542 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 313813 ns | 224520.5 ns | 1.40 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 307833 ns | 209667 ns | 1.47 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 288708 ns | 257021 ns | 1.12 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 539612.5 ns | 539235 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12708 ns | 12500 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12625 ns | 11708 ns | 1.08 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13500 ns | 12542 ns | 1.08 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12000 ns | 12833.5 ns | 0.94 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 69202.5 ns | 70648 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 27083 ns | 26667 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27270.5 ns | 26958.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27416 ns | 27333.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 27417 ns | 26625 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 467808.5 ns | 470896 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 13042 ns | 12791 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 13208 ns | 12333 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13375 ns | 13500 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 13042 ns | 12875 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 51727.5 ns | 52214 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26312 ns | 25959 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26000 ns | 25750 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27438 ns | 26500 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26125 ns | 26500 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 298006 ns | 300818.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 179854.5 ns | 180750 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183021 ns | 179583 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 183834 ns | 183146 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 180208 ns | 179250 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 55691.5 ns | 56380 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 584666.5 ns | 593542 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 583687.5 ns | 582459 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 594563 ns | 585042 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 592166 ns | 594562 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 281460.5 ns | 284588 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6958 ns | 6770.5 ns | 1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6875 ns | 5958 ns | 1.15 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7375 ns | 7084 ns | 1.04 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6083 ns | 7125 ns | 0.85 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 68775 ns | 70103 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14500 ns | 14709 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14500 ns | 14500 ns | 1 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15292 ns | 15291.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14833 ns | 13958 ns | 1.06 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 454520 ns | 460969.5 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1192833.5 ns | 1217750 ns | 0.98 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1266334 ns | 1209125 ns | 1.05 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1252229.5 ns | 1249750 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1308521 ns | 1326625 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301869 ns | 302841 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4121500 ns | 4351270.5 ns | 0.95 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4378666.5 ns | 4353042 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4521604 ns | 4630333 ns | 0.98 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 4629374.5 ns | 4466479 ns | 1.04 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1047678 ns | 1039570 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1833 ns | 1833 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1875 ns | 1792 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1916 ns | 1833 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1833 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23265 ns | 23644 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4917 ns | 4875 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4917 ns | 4875 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5000 ns | 5042 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4917 ns | 4875 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 186658 ns | 189061.5 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6625 ns | 6021 ns | 1.10 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6875 ns | 5708 ns | 1.20 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7521 ns | 7042 ns | 1.07 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6667 ns | 7416 ns | 0.90 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 54037.5 ns | 54998.5 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11666 ns | 11437.5 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11458 ns | 11084 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11042 ns | 11666 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11792 ns | 12333 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 325516.5 ns | 332242 ns | 0.98 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 333 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22987.5 ns | 22998 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2667 ns | 2667 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 3000 ns | 2750 ns | 1.09 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2709 ns | 2750 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2917 ns | 2709 ns | 1.08 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 156969 ns | 158762.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
13708 ns |
13687.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
14083 ns |
11208 ns |
1.26 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
14000 ns |
13958 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
12125 ns |
14125 ns |
0.86 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
55221 ns |
57325 ns |
0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
25000 ns |
24625 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
25167 ns |
24250 ns |
1.04 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25500 ns |
25500 ns |
1 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
25167 ns |
24875 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
287940 ns |
295945 ns |
0.97 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4166 ns |
4167 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4167 ns |
4166 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4166 ns |
4167 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4166 ns |
4125 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
24849 ns |
24912 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16250 ns |
16084 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16291 ns |
16209 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16375 ns |
16333.5 ns |
1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
15958 ns |
16208 ns |
0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
193777.5 ns |
199034.5 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5708 ns |
5708 ns |
1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
5667 ns |
5584 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5792 ns |
5708 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5667 ns |
5708 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
32729 ns |
33099 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
20916 ns |
21166 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
20875 ns |
20458 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
20791 ns |
21333.5 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
21333 ns |
20875 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
172416.5 ns |
174613 ns |
0.99 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
397062.5 ns |
383042 ns |
1.04 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
374667 ns |
373541 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
485708 ns |
485896 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
509458 ns |
532854.5 ns |
0.96 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
66848 ns |
66578.5 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
994333 ns |
938166 ns |
1.06 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
892041 ns |
847083 ns |
1.05 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
1239979 ns |
1235042 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
1415854 ns |
1418833 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
189858 ns |
191164 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
80604 ns |
81020.5 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
83208 ns |
80354.5 ns |
1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
82875 ns |
82250 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83458.5 ns |
132458 ns |
0.63 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
193036 ns |
192525 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1914708 ns |
1945166 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1920583 ns |
1909584 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1917958 ns |
1920333 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1922834 ns |
1914354.5 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
390299 ns |
402795 ns |
0.97 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
22056 ns |
21790 ns |
1.01 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
1834 ns |
1792 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
1834 ns |
1791 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
1875 ns |
1916 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
1833 ns |
1792 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
167168.5 ns |
172681 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6833 ns |
8000 ns |
0.85 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
7459 ns |
6833 ns |
1.09 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7834 ns |
8334 ns |
0.94 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
8250 ns |
7999.5 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
56962 ns |
62227.5 ns |
0.92 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9292 ns |
9375 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9334 ns |
8875 ns |
1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9708 ns |
9625 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9666 ns |
9250 ns |
1.04 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
298435.5 ns |
315550.5 ns |
0.95 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
121682687 ns |
159022167 ns |
0.77 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
174477334 ns |
174256125 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
147749499.5 ns |
147914021 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
106196334 ns |
102407958 ns |
1.04 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
5464130 ns |
5468366 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
617964916.5 ns |
678096083 ns |
0.91 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
554980792 ns |
555598625 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
450462938 ns |
453528479 ns |
0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
774770271 ns |
754205958.5 ns |
1.03 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
38232798 ns |
34940005 ns |
1.09 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
650752208 ns |
703546875 ns |
0.92 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
665180625 ns |
666832020.5 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
588795479 ns |
585927312.5 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
746732334 ns |
742692916 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57792 ns |
57542 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
47375 ns |
47583 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
47542 ns |
47291 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83625 ns |
82208 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
37861 ns |
37135 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1805209 ns |
1947333 ns |
0.93 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1972833.5 ns |
1971042 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1982166.5 ns |
1976458 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1895625 ns |
1893520.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
175561.5 ns |
171380.5 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
267979 ns |
272291 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
269854 ns |
265834 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
272271 ns |
289417 ns |
0.94 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
269270.5 ns |
267167 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
126577.5 ns |
135867.5 ns |
0.93 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
687417 ns |
671917 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
587708 ns |
596708 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
677146 ns |
696292 ns |
0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
621875 ns |
692687.5 ns |
0.90 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
660721 ns |
737698 ns |
0.90 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2217167 ns |
2231188 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2243687.5 ns |
2215042 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2207166.5 ns |
2207229 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2210229 ns |
2243770.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
134472.5 ns |
133226 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5500312.5 ns |
5572500 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5481541.5 ns |
5486875 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5533437.5 ns |
5511083 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5557958.5 ns |
5495666.5 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
709361 ns |
759202.5 ns |
0.93 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
646625 ns |
652833.5 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
640708 ns |
657229 ns |
0.97 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
644875 ns |
639500 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
649667 ns |
639791 ns |
1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
47317 ns |
46976 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
1821292 ns |
1799583 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
1717167 ns |
1724792 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1720208 ns |
1722792 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
2099917 ns |
2103895.5 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
224949 ns |
221178.5 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58167 ns |
56541 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46208 ns |
46833 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
47125 ns |
46041 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
85042 ns |
83792 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
28856 ns |
28073 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2030667 ns |
2058250 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2068666.5 ns |
2078709 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2091708.5 ns |
2093000 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1995624.5 ns |
1996646 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
190710.5 ns |
187152 ns |
1.02 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
13403541 ns |
13406125 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
12452354.5 ns |
12455458 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
12584874.5 ns |
12584792 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
15131208.5 ns |
14882959 ns |
1.02 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
515943 ns |
517201.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
47288708 ns |
47687000 ns |
0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
41797750 ns |
41754625 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
41166292 ns |
40922625 ns |
1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
59000458 ns |
58112708 ns |
1.02 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
3214571.5 ns |
3212087 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
74057125 ns |
74213479 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
91281417 ns |
68010000 ns |
1.34 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
90658500 ns |
90988625 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
99567250 ns |
76809750 ns |
1.30 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58917 ns |
56917 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46917 ns |
47042 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
47292 ns |
47041 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
84583 ns |
83375 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
47001.5 ns |
46301 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1909417 ns |
1939854 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1970125 ns |
1973333 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1977500 ns |
1974729.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1893666.5 ns |
1884375 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
193241.5 ns |
189579 ns |
1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
334 ns |
291 ns |
1.15 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
292 ns |
333 ns |
0.88 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
334 ns |
375 ns |
0.89 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
375 ns |
250 ns |
1.50 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
32060 ns |
31617 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6958 ns |
6229.5 ns |
1.12 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6292 ns |
6167 ns |
1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6500 ns |
6458 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6500 ns |
6167 ns |
1.05 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
167892 ns |
171396 ns |
0.98 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
250 ns |
250 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
291 ns |
250 ns |
1.16 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
250 ns |
1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
250 ns |
1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
31793 ns |
31328 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
2750 ns |
2583 ns |
1.06 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
2875 ns |
2625 ns |
1.10 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
2625 ns |
2792 ns |
0.94 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
2792 ns |
2625 ns |
1.06 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
158050.5 ns |
161410 ns |
0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
286392208.5 ns |
324182500 ns |
0.88 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
339629083 ns |
339536042 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
314686979 ns |
314625854 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
270977208 ns |
273060250 ns |
0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
7039044 ns |
7093070 ns |
0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
987040583 ns |
1051455583 ns |
0.94 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
937877000 ns |
941830875 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
860491458.5 ns |
858538271 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
1177023417 ns |
1153691292 ns |
1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
33913611 ns |
34020243.5 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
1312562792 ns |
1359481562.5 ns |
0.97 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
1685161292 ns |
1360673729 ns |
1.24 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
1644146958 ns |
1640965792 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
1659747291 ns |
1309802292 ns |
1.27 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1406521 ns |
1414416.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1407084 ns |
1409541 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1411250 ns |
1408500 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1418562.5 ns |
1453875 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
128151.5 ns |
127358 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5012500 ns |
5056229 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5022521 ns |
5013583 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5040375 ns |
4954291 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5037729 ns |
5017021 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
509438.5 ns |
601067 ns |
0.85 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) |
168947750 ns |
170719208 ns |
0.99 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) |
130945895.5 ns |
132607979.5 ns |
0.99 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) |
131658333 ns |
124493437.5 ns |
1.06 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) |
157330500 ns |
162230500 ns |
0.97 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA |
4922111 ns |
4886055.5 ns |
1.01 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) |
672521958 ns |
854987208 ns |
0.79 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) |
643855750 ns |
644456708 ns |
1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) |
500540709 ns |
532057834 ns |
0.94 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) |
857337417 ns |
687805708 ns |
1.25 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA |
16232508 ns |
16138006 ns |
1.01 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
8936646 ns |
9114041.5 ns |
0.98 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
8743917 ns |
8770313 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
7871374.5 ns |
7860292 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
10349042 ns |
10147292 ns |
1.02 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1606235 ns |
1612586 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
36162708 ns |
37546375 ns |
0.96 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
36945042 ns |
36886146 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
33490646 ns |
33451021 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
39951625 ns |
38875771 ns |
1.03 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
8915488 ns |
6459090.5 ns |
1.38 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
47500 ns |
47458.5 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
47459 ns |
49333 ns |
0.96 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
47667 ns |
49583 ns |
0.96 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
47416 ns |
47250 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
18767 ns |
18585 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
50458 ns |
50584 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
50542 ns |
50416 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
50750 ns |
50708.5 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
50500 ns |
50500 ns |
1 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
162985 ns |
216293 ns |
0.75 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7708 ns |
7979.5 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8042 ns |
6791 ns |
1.18 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
8333 ns |
8875 ns |
0.94 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7750 ns |
8583 ns |
0.90 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
75765 ns |
106035 ns |
0.71 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9958 ns |
10333 ns |
0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10437.5 ns |
9958 ns |
1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10791 ns |
10500 ns |
1.03 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10375 ns |
10167 ns |
1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
456610.5 ns |
612658 ns |
0.75 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
8209 ns |
8750 ns |
0.94 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
8667 ns |
6438 ns |
1.35 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
8875 ns |
8667 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
6750 ns |
5875 ns |
1.15 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
86529.5 ns |
119844.5 ns |
0.72 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
13166 ns |
13375 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
13458.5 ns |
13000 ns |
1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13542 ns |
13416 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
13333 ns |
12791 ns |
1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
414607 ns |
517417.5 ns |
0.80 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
1083 ns |
1042 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
1042 ns |
958 ns |
1.09 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
1083 ns |
1042 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
1083 ns |
1042 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
32103 ns |
31817 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8250 ns |
8041 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8250 ns |
7750 ns |
1.06 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8125 ns |
8333 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8167 ns |
8292 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
187429 ns |
203048 ns |
0.92 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
23334 ns |
23145.5 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
23042 ns |
24541 ns |
0.94 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
23750 ns |
24167 ns |
0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
23292 ns |
23334 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
18388 ns |
18371 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
52458 ns |
52542 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
53000 ns |
52416 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
54667 ns |
52500 ns |
1.04 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
52687.5 ns |
52334 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
232789 ns |
295739.5 ns |
0.79 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1399583 ns |
1440625 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1403500 ns |
1400291 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1404584 ns |
1400875 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1409833 ns |
1406313 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
195872 ns |
194620 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4993083 ns |
5047479.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4929125 ns |
5003458.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5032375 ns |
4836292 ns |
1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5036542 ns |
4996708 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
540199 ns |
628014 ns |
0.86 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
3042896 ns |
3062438 ns |
0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
2068771 ns |
2084417 ns |
0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
2296625 ns |
2227208.5 ns |
1.03 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
4859000 ns |
4812250 ns |
1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
582789 ns |
579246 ns |
1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
24291292 ns |
24741125 ns |
0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
18912083.5 ns |
18811521 ns |
1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
18946541.5 ns |
18691437 ns |
1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
37157750 ns |
36587416 ns |
1.02 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
3180435 ns |
3196070 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
34008000 ns |
34435312 ns |
0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
28284166.5 ns |
28306583.5 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
28170875 ns |
28069750 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
42294291.5 ns |
41958375 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
144251875 ns |
145325041 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
142297416 ns |
141848041.5 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
125012271 ns |
123758375 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
174017792 ns |
173196604 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22792818 ns |
22560824 ns |
1.01 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
1719710313 ns |
942531917 ns |
1.82 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
1131712375 ns |
871530625 ns |
1.30 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
737685000 ns |
1498315250 ns |
0.49 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
687694667 ns |
674150833 ns |
1.02 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
118898101 ns |
118289465 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
76084 ns |
76208 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
74354.5 ns |
75041 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
78812.5 ns |
77875 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
78083 ns |
75417 ns |
1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
190339.5 ns |
273038.5 ns |
0.70 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
290500 ns |
299708 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
288500 ns |
284646 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
207354 ns |
191687.5 ns |
1.08 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
292084 ns |
202979.5 ns |
1.44 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1065883 ns |
1439967 ns |
0.74 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
35467167 ns |
36345458 ns |
0.98 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
35107250 ns |
35416645.5 ns |
0.99 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
32188375 ns |
32239562.5 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
41524750 ns |
40930312.5 ns |
1.01 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5849338 ns |
5849412 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
146402291 ns |
151966416 ns |
0.96 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
151223500 ns |
152232437.5 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
134376625 ns |
136165208.5 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
227426875 ns |
287396625 ns |
0.79 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
34896547 ns |
34914778 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
121785604 ns |
158627833 ns |
0.77 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
173641625 ns |
174511667 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
147809020.5 ns |
148215771.5 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
104816458 ns |
108212479 ns |
0.97 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
5458335 ns |
5459784 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
470304750 ns |
524328229.5 ns |
0.90 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
466493959 ns |
467038291 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
437765479 ns |
441190000 ns |
0.99 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
760795396 ns |
741818542 ns |
1.03 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
35146337 ns |
32279915 ns |
1.09 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
713005604 ns |
692549750 ns |
1.03 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
653623645.5 ns |
656203708.5 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
573970979.5 ns |
573625208 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
865330958 ns |
853537834 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) |
1307604 ns |
1226937.5 ns |
1.07 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) |
972645.5 ns |
992979 ns |
0.98 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) |
954916.5 ns |
904625 ns |
1.06 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) |
2066083 ns |
2085917 ns |
0.99 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA |
571599 ns |
566912.5 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) |
2322750 ns |
2909667 ns |
0.80 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) |
2616416 ns |
2628208 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) |
2622375 ns |
2006333.5 ns |
1.31 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) |
3780709 ns |
3693750.5 ns |
1.02 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA |
1633470.5 ns |
1796011.5 ns |
0.91 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) |
6650459 ns |
6757875 ns |
0.98 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) |
6509729 ns |
6503250 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) |
6513417 ns |
6239125 ns |
1.04 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) |
4521042 ns |
4454771 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7375 ns |
7250 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5958 ns |
6167 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6167 ns |
6208 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10291 ns |
10250 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
25452.5 ns |
24809.5 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
212833.5 ns |
213666 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220708 ns |
220313 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
220916 ns |
220125 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
206042 ns |
209542 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
235444 ns |
276995.5 ns |
0.85 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) |
302269625 ns |
315354292 ns |
0.96 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) |
221254250 ns |
221860750 ns |
1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) |
198336250 ns |
197740833.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) |
309762604 ns |
312004542 ns |
0.99 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA |
7684423 ns |
7676221 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) |
1084217792 ns |
1085627020.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) |
893335249.5 ns |
891084375.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) |
869473667 ns |
865730125 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) |
1182096250 ns |
1163266979.5 ns |
1.02 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA |
26585457 ns |
26544800.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5584 ns |
6083 ns |
0.92 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
6959 ns |
5583 ns |
1.25 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
8792 ns |
7375 ns |
1.19 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
6541 ns |
5270.5 ns |
1.24 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
141680.5 ns |
178949 ns |
0.79 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7500 ns |
7708 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7500 ns |
7292 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7375 ns |
7500 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7958 ns |
6792 ns |
1.17 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
523009 ns |
667282.5 ns |
0.78 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
541 ns |
542 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
542 ns |
459 ns |
1.18 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
584 ns |
542 ns |
1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
500 ns |
459 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
24058 ns |
23245 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9375 ns |
9583.5 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9604 ns |
9167 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9833 ns |
9458.5 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9729 ns |
8792 ns |
1.11 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
191258 ns |
227149 ns |
0.84 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
353291 ns |
352521.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
351041 ns |
352709 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
351479.5 ns |
352958.5 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
351542 ns |
352708 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
21523 ns |
21007 ns |
1.02 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
775875 ns |
828104 ns |
0.94 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
776000 ns |
820292 ns |
0.95 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
773667 ns |
773500 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
831708 ns |
828312 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
254955.5 ns |
289596 ns |
0.88 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
332666 ns |
312083.5 ns |
1.07 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
343000 ns |
340166.5 ns |
1.01 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
451416 ns |
445354 ns |
1.01 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
311792 ns |
333520.5 ns |
0.93 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
18160 ns |
17918 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
689103.5 ns |
691583 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
738792 ns |
732334 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
1027062.5 ns |
1026459 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
692499.5 ns |
691042 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
225496 ns |
273557 ns |
0.82 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
348333 ns |
332396 ns |
1.05 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
356062.5 ns |
348875 ns |
1.02 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
408667 ns |
409541 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
349542 ns |
375250 ns |
0.93 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
22957 ns |
22378 ns |
1.03 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
753250 ns |
755875 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
751666.5 ns |
743000 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
1074208 ns |
1068417 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
817958 ns |
822124.5 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
215397 ns |
239682 ns |
0.90 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
3500 ns |
3625 ns |
0.97 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
3625 ns |
3417 ns |
1.06 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
3750 ns |
3583 ns |
1.05 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
3542 ns |
3583 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
18313 ns |
17823 ns |
1.03 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
4167 ns |
4208 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
4416 ns |
4167 ns |
1.06 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
4333 ns |
4375 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
4208 ns |
4292 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
220540.5 ns |
271995 ns |
0.81 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
4416 ns |
4792 ns |
0.92 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
3833 ns |
3834 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
5229.5 ns |
5250 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
3625 ns |
3625 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
166773 ns |
214003.5 ns |
0.78 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8125 ns |
8354.5 ns |
0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8667 ns |
8334 ns |
1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8500 ns |
8667 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8625 ns |
8417 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
1013865.5 ns |
1200425 ns |
0.84 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
204458 ns |
204209 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
211917 ns |
210000 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
212625 ns |
211875 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
200542 ns |
199417 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
34728 ns |
34086 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
649167 ns |
608520.5 ns |
1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
625833 ns |
620750 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
620583 ns |
620416 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
630375 ns |
628625 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
287189.5 ns |
347622 ns |
0.83 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
972479.5 ns |
980000 ns |
0.99 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
935500 ns |
929916.5 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
952291.5 ns |
954250 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
1300521 ns |
1278542 ns |
1.02 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
207194 ns |
206777 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
4514354.5 ns |
4651729 ns |
0.97 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
4465500 ns |
4500083 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
4301791.5 ns |
4296645.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
6497562.5 ns |
6216979.5 ns |
1.05 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
973204.5 ns |
942518 ns |
1.03 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
4916.5 ns |
3916 ns |
1.26 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3292 ns |
3375 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
5167 ns |
4667 ns |
1.11 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4166 ns |
3354.5 ns |
1.24 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
167779.5 ns |
231395.5 ns |
0.73 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7875 ns |
7375 ns |
1.07 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7562.5 ns |
7292 ns |
1.04 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7792 ns |
7667 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7416 ns |
7000 ns |
1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
854999 ns |
1002762 ns |
0.85 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1611729.5 ns |
1644583 ns |
0.98 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1152542 ns |
1174458 ns |
0.98 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1367062.5 ns |
1323125 ns |
1.03 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2452792 ns |
2461333.5 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
216363 ns |
213304.5 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12329750 ns |
12444729.5 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9540958 ns |
9564709 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9263625 ns |
9234833 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18057917 ns |
18020417 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
1943297 ns |
1940786 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17406250 ns |
17431792 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14327042 ns |
14392958.5 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14326417 ns |
14240000 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21194250 ns |
21049562.5 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
89187.5 ns |
90625 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
86854 ns |
88041 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
91709 ns |
92333 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
91270.5 ns |
136917 ns |
0.67 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
126235 ns |
125618 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2037042 ns |
2061125 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1748458 ns |
2018458 ns |
0.87 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1941042 ns |
1720042 ns |
1.13 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2051500 ns |
2024104 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
855504 ns |
1024038 ns |
0.84 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
343916.5 ns |
331312 ns |
1.04 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
346500 ns |
343500 ns |
1.01 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
394208.5 ns |
395083 ns |
1.00 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
290375 ns |
310458.5 ns |
0.94 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
15675 ns |
15733 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
700250 ns |
699959 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
731792 ns |
722062.5 ns |
1.01 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
1023542 ns |
1018209 ns |
1.01 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
649292 ns |
646375 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
154345 ns |
189475.5 ns |
0.81 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7333 ns |
7167 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5875 ns |
5958 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6042 ns |
5875 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10250 ns |
10000 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
33224 ns |
33239 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
215125.5 ns |
221625 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220459 ns |
219959 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
220917 ns |
219750 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
240125 ns |
218375 ns |
1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
265024.5 ns |
314279 ns |
0.84 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3667 ns |
3750 ns |
0.98 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3708 ns |
3667 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3667 ns |
3667 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3750 ns |
3667 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
22498 ns |
22722 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14417 ns |
14167 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14417 ns |
14334 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14375 ns |
14291 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14292 ns |
14375 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
375019.5 ns |
475447 ns |
0.79 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
92042 ns |
95166.5 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
93291 ns |
91833 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
96333 ns |
96125 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
94604.5 ns |
139167 ns |
0.68 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
125576 ns |
125450 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1923750 ns |
1948250 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1650625 ns |
1921104.5 ns |
0.86 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1929083 ns |
1669729.5 ns |
1.16 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1934583 ns |
1920708.5 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
790898.5 ns |
954893.5 ns |
0.83 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) |
870500 ns |
854375 ns |
1.02 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) |
827875 ns |
817542 ns |
1.01 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) |
1213792 ns |
1213833.5 ns |
1.00 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) |
935542 ns |
958895.5 ns |
0.98 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA |
272131.5 ns |
276078 ns |
0.99 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) |
2796417 ns |
2843334 ns |
0.98 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) |
2438417 ns |
2456145.5 ns |
0.99 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) |
3334749.5 ns |
3332000 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) |
3398250 ns |
3419792 ns |
0.99 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA |
1412395 ns |
1629171 ns |
0.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
16937 ns |
15333 ns |
1.10 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
17187.5 ns |
14709 ns |
1.17 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
19208.5 ns |
17041 ns |
1.13 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
16500 ns |
14333 ns |
1.15 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
112153 ns |
142609.5 ns |
0.79 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
262584 ns |
262125 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
217708 ns |
215416.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
216167 ns |
215250 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
257875 ns |
221958 ns |
1.16 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
531938.5 ns |
641081.5 ns |
0.83 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
222209 ns |
221583.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
221417 ns |
218625 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
222958 ns |
222833 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
221625 ns |
221750 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
211097.5 ns |
271537.5 ns |
0.78 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
499479 ns |
497750 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
498250 ns |
494833 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
497500 ns |
497084 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
506791.5 ns |
509000 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1218166.5 ns |
1365399 ns |
0.89 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
330729 ns |
315729 ns |
1.05 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
336917 ns |
333917 ns |
1.01 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
363625 ns |
375125 ns |
0.97 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
303145.5 ns |
322083 ns |
0.94 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
16466 ns |
16846 ns |
0.98 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
710584 ns |
710041 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
729625 ns |
725063 ns |
1.01 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
1023687.5 ns |
1022417 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
661708 ns |
663021 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
188015 ns |
196884 ns |
0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
21250 ns |
17625 ns |
1.21 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
17625 ns |
16708 ns |
1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
20125 ns |
18792 ns |
1.07 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
18521 ns |
17625 ns |
1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
143789 ns |
144721 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213541 ns |
220104.5 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
218604 ns |
212792 ns |
1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
213062.5 ns |
212750 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
239750 ns |
217250 ns |
1.10 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
905449 ns |
955774 ns |
0.95 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6833 ns |
6042 ns |
1.13 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
6375 ns |
4250 ns |
1.50 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
6229 ns |
6958 ns |
0.90 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
6917 ns |
6541 ns |
1.06 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
186975 ns |
245177 ns |
0.76 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10625 ns |
10583.5 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10667 ns |
10250 ns |
1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
11500 ns |
10708 ns |
1.07 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10979 ns |
10084 ns |
1.09 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
1010754 ns |
1099715 ns |
0.92 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3791.5 ns |
4542 ns |
0.83 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3208 ns |
3208 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4895.5 ns |
4834 ns |
1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4354 ns |
2875 ns |
1.51 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
207943 ns |
250616.5 ns |
0.83 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7750 ns |
7125 ns |
1.09 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7625 ns |
7375 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8208 ns |
7750 ns |
1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7792 ns |
7375 ns |
1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
1022804 ns |
1110249 ns |
0.92 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
23591167 ns |
24293729.5 ns |
0.97 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
34749750 ns |
34647499.5 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
37965979 ns |
38065167 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
35331667 ns |
34799687.5 ns |
1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1845478 ns |
1834951 ns |
1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
184329000 ns |
187799375 ns |
0.98 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
159275792 ns |
159175458 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
146325020.5 ns |
146555271 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
421919833.5 ns |
415008291 ns |
1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
16511002 ns |
16504056.5 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
426977958 ns |
437855250 ns |
0.98 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
253895666.5 ns |
254443000 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
232491541.5 ns |
231693624.5 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
496597729.5 ns |
485497958 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
182708 ns |
184229.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
183500 ns |
181916 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
186750 ns |
184084 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
183625.5 ns |
182167 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
180986.5 ns |
230730 ns |
0.78 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
599646 ns |
637084 ns |
0.94 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
599167 ns |
586270.5 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
586312.5 ns |
586583 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
631583 ns |
631542 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1006207 ns |
1097701 ns |
0.92 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
3875729 ns |
3894562.5 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
3691625.5 ns |
3827292 ns |
0.96 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
3494000 ns |
3469958 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
5492270.5 ns |
5353020.5 ns |
1.03 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
537118 ns |
535365 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
17349416 ns |
18146250 ns |
0.96 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
17179917 ns |
17166041.5 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
16549228.5 ns |
16601417 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
23178583 ns |
22202083 ns |
1.04 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
2620708.5 ns |
2616593 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
500 ns |
500 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
542 ns |
458 ns |
1.18 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
500 ns |
500 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
583 ns |
500 ns |
1.17 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
31700 ns |
32123 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9584 ns |
9458 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
9417 ns |
8667 ns |
1.09 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9458 ns |
9167 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
9937.5 ns |
9208 ns |
1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
257065.5 ns |
267754 ns |
0.96 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) |
495483250 ns |
580762562.5 ns |
0.85 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) |
429872875 ns |
427173312.5 ns |
1.01 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) |
433975833 ns |
376948624.5 ns |
1.15 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) |
676305021 ns |
671986666.5 ns |
1.01 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA |
12479376 ns |
12479261 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) |
2043995812.5 ns |
2061821458.5 ns |
0.99 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) |
1631415458 ns |
1626836125 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) |
1493405541.5 ns |
1500724875 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) |
2222310229 ns |
2217147562.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA |
49049243.5 ns |
48947892 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1658542 ns |
1651250 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1177291 ns |
1196959 ns |
0.98 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1381521 ns |
1346187.5 ns |
1.03 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2411375 ns |
2356042 ns |
1.02 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
215956.5 ns |
218070 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12702500 ns |
12822417 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9925417 ns |
9953541.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9668750 ns |
9605000 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18553500 ns |
18408062.5 ns |
1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2022588 ns |
2047696.5 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17684438 ns |
17771104.5 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14677583 ns |
14762729 ns |
0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14539333.5 ns |
14473917 ns |
1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21490417 ns |
21336042 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
26208 ns |
26250 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
26208 ns |
26209 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
26208 ns |
26583 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
26250 ns |
26209 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
24528 ns |
24922 ns |
0.98 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
66750 ns |
66792 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
67084 ns |
67000 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
67750 ns |
66791 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66875 ns |
66916 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
381706 ns |
410676.5 ns |
0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
204209 ns |
203542 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
208750 ns |
210583 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
209667 ns |
210500 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
199916 ns |
199958 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
26151 ns |
26405 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
649958 ns |
602333 ns |
1.08 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
624646 ns |
621292 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
666084 ns |
621250 ns |
1.07 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
587687.5 ns |
630584 ns |
0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
308606 ns |
355627 ns |
0.87 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
650875 ns |
657646 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
634875 ns |
638729 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
648000 ns |
544125 ns |
1.19 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
654646 ns |
677396 ns |
0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
131873 ns |
132242 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2262708.5 ns |
2305542 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1990917 ns |
2254292 ns |
0.88 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2254312.5 ns |
1426250 ns |
1.58 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2295625 ns |
2248542 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1112430.5 ns |
1182706 ns |
0.94 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
18688 ns |
17937.5 ns |
1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
18063 ns |
17042 ns |
1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
22542 ns |
19500 ns |
1.16 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
19375 ns |
16895.5 ns |
1.15 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
143353.5 ns |
144900 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
261250 ns |
220000 ns |
1.19 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
219104.5 ns |
218416.5 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
263062.5 ns |
219458 ns |
1.20 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
231875 ns |
261708 ns |
0.89 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
943050 ns |
1051792 ns |
0.90 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
542 ns |
459 ns |
1.18 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
583 ns |
459 ns |
1.27 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
583 ns |
542 ns |
1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
583 ns |
458 ns |
1.27 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
23123 ns |
23475 ns |
0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
10125 ns |
9520.5 ns |
1.06 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
10292 ns |
9541 ns |
1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9750 ns |
10166 ns |
0.96 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
10250 ns |
9375 ns |
1.09 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
253508 ns |
261505 ns |
0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6292 ns |
6542 ns |
0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
7229.5 ns |
5292 ns |
1.37 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6791 ns |
6625 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
6334 ns |
7416 ns |
0.85 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
187408.5 ns |
235631 ns |
0.80 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7417 ns |
7000 ns |
1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7583 ns |
7291 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7000 ns |
7250 ns |
0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7500 ns |
7208 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
734791.5 ns |
803793 ns |
0.91 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
2333 ns |
2334 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
2042 ns |
2041 ns |
1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
2500 ns |
2292 ns |
1.09 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
2291.5 ns |
2333 ns |
0.98 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA |
17938 ns |
18245.5 ns |
0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
6625 ns |
6750 ns |
0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
6750 ns |
6459 ns |
1.05 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
6583 ns |
6667 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
6750 ns |
6625 ns |
1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA |
306716.5 ns |
333087.5 ns |
0.92 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) |
749833 ns |
748458 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) |
746916 ns |
746645.5 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) |
748791 ns |
746833 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) |
749417 ns |
749417 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA |
21741 ns |
21817 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) |
777958 ns |
789125.5 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) |
795500 ns |
772625 ns |
1.03 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) |
799667 ns |
775145.5 ns |
1.03 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) |
784479 ns |
787875 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA |
294410 ns |
298327 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7416 ns |
7291 ns |
1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5916 ns |
5959 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6125 ns |
5750 ns |
1.07 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10458 ns |
10792 ns |
0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
32276 ns |
32858 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
262167 ns |
221541 ns |
1.18 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
230083 ns |
226958 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
248167 ns |
226625 ns |
1.10 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
226459 ns |
220292 ns |
1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
314403 ns |
360131.5 ns |
0.87 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
13000 ns |
10250 ns |
1.27 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
12291 ns |
9917 ns |
1.24 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
12500 ns |
12459 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
10875 ns |
10583.5 ns |
1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
198124.5 ns |
243730.5 ns |
0.81 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
25083.5 ns |
24834 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
25166 ns |
24833.5 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25250 ns |
24750 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
25104.5 ns |
24666 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
1042743 ns |
1133764 ns |
0.92 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
106137000 ns |
107061375 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
117694792 ns |
116928479.5 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
120933167 ns |
121136000 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
117852479.5 ns |
117635875 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
2636050.5 ns |
2659433 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
390604709 ns |
396814083.5 ns |
0.98 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
366692333 ns |
366591458 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
424499104 ns |
425794499.5 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
486163875 ns |
482285959 ns |
1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
15155612.5 ns |
15258375 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
762084521 ns |
769963270.5 ns |
0.99 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
754403541 ns |
576371708 ns |
1.31 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
747877125 ns |
745582312 ns |
1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
959384083 ns |
765495854.5 ns |
1.25 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
8208 ns |
7333 ns |
1.12 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7667 ns |
6334 ns |
1.21 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7583 ns |
7750 ns |
0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
7625 ns |
8333 ns |
0.92 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
226253.5 ns |
237972 ns |
0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
13958 ns |
14125 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
14000 ns |
13209 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
14542 ns |
13417 ns |
1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14500 ns |
13459 ns |
1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
1064461 ns |
1080162 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
8791.5 ns |
7667 ns |
1.15 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
8333 ns |
5583 ns |
1.49 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
7541 ns |
8167 ns |
0.92 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
7250 ns |
8291 ns |
0.87 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
236495 ns |
233794.5 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
12584 ns |
12542 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
12583 ns |
11875 ns |
1.06 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
12000 ns |
12645.5 ns |
0.95 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12750 ns |
11875 ns |
1.07 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
795022.5 ns |
787815 ns |
1.01 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) |
343417 ns |
332667 ns |
1.03 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) |
344208 ns |
344396 ns |
1.00 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) |
391084 ns |
395770.5 ns |
0.99 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) |
294084 ns |
312500 ns |
0.94 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA |
16891 ns |
16497 ns |
1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) |
706521 ns |
706958.5 ns |
1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) |
725708.5 ns |
725208 ns |
1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) |
1022083 ns |
1019750 ns |
1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) |
656916 ns |
658292 ns |
1.00 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA |
200720.5 ns |
198046.5 ns |
1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
375 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
417 ns |
375 ns |
1.11 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
23487 ns |
22951 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6875 ns |
6542 ns |
1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6667 ns |
6208 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6625 ns |
6792 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6750 ns |
6208 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
240615.5 ns |
237567.5 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
5750 ns |
5709 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
5750 ns |
5667 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
5708 ns |
5875 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
5792 ns |
5667 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
24487 ns |
24038 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
21375 ns |
21958 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
21542 ns |
20875 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
21542 ns |
21625 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
21917 ns |
21125 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
263364.5 ns |
260574.5 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
155396 ns |
146812.5 ns |
1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
145875 ns |
143875 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
150208 ns |
145917 ns |
1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
146709 ns |
178146 ns |
0.82 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
167437.5 ns |
166659.5 ns |
1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1336584 ns |
1355917 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1275583 ns |
1329374.5 ns |
0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1335084 ns |
861416.5 ns |
1.55 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1360584 ns |
1325916 ns |
1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1355756.5 ns |
1338261 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
24917 ns |
23084 ns |
1.08 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
24417 ns |
21458 ns |
1.14 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
25375 ns |
24042 ns |
1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
24437.5 ns |
23958 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
291074 ns |
350919.5 ns |
0.83 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
118167 ns |
179500 ns |
0.66 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
131959 ns |
120541 ns |
1.09 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
177208 ns |
118167 ns |
1.50 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
137250 ns |
151208 ns |
0.91 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1471741 ns |
1454020.5 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
334 ns |
292 ns |
1.14 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
416 ns |
375 ns |
1.11 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
375 ns |
291 ns |
1.29 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
23087.5 ns |
22580 ns |
1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
6833 ns |
6291 ns |
1.09 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
6750 ns |
6334 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6500 ns |
6791 ns |
0.96 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
6750 ns |
6208 ns |
1.09 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
257954.5 ns |
253799.5 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6792 ns |
5042 ns |
1.35 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4708 ns |
4250 ns |
1.11 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6000 ns |
5833.5 ns |
1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5895.5 ns |
4666 ns |
1.26 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
256219 ns |
254794.5 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10125 ns |
10042 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10292 ns |
10042 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10250 ns |
10417 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10250 ns |
10125 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
1358618 ns |
1352736 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1583 ns |
1625 ns |
0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1583 ns |
1583 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1584 ns |
1584 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1625 ns |
1542 ns |
1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA |
23481 ns |
23495 ns |
1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
5584 ns |
5708 ns |
0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
5958 ns |
5667 ns |
1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
5709 ns |
5750 ns |
0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
5959 ns |
5625 ns |
1.06 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA |
277247.5 ns |
273637.5 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
6821583 ns |
6842458 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
6380125 ns |
6343020.5 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
6552937.5 ns |
6507417 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
7533458 ns |
7623042 ns |
0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
214794 ns |
213659 ns |
1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
24101916.5 ns |
24131500 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
21276833.5 ns |
21298104 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
20988834 ns |
21004749.5 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
29842250 ns |
29792896 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2114879 ns |
2117701 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
37563979 ns |
37668083 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
45438542 ns |
34323688 ns |
1.32 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
45648292 ns |
45641000 ns |
1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
49527416.5 ns |
38230313 ns |
1.30 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7083 ns |
6459 ns |
1.10 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
7209 ns |
5250 ns |
1.37 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7416.5 ns |
7500 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7000 ns |
7458 ns |
0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
236381.5 ns |
235380.5 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8417 ns |
8541 ns |
0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8834 ns |
7792 ns |
1.13 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8500 ns |
8292 ns |
1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8584 ns |
9208 ns |
0.93 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
1061241 ns |
1057995 ns |
1.00 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) |
1558709 ns |
1525083 ns |
1.02 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) |
1267000 ns |
1258604.5 ns |
1.01 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) |
1619916 ns |
1613917 ns |
1.00 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) |
2135916 ns |
2159167 ns |
0.99 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA |
279335.5 ns |
273469.5 ns |
1.02 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) |
7913875 ns |
7971979 ns |
0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) |
6585896 ns |
6561833.5 ns |
1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) |
7119833 ns |
7004875 ns |
1.02 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) |
10553854 ns |
10476458 ns |
1.01 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA |
1875832 ns |
1860749 ns |
1.01 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) |
334271 ns |
326083.5 ns |
1.03 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) |
347750 ns |
347292 ns |
1.00 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) |
403041 ns |
379020.5 ns |
1.06 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) |
325792 ns |
343562.5 ns |
0.95 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA |
47242 ns |
46613.5 ns |
1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) |
747250 ns |
745458 ns |
1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) |
790812.5 ns |
781417 ns |
1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) |
1075687.5 ns |
1067437.5 ns |
1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) |
772458 ns |
751125 ns |
1.03 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA |
311274.5 ns |
306721.5 ns |
1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
397166 ns |
396333 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
288125 ns |
287916 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
288125 ns |
288062.5 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
750041 ns |
751542 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA |
44790 ns |
43483 ns |
1.03 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
663125 ns |
646375 ns |
1.03 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
527542 ns |
531834 ns |
0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
532084 ns |
530042 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
975042 ns |
973417 ns |
1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA |
192626.5 ns |
188389 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
651958 ns |
653542 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
645104.5 ns |
639041.5 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
641500 ns |
545542 ns |
1.18 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
647583.5 ns |
655584 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
132300 ns |
131455.5 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2470625 ns |
2529917 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2451792 ns |
2399708 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2459958 ns |
2436833 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2537146 ns |
2460520.5 ns |
1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1356668.5 ns |
1513461 ns |
0.90 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 342875 ns | 323146 ns | 1.06 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 342666 ns | 343771 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 394791 ns | 394750 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 290500 ns | 310562 ns | 0.94 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16338 ns | 15996 ns | 1.02 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 699604.5 ns | 699000 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 725000 ns | 717792 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1021833 ns | 1016334 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 650708 ns | 649937 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 200343 ns | 196510 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1466042 ns | 1458958 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1504208 ns | 1506167 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1504292 ns | 1503458 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1443375 ns | 1442834 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 41211 ns | 39862 ns | 1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5127354.5 ns | 5157334 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5278875 ns | 5010437.5 ns | 1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5250459 ns | 4993104 ns | 1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5013583 ns | 4988542 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 198454 ns | 197580.5 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3666 ns | 3709 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3667 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3667 ns | 3667 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3708 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 33443 ns | 32748 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15208 ns | 14833 ns | 1.03 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15375 ns | 15125 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15083 ns | 15292 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15166 ns | 15041 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 380259.5 ns | 374855 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 70979.5 ns | 71625 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71084 ns | 71333 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71041 ns | 71333 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71125 ns | 71333 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113289 ns | 113422 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 317167 ns | 326208 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 319541 ns | 318250 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 320167 ns | 319375 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 326750 ns | 317917 ns | 1.03 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 196353 ns | 192316 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1000 ns | 1.08 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1042 ns | 959 ns | 1.09 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 958 ns | 1083 ns | 0.88 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1000 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23589 ns | 23450 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8375 ns | 8042 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8417 ns | 7895.5 ns | 1.07 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8125 ns | 8333 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 7792 ns | 1.07 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 262304.5 ns | 258455 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 472833 ns | 465250 ns | 1.02 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 480833 ns | 472750 ns | 1.02 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 551167 ns | 547875 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 543125 ns | 554667 ns | 0.98 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129864.5 ns | 130091 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1386000 ns | 1420208 ns | 0.98 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1384250 ns | 1378895.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1602583 ns | 1600250 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1624229.5 ns | 1587791 ns | 1.02 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 274560 ns | 274988 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 334 ns | 1.12 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32065 ns | 31336 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6542 ns | 6625 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6584 ns | 5959 ns | 1.10 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6209 ns | 6354.5 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6584 ns | 6166 ns | 1.07 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 268387 ns | 261129.5 ns | 1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1724500 ns | 1730708 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1724917 ns | 1721229.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1729625 ns | 1723750 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1771291 ns | 1730229 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169401 ns | 168441.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4377917 ns | 4400167 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4335812.5 ns | 4366354 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4372021 ns | 3903958 ns | 1.12 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4418583.5 ns | 4358458 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1249618 ns | 1240708 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 7041 ns | 6792 ns | 1.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6729.5 ns | 6584 ns | 1.02 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7166 ns | 6833 ns | 1.05 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6833 ns | 14542 ns | 0.47 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20772 ns | 20531 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 52792 ns | 32708 ns | 1.61 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 47770.5 ns | 67708 ns | 0.71 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 50875 ns | 32833 ns | 1.55 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 52417 ns | 51667 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 210358.5 ns | 291979.5 ns | 0.72 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 356500 ns | 336292 ns | 1.06 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 349062 ns | 347187.5 ns | 1.01 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 421583 ns | 415021 ns | 1.02 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 300208.5 ns | 324666.5 ns | 0.92 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18809 ns | 18102.5 ns | 1.04 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 721458.5 ns | 718416.5 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 732250 ns | 727250 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1033791 ns | 1030292 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 673417 ns | 672709 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 344380.5 ns | 346719.5 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75583 ns | 75667 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75291 ns | 75208 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75209 ns | 75375 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75145.5 ns | 75000 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47560 ns | 46739 ns | 1.02 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 324250 ns | 333209 ns | 0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 332667 ns | 331291 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 327750 ns | 332729.5 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 333166 ns | 324292 ns | 1.03 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 213112 ns | 208913 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1490584 ns | 1483875 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1530958 ns | 1531875 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1530417 ns | 1529458 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1466709 ns | 1467834 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52219 ns | 51266 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5129458.5 ns | 5149875 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5247062.5 ns | 5290166.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5269917 ns | 5287000 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5022250 ns | 4982583 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 201172.5 ns | 202737.5 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28125 ns | 28291 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28167 ns | 28167 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28250 ns | 28291 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28250 ns | 28167 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24829 ns | 24497 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66292 ns | 66625 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66500 ns | 66542 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66833 ns | 66500 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66500 ns | 66500 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 533991 ns | 532969 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1498834 ns | 1260875 ns | 1.19 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1142333 ns | 1118417 ns | 1.02 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1135875 ns | 1056541 ns | 1.08 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2243250 ns | 2256375 ns | 0.99 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 568748 ns | 573252 ns | 0.99 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3097084 ns | 3028208 ns | 1.02 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2582250 ns | 2726937.5 ns | 0.95 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2755667 ns | 2733875 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3877542 ns | 3818500 ns | 1.02 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 2062479 ns | 1997088 ns | 1.03 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8842146 ns | 8958062.5 ns | 0.99 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8794833 ns | 8813834 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8782250 ns | 8742917 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6445917 ns | 6350021 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82917 ns | 82895.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 81166 ns | 80270.5 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85750 ns | 82875 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 93479.5 ns | 80167 ns | 1.17 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192413.5 ns | 192999 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2020417 ns | 2045708.5 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2018916.5 ns | 2026499.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2022875 ns | 2015875 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2038583 ns | 2005042 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 792691 ns | 797613 ns | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
c933d28 to 4a02032
use #1021 and remove the tracing part from the Lux extension
f041d46 to 863de31
[skip ci] [skip docs] [skip benchmarks] [skip tests]
863de31 to b88f628
needs EnzymeAD/Reactant.jl#216