Commit
chore: bump crate-ci/typos from 1.26.8 to 1.27.0 (#1022)
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.26.8 to 1.27.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.26.8...v1.27.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
8bfa628
Lux Benchmarks
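Each benchmark below lists two timings in nanoseconds followed by a ratio, and the ratio matches the first timing divided by the second, rounded to two decimals. The following is a minimal sketch of that arithmetic, not the tooling that produced this report; the function names and the `1.5` alert threshold are hypothetical, chosen only for illustration.

```python
# Minimal sketch: reproduce the ratio column and flag large slowdowns.
# `ratio`, `flag_slowdowns` and the 1.5 threshold are illustrative
# assumptions, not part of the benchmark tooling.

def ratio(first_ns: float, second_ns: float) -> float:
    """Return first/second rounded to two decimals, as in the ratio column."""
    return round(first_ns / second_ns, 2)


def flag_slowdowns(rows, threshold=1.5):
    """Return the names of rows whose ratio exceeds the given threshold."""
    return [name for name, first, second in rows if ratio(first, second) > threshold]


# Three rows taken from the results in this report.
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 4625, 4334),
    ("bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)", 1250, 3042),
    ("bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)", 3542, 1000),
]

for name, first, second in rows:
    print(f"{name}: {ratio(first, second)}")

print(flag_slowdowns(rows))
```

Ratios far from 1 on sub-microsecond CPU kernels may simply reflect timer noise, so a single flagged row is a prompt to re-run rather than proof of a regression.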
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
4625
ns4334
ns1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
4084
ns4125
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
5791
ns5417
ns1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4292
ns4167
ns1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
60959
ns59978
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10125
ns10333
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9959
ns10167
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
10375
ns10500
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10666
ns10167
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
427044
ns416390
ns1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1167
ns1166.5
ns1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1250
ns3042
ns0.41
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1458
ns1208
ns1.21
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3542
ns1000
ns3.54
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
18260
ns18063
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
4125
ns4084
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
3833
ns3958
ns0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4125
ns4250
ns0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4000
ns4125
ns0.97
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
111381
ns109325.5
ns1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
57709
ns56041
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
47250
ns46084
ns1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
38250
ns46375
ns0.82
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
80333
ns81834
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37655
ns36229
ns1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2026167
ns2056625
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2092708.5
ns2082416.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2059625.5
ns2056666.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1993416
ns1995458
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
197377
ns192802
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
152958
ns172458
ns0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
148250
ns144854.5
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
146417
ns148125
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
150375
ns146125
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
167595
ns166789
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1098542
ns1157666
ns0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1124250
ns1110395.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1116146
ns1128416.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1107229.5
ns1120208
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
523151
ns516061
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
3584
ns3583
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
3625
ns3583.5
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
5708.5
ns4229.5
ns1.35
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
3417
ns3292
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
70157
ns69748
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8834
ns8792
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8667
ns9125
ns0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9291
ns9000
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
9042
ns9209
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
492826.5
ns470533
ns1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
17000
ns15083
ns1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
16375
ns14875
ns1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
18667
ns16583
ns1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
17083
ns14917
ns1.15
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
54850
ns53475
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
213146
ns222375
ns0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
216104
ns213084
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
214167
ns213250
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
225333
ns213520.5
ns1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
272672.5
ns267675
ns1.02
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
459
ns500
ns0.92
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
542
ns542
ns1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
709
ns584
ns1.21
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
583
ns583
ns1
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
17542
ns17384
ns1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1708
ns1500
ns1.14
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1458
ns1500
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1625
ns1750
ns0.93
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1750
ns1583
ns1.11
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
104205
ns103376
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7250
ns7041
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5833
ns5625
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5209
ns5709
ns0.91
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
4000
ns9916
ns0.40
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23961
ns23093
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
228750.5
ns227583.5
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
228333
ns230417
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
228500
ns228000
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
226334
ns215542
ns1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
170956
ns166208.5
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
3875
ns3916
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
3875
ns3875
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3916
ns3834
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
3834
ns3834
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23832
ns23533
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16833
ns16708
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16708
ns16750
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16708
ns16791
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16958
ns16625
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
165501.5
ns160718
ns1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
579042
ns577333
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
574375
ns573417
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
575083
ns579000
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
576292
ns574042
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113664
ns113474
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
1417708
ns1432312.5
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1429333
ns1426250
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1425729.5
ns1425917
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
1422208
ns1418000
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
214791
ns211622
ns1.01
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s)
1082104
ns1046541
ns1.03
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s)
959958.5
ns965500
ns0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s)
1341792
ns1347458
ns1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s)
1294792
ns1290542
ns1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA
281583.5
ns267857
ns1.05
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s)
5777875
ns5895833.5
ns0.98
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s)
4456083
ns4588042
ns0.97
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s)
4934792
ns4928187
ns1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s)
5627500
ns5737167
ns0.98
lenet(28, 28, 1, 64)/zygote/GPU/CUDA
1106964
ns1066176
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns542
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
542
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23988
ns23460
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2084
ns2084
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2083
ns2125
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2125
ns2292
ns0.93
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2125
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
179026
ns169490.5
ns1.06
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
6084
ns5458
ns1.11
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6167
ns4000
ns1.54
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7041
ns5687.5
ns1.24
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6375
ns6250
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
66163.5
ns64594
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11291
ns11083
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
10791
ns11333
ns0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12125
ns12041
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11354.5
ns11083.5
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
456626.5
ns444224
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7000
ns6708
ns1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7042
ns6416
ns1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8375
ns7875
ns1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7042
ns6500
ns1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
52652
ns51136
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
17375
ns17583
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17167
ns16958
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
17770.5
ns18145.5
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
18708
ns16916
ns1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
306093.5
ns297812
ns1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
459
ns500
ns0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
459
ns500
ns0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
583
ns583
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
542
ns500
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
33004
ns31896
ns1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8583
ns8916
ns0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
8208
ns8667
ns0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9583
ns9250
ns1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
9042
ns8645.5
ns1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
162492.5
ns155805
ns1.04
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
64542
ns64937.5
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
64417
ns62625
ns1.03
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
64625
ns64500
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
64750
ns64667
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
112347.5
ns110478.5
ns1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
277542
ns294791
ns0.94
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
281625
ns279125
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
288750
ns275479.5
ns1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
275500
ns280854.5
ns0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
189809
ns185224.5
ns1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s)
3285583
ns3152041.5
ns1.04
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s)
3022333.5
ns3026187
ns1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s)
2780375
ns3022520.5
ns0.92
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s)
4038625
ns3964167
ns1.02
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA
573967
ns573818.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s)
7586208.5
ns7551166.5
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s)
7415437
ns7449979
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s)
7333375
ns7447000
ns0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s)
8220958
ns8208396
ns1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA
1351752.5
ns1327975
ns1.02
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s)
18835167
ns18867458
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s)
19044834
ns19142541
ns0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s)
19135125
ns19088834
ns1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)
15633417
ns15711167
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s)
23661916.5
ns24315583.5
ns0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s)
33965500
ns33983500
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s)
41107417
ns37046583.5
ns1.11
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s)
34858709
ns34841833
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA
1862815
ns2130242
ns0.87
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)
189289541
ns192387270.5
ns0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s)
164224708
ns163943875
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s)
157847979
ns152577625
ns1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)
438904833
ns437847333
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA
13913764
ns14119852
ns0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s)
289733584
ns294725229.5
ns0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s)
338173667
ns338344395.5
ns1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s)
307489541.5
ns300590083.5
ns1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s)
393585937.5
ns396800708.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
21708.5
ns23687.5
ns0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
24458
ns23083
ns1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
25937
ns24791
ns1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
24229
ns23708
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
96907
ns95862
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
103750
ns103250
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
105292
ns103458
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
104208
ns103667
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
151250
ns102750
ns1.47
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
504189
ns494978
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6583
ns7083
ns0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7292
ns5750
ns1.27
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7959
ns6875
ns1.16
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6958
ns7000
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
68581
ns67128
ns1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14916.5
ns15375
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14709
ns15395.5
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
16666
ns16000
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14292
ns14791.5
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
483895
ns467877
ns1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s)
3017937
ns3009166.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s)
2022458
ns2067250
ns0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s)
2307959
ns2279667
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s)
4846645.5
ns4832667
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA
585796
ns581800.5
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s)
23617917
ns23921708.5
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s)
17975417
ns18037292
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s)
18323812.5
ns16963187.5
ns1.08
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s)
35597209
ns34623770.5
ns1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA
3109235
ns3105602
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s)
33405687.5
ns33780291
ns0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s)
27693604
ns27715666.5
ns1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s)
27860958
ns27451041
ns1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s)
42002937.5
ns41640208
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
72375
ns80479
ns0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
84624.5
ns72416
ns1.17
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
83250
ns78354
ns1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
73750
ns74645.5
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
102852
ns100885
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
218167
ns311542
ns0.70
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
309979
ns224520.5
ns1.38
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
317479
ns209667
ns1.51
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
288875
ns257021
ns1.12
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
550996
ns539235
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12041
ns12500
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12729.5
ns11708
ns1.09
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13833
ns12542
ns1.10
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11666.5
ns12833.5
ns0.91
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
71604
ns70648
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26625
ns26667
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26959
ns26958.5
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
28292
ns27333.5
ns1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26458
ns26625
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
484486.5
ns470896
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
12417
ns12791
ns0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
12542
ns12333
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
14584
ns13500
ns1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
13041.5
ns12875
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
53694
ns52214
ns1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26312.5
ns25959
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26270.5
ns25750
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
26667
ns26500
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
26333
ns26500
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
309291.5
ns300818.5
ns1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
178770.5
ns180750
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
182334
ns179583
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
184895.5
ns183146
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
179750
ns179250
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
57908
ns56380
ns1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
587125
ns593542
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
596500
ns582459
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
593770.5
ns585042
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
583166
ns594562
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
290369.5
ns284588
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7354.5
ns6770.5
ns1.09
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7167
ns5958
ns1.20
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
7875
ns7084
ns1.11
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
6833
ns7125
ns0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
70829
ns70103
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14375
ns14709
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14708
ns14500
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15625
ns15291.5
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
14083
ns13958
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
471312.5
ns460969.5
ns1.02
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
1235042
ns1217750
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
1283583
ns1209125
ns1.06
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
1282875
ns1249750
ns1.03
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
1325208
ns1326625
ns1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA
301270
ns302841
ns0.99
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
4111125
ns4351270.5
ns0.94
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
4361625
ns4353042
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
4786395.5
ns4630333
ns1.03
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
4453229.5
ns4466479
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1047552
ns1039570
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1792
ns1833
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1750
ns1792
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1834
ns1833
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1834
ns1875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23328
ns23644
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
4833
ns4875
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4792
ns4875
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4917
ns5042
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
4917
ns4875
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
186698
ns189061.5
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
7208.5
ns6021
ns1.20
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5584
ns5708
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
8667
ns7042
ns1.23
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
7312.5
ns7416
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
54539
ns54998.5
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
10833
ns11437.5
ns0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
10834
ns11084
ns0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
12375
ns11666
ns1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
11916
ns12333
ns0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
329099
ns332242
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
292
ns333
ns0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
292
ns292
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
334
ns333
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
333
ns292
ns1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
22753
ns22998
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2708
ns2667
ns1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2667
ns2750
ns0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2959
ns2750
ns1.08
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
3000
ns2709
ns1.11
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
157496
ns158762.5
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
13167
ns13687.5
ns0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
13166
ns11208
ns1.17
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
15000
ns13958
ns1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
13792
ns14125
ns0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
55218
ns57325
ns0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
24833
ns24625
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24542
ns24250
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
25375
ns25500
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
24709
ns24875
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
289966
ns295945
ns0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4083
ns4167
ns0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4166
ns4166
ns1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4167
ns4167
ns1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4125
ns4125
ns1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA
24660
ns24912
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
15958
ns16084
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16417
ns16209
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16042
ns16333.5
ns0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16125
ns16208
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA
194045.5
ns199034.5
ns0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5667
ns5708
ns0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
5625
ns5584
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
5750
ns5708
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
5791
ns5708
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
32989
ns33099
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
21125
ns21166
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
20459
ns20458
ns1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
21542
ns21333.5
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
20875
ns20875
ns1
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
174273
ns174613
ns1.00
| batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 403209 ns | 383042 ns | 1.05 |
| batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 371125 ns | 373541 ns | 0.99 |
| batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 474292 ns | 485896 ns | 0.98 |
| batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 539604.5 ns | 532854.5 ns | 1.01 |
| batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66734 ns | 66578.5 ns | 1.00 |
| batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 1011917 ns | 938166 ns | 1.08 |
| batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 884896 ns | 847083 ns | 1.04 |
| batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1220125 ns | 1235042 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 1400208 ns | 1418833 ns | 0.99 |
| batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 190566.5 ns | 191164 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 82917 ns | 81020.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 82791 ns | 80354.5 ns | 1.03 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 88958.5 ns | 82250 ns | 1.08 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83187.5 ns | 132458 ns | 0.63 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192556.5 ns | 192525 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1921500 ns | 1945166 ns | 0.99 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1696166 ns | 1909584 ns | 0.89 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1938083 ns | 1920333 ns | 1.01 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1915875 ns | 1914354.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 393732 ns | 402795 ns | 0.98 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 292 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21580 ns | 21790 ns | 0.99 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 1791 ns | 1.00 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1916 ns | 0.96 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1792 ns | 1.02 |
| dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 165924 ns | 172681 ns | 0.96 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6708 ns | 8000 ns | 0.84 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6250 ns | 6833 ns | 0.91 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9750 ns | 8334 ns | 1.17 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8125 ns | 7999.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 56950.5 ns | 62227.5 ns | 0.92 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8916.5 ns | 9375 ns | 0.95 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8958 ns | 8875 ns | 1.01 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9625 ns | 9625 ns | 1 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9542 ns | 9250 ns | 1.03 |
| groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 299584.5 ns | 315550.5 ns | 0.95 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 120035854.5 ns | 159022167 ns | 0.75 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174382959 ns | 174256125 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 154831333 ns | 147914021 ns | 1.05 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 103109500 ns | 102407958 ns | 1.01 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5474606 ns | 5468366 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 617124000 ns | 678096083 ns | 0.91 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 555612167 ns | 555598625 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 468382792 ns | 453528479 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 756087750 ns | 754205958.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 38213656 ns | 34940005 ns | 1.09 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 651747459 ns | 703546875 ns | 0.93 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 666674583.5 ns | 666832020.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 602170708.5 ns | 585927312.5 ns | 1.03 |
| Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 734251875 ns | 742692916 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57208 ns | 57542 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 48167 ns | 47583 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39167 ns | 47291 ns | 0.83 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83958 ns | 82208 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37250 ns | 37135 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1929792 ns | 1947333 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1973292 ns | 1971042 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1984249.5 ns | 1976458 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1881417 ns | 1893520.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 171491 ns | 171380.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 273354 ns | 272291 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 267959 ns | 265834 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 270687.5 ns | 289417 ns | 0.94 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 268834 ns | 267167 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 124192.5 ns | 135867.5 ns | 0.91 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 658333 ns | 671917 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 674854.5 ns | 596708 ns | 1.13 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 665333 ns | 696292 ns | 0.96 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 670500 ns | 692687.5 ns | 0.97 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 664813 ns | 737698 ns | 0.90 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2190167 ns | 2231188 ns | 0.98 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2214354.5 ns | 2215042 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2216958.5 ns | 2207229 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2099979 ns | 2243770.5 ns | 0.94 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 133238 ns | 133226 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5505354.5 ns | 5572500 ns | 0.99 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5504750 ns | 5486875 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5565292 ns | 5511083 ns | 1.01 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5499708 ns | 5495666.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 740235 ns | 759202.5 ns | 0.98 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 650417 ns | 652833.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 649020.5 ns | 657229 ns | 0.99 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 640625 ns | 639500 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 648292 ns | 639791 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47265 ns | 46976 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1821708 ns | 1799583 ns | 1.01 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1720959 ns | 1724792 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1675729.5 ns | 1722792 ns | 0.97 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2108500 ns | 2103895.5 ns | 1.00 |
| dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 224014 ns | 221178.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58583 ns | 56541 ns | 1.04 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46645.5 ns | 46833 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38750 ns | 46041 ns | 0.84 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83834 ns | 83792 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28947 ns | 28073 ns | 1.03 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2024916 ns | 2058250 ns | 0.98 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2086188 ns | 2078709 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2100521 ns | 2093000 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1993416.5 ns | 1996646 ns | 1.00 |
| batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 191815.5 ns | 187152 ns | 1.02 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13473875 ns | 13406125 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12547041.5 ns | 12455458 ns | 1.01 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12559604 ns | 12584792 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15213416.5 ns | 14882959 ns | 1.02 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 517805 ns | 517201.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47353458 ns | 47687000 ns | 0.99 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41833334 ns | 41754625 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41118750 ns | 40922625 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58300041 ns | 58112708 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3203904 ns | 3212087 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 74077042 ns | 74213479 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 68022250 ns | 68010000 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90906749.5 ns | 90988625 ns | 1.00 |
| Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 99115937.5 ns | 76809750 ns | 1.29 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58958 ns | 56917 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47375 ns | 47042 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38729.5 ns | 47041 ns | 0.82 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83500 ns | 83375 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47777 ns | 46301 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1923375 ns | 1939854 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1961541 ns | 1973333 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1980229 ns | 1974729.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1890354 ns | 1884375 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 194350.5 ns | 189579 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 291 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 291 ns | 333 ns | 0.87 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 32617.5 ns | 31617 ns | 1.03 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6208.5 ns | 6229.5 ns | 1.00 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 5958 ns | 6167 ns | 0.97 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6708 ns | 6458 ns | 1.04 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6437.5 ns | 6167 ns | 1.04 |
| batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 173722.5 ns | 171396 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 250 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
| dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32110 ns | 31328 ns | 1.02 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2583 ns | 2583 ns | 1 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2542 ns | 2625 ns | 0.97 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2833 ns | 2792 ns | 1.01 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2833 ns | 2625 ns | 1.08 |
| dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 161891 ns | 161410 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 286335145.5 ns | 324182500 ns | 0.88 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 339870250 ns | 339536042 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 320445937.5 ns | 314625854 ns | 1.02 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 272825875 ns | 273060250 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7113314 ns | 7093070 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 990386709 ns | 1051455583 ns | 0.94 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 938484666 ns | 941830875 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 868613416.5 ns | 858538271 ns | 1.01 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1158749666 ns | 1153691292 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33903874 ns | 34020243.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1310266104.5 ns | 1359481562.5 ns | 0.96 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1325766333.5 ns | 1360673729 ns | 0.97 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1623996500 ns | 1640965792 ns | 0.99 |
| Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1663239334 ns | 1309802292 ns | 1.27 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1461479 ns | 1414416.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1415750 ns | 1409541 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1429167 ns | 1408500 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1414437.5 ns | 1453875 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 128213 ns | 127358 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5019792 ns | 5056229 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5022458 ns | 5013583 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5050000 ns | 4954291 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5006541.5 ns | 5017021 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 557532 ns | 601067 ns | 0.93 |
| vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 175263520.5 ns | 170719208 ns | 1.03 |
| vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 129816208.5 ns | 132607979.5 ns | 0.98 |
| vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 145953208.5 ns | 124493437.5 ns | 1.17 |
| vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 164619104.5 ns | 162230500 ns | 1.01 |
| vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4883992 ns | 4886055.5 ns | 1.00 |
| vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 831528333 ns | 854987208 ns | 0.97 |
| vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 497840084 ns | 644456708 ns | 0.77 |
| vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 556789916 ns | 532057834 ns | 1.05 |
| vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 679969833 ns | 687805708 ns | 0.99 |
| vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16195623 ns | 16138006 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8914083 ns | 9114041.5 ns | 0.98 |
| batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8769917 ns | 8770313 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 8216313 ns | 7860292 ns | 1.05 |
| batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 10158000 ns | 10147292 ns | 1.00 |
| batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1595526 ns | 1612586 ns | 0.99 |
| batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 35894250 ns | 37546375 ns | 0.96 |
| batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 36843625 ns | 36886146 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 34476562 ns | 33451021 ns | 1.03 |
| batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 38802729 ns | 38875771 ns | 1.00 |
| batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6454567.5 ns | 6459090.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47396 ns | 47458.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 49334 ns | 49333 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47542 ns | 49583 ns | 0.96 |
| bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47417 ns | 47250 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 19457 ns | 18585 ns | 1.05 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50292 ns | 50584 ns | 0.99 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50520.5 ns | 50416 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50584 ns | 50708.5 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50250 ns | 50500 ns | 1.00 |
| bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 189575 ns | 216293 ns | 0.88 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8104 ns | 7979.5 ns | 1.02 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6791 ns | 6791 ns | 1 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9125 ns | 8875 ns | 1.03 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7333 ns | 8583 ns | 0.85 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 86829.5 ns | 106035 ns | 0.82 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9875 ns | 10333 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9583 ns | 9958 ns | 0.96 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10375 ns | 10500 ns | 0.99 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10208 ns | 10167 ns | 1.00 |
| groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 537525 ns | 612658 ns | 0.88 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8208 ns | 8750 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 8250 ns | 6438 ns | 1.28 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9812.5 ns | 8667 ns | 1.13 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6375 ns | 5875 ns | 1.09 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 113788.5 ns | 119844.5 ns | 0.95 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13333.5 ns | 13375 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12625 ns | 13000 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13584 ns | 13416 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13208 ns | 12791 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 479705.5 ns | 517417.5 ns | 0.93 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 958 ns | 1042 ns | 0.92 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 958 ns | 958 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1042 ns | 1042 ns | 1 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 32580 ns | 31817 ns | 1.02 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7750 ns | 8041 ns | 0.96 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7625 ns | 7750 ns | 0.98 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8542 ns | 8333 ns | 1.03 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8208 ns | 8292 ns | 0.99 |
| batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 201701.5 ns | 203048 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23250 ns | 23145.5 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23042 ns | 24541 ns | 0.94 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23500 ns | 24167 ns | 0.97 |
| bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23167 ns | 23334 ns | 0.99 |
| bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18765.5 ns | 18371 ns | 1.02 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52875 ns | 52542 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52292 ns | 52416 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52792 ns | 52500 ns | 1.01 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52459 ns | 52334 ns | 1.00 |
| bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 260844.5 ns | 295739.5 ns | 0.88 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1400229 ns | 1440625 ns | 0.97 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1398666.5 ns | 1400291 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1400708 ns | 1400875 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1398917 ns | 1406313 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 196521.5 ns | 194620 ns | 1.01 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5018604 ns | 5047479.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5004729.5 ns | 5003458.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5044229.5 ns | 4836292 ns | 1.04 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5001271 ns | 4996708 ns | 1.00 |
| groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 595122 ns | 628014 ns | 0.95 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3043083 ns | 3062438 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2094042 ns | 2084417 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2287146 ns | 2227208.5 ns | 1.03 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4530875 ns | 4812250 ns | 0.94 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 582703 ns | 579246 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24366625 ns | 24741125 ns | 0.98 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18829583 ns | 18811521 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 19120291 ns | 18691437 ns | 1.02 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36653000 ns | 36587416 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3189516.5 ns | 3196070 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33943229 ns | 34435312 ns | 0.99 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28373417 ns | 28306583.5 ns | 1.00 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28357208 ns | 28069750 ns | 1.01 |
| Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41659750 ns | 41958375 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144299750 ns | 145325041 ns | 0.99 |
| batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 142248375 ns | 141848041.5 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 126632146 ns | 123758375 ns | 1.02 |
| batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 173840291.5 ns | 173196604 ns | 1.00 |
| batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22781482 ns | 22560824 ns | 1.01 |
| batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1307941437.5 ns | 942531917 ns | 1.39 |
| batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1133574500.5 ns | 871530625 ns | 1.30 |
| batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 711240125 ns | 1498315250 ns | 0.47 |
| batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 670828250 ns | 674150833 ns | 1.00 |
| batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118499942 ns | 118289465 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74542 ns | 76208 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73917 ns | 75041 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 83125 ns | 77875 ns | 1.07 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 72916.5 ns | 75417 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 225032.5 ns | 273038.5 ns | 0.82 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 202979.5 ns | 299708 ns | 0.68 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 282792 ns | 284646 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 253479.5 ns | 191687.5 ns | 1.32 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 244146 ns | 202979.5 ns | 1.20 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1201754 ns | 1439967 ns | 0.83 |
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35408938 ns | 36345458 ns | 0.97 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 35449645.5 ns | 35416645.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32512083 ns | 32239562.5 ns | 1.01 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 41003541.5 ns | 40930312.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5848198 ns | 5849412 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 146608875 ns | 151966416 ns | 0.96 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 151542938 ns | 152232437.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 138849083 ns | 136165208.5 ns | 1.02 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 287439584 ns | 287396625 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34913824 ns | 34914778 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 121086291.5 ns | 158627833 ns | 0.76 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174190000 ns | 174511667 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 155717667 ns | 148215771.5 ns | 1.05 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106488666.5 ns | 108212479 ns | 0.98 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5478422 ns | 5459784 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 611208666 ns | 524328229.5 ns | 1.17 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 466441167 ns | 467038291 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 453562937.5 ns | 441190000 ns | 1.03 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 741621625 ns | 741818542 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35157227 ns | 32279915 ns | 1.09 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 648662584 ns | 692549750 ns | 0.94 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 657411208 ns | 656203708.5 ns | 1.00 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 585962375 ns | 573625208 ns | 1.02 |
| Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 845072208 ns | 853537834 ns | 0.99 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1304708 ns | 1226937.5 ns | 1.06 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 965666 ns | 992979 ns | 0.97 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 744354 ns | 904625 ns | 0.82 |
| mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 1944604 ns | 2085917 ns | 0.93 |
| mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 572387 ns | 566912.5 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2974271 ns | 2909667 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2531646 ns | 2628208 ns | 0.96 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2512854 ns | 2006333.5 ns | 1.25 |
| mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3691334 ns | 3693750.5 ns | 1.00 |
| mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1817474 ns | 1796011.5 ns | 1.01 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 6642416 ns | 6757875 ns | 0.98 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 6630792 ns | 6503250 ns | 1.02 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 6466375 ns | 6239125 ns | 1.04 |
| mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 4443145.5 ns | 4454771 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7334 ns | 7250 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6208 ns | 6167 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5458 ns | 6208 ns | 0.88 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10167 ns | 10250 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25916 ns | 24809.5 ns | 1.04 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212104 ns | 213666 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219562.5 ns | 220313 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220667 ns | 220125 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206291 ns | 209542 ns | 0.98 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 257490 ns | 276995.5 ns | 0.93 |
| vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 301772791.5 ns | 315354292 ns | 0.96 |
| vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 222879750 ns | 221860750 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 222700312.5 ns | 197740833.5 ns | 1.13 |
| vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 311773125 ns | 312004542 ns | 1.00 |
| vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7676597.5 ns | 7676221 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1082870459 ns | 1085627020.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 892532250 ns | 891084375.5 ns | 1.00 |
| vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 883941208.5 ns | 865730125 ns | 1.02 |
| vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1154293562 ns | 1163266979.5 ns | 0.99 |
| vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26959026 ns | 26544800.5 ns | 1.02 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6459 ns | 6083 ns | 1.06 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5209 ns | 5583 ns | 0.93 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10000 ns | 7375 ns | 1.36 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5708.5 ns | 5270.5 ns | 1.08 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 168546.5 ns | 178949 ns | 0.94 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7458 ns | 7708 ns | 0.97 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6792 ns | 7292 ns | 0.93 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7542 ns | 7500 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7792 ns | 6792 ns | 1.15 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 639812.5 ns | 667282.5 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 458 ns | 542 ns | 0.85 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 458 ns | 459 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 459 ns | 1.18 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 24361 ns | 23245 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9000 ns | 9583.5 ns | 0.94 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9000 ns | 9167 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9583 ns | 9458.5 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9708 ns | 8792 ns | 1.10 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 234125.5 ns | 227149 ns | 1.03 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 351500 ns | 352521.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 351500 ns | 352709 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 351916 ns | 352958.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 356625 ns | 352708 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 21502 ns | 21007 ns | 1.02 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 811270.5 ns | 828104 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 774958.5 ns | 820292 ns | 0.94 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 776584 ns | 773500 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 821875 ns | 828312 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 315795.5 ns | 289596 ns | 1.09 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 335896 ns | 312083.5 ns | 1.08 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 338208.5 ns | 340166.5 ns | 0.99 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 441167 ns | 445354 ns | 0.99 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 331375 ns | 333520.5 ns | 0.99 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18761.5 ns | 17918 ns | 1.05 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 695166 ns | 691583 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 738208 ns | 732334 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 1036458 ns | 1026459 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 692396 ns | 691042 ns | 1.00 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 292461.5 ns | 273557 ns | 1.07 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 354166.5 ns | 332396 ns | 1.07 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 346771 ns | 348875 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 433791 ns | 409541 ns | 1.06 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 370250 ns | 375250 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 23121 ns | 22378 ns | 1.03 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 757417 ns | 755875 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 749625 ns | 743000 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1070562.5 ns | 1068417 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 828458 ns | 822124.5 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 257074.5 ns | 239682 ns | 1.07 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3292 ns | 3625 ns | 0.91 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3458 ns | 3417 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3583 ns | 1.05 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3417 ns | 3583 ns | 0.95 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 18586 ns | 17823 ns | 1.04 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4167 ns | 4208 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4375 ns | 4167 ns | 1.05 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4417 ns | 4375 ns | 1.01 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4250 ns | 4292 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 296700.5 ns | 271995 ns | 1.09 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3625 ns | 4792 ns | 0.76 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3750 ns | 3834 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6541 ns | 5250 ns | 1.25 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6354.5 ns | 3625 ns | 1.75 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 232189.5 ns | 214003.5 ns | 1.08 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8187.5 ns | 8354.5 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8000 ns | 8334 ns | 0.96 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8458 ns | 8667 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8500 ns | 8417 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1227082 ns | 1200425 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203417 ns | 204209 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209541.5 ns | 210000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 208250 ns | 211875 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 198709 ns | 199417 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35300 ns | 34086 ns | 1.04 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 612417 ns | 608520.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 623292 ns | 620750 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 623250 ns | 620416 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630166 ns | 628625 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 347973 ns | 347622 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 977646 ns | 980000 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 935437.5 ns | 929916.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 970083 ns | 954250 ns | 1.02 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 1286374.5 ns | 1278542 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 209031 ns | 206777 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4514333 ns | 4651729 ns | 0.97 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4466146 ns | 4500083 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4452875 ns | 4296645.5 ns | 1.04 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 6260416.5 ns | 6216979.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 947144.5 ns | 942518 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3542 ns | 3916 ns | 0.90 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3417 ns | 3375 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5896 ns | 4667 ns | 1.26 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6667 ns | 3354.5 ns | 1.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 219336.5 ns | 231395.5 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6917 ns | 7375 ns | 0.94 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6958 ns | 7292 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7708 ns | 7667 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7291 ns | 7000 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 1020167.5 ns | 1002762 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1635042 ns | 1644583 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1200395.5 ns | 1174458 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1363584 ns | 1323125 ns | 1.03 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2345187.5 ns | 2461333.5 ns | 0.95 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215784.5 ns | 213304.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12316854.5 ns | 12444729.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9564000 ns | 9564709 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9378437.5 ns | 9234833 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 17989542 ns | 18020417 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1948181 ns | 1940786 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17368125 ns | 17431792 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14382958 ns | 14392958.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14502250 ns | 14240000 ns | 1.02 |
| Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21085917 ns | 21049562.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 90917 ns | 90625 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 89500 ns | 88041 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 91833 ns | 92333 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 113437.5 ns | 136917 ns | 0.83 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126891 ns | 125618 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2009625 ns | 2061125 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2030000 ns | 2018458 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2039270.5 ns | 1720042 ns | 1.19 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1871125 ns | 2024104 ns | 0.92 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1032563 ns | 1024038 ns | 1.01 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 342166.5 ns | 331312 ns | 1.03 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 343375 ns | 343500 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 406458 ns | 395083 ns | 1.03 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 311729 ns | 310458.5 ns | 1.00 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16465.5 ns | 15733 ns | 1.05 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 706208 ns | 699959 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 728542 ns | 722062.5 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 1018584 ns | 1018209 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 650375 ns | 646375 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 195366.5 ns | 189475.5 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7167 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5958 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5416 ns | 5875 ns | 0.92 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 10000 ns | 1 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34591 ns | 33239 ns | 1.04 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 243791 ns | 221625 ns | 1.10 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220125 ns | 219959 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221083 ns | 219750 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 239167 ns | 218375 ns | 1.10 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 327793 ns | 314279 ns | 1.04 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3667 ns | 3750 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3667 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3667 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3667 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22616 ns | 22722 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14292 ns | 14167 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14416 ns | 14334 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14208 ns | 14291 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14417 ns | 14375 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 480334.5 ns | 475447 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 94458 ns | 95166.5 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 92625 ns | 91833 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 96875 ns | 96125 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 96229.5 ns | 139167 ns | 0.69 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126007 ns | 125450 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1714792 ns | 1948250 ns | 0.88 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1926792 ns | 1921104.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1913291.5 ns | 1669729.5 ns | 1.15 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1711417 ns | 1920708.5 ns | 0.89 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1034230 ns | 954893.5 ns | 1.08 |
| lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 876916.5 ns | 854375 ns | 1.03 |
| lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 817791 ns | 817542 ns | 1.00 |
| lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1169438 ns | 1213833.5 ns | 0.96 |
| lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 966187.5 ns | 958895.5 ns | 1.01 |
| lenet(28, 28, 1, 32)/forward/GPU/CUDA | 275657.5 ns | 276078 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2828583 ns | 2843334 ns | 0.99 |
| lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2474833 ns | 2456145.5 ns | 1.01 |
| lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3335750 ns | 3332000 ns | 1.00 |
| lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3304292 ns | 3419792 ns | 0.97 |
| lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1618381.5 ns | 1629171 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16709 ns | 15333 ns | 1.09 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15625 ns | 14709 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 18667 ns | 17041 ns | 1.10 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15583 ns | 14333 ns | 1.09 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 142594 ns | 142609.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 228750 ns | 262125 ns | 0.87 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215750 ns | 215416.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 217625 ns | 215250 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255500 ns | 221958 ns | 1.15 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 641543.5 ns | 641081.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222458 ns | 221583.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 221500 ns | 218625 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 223458.5 ns | 222833 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 222604.5 ns | 221750 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 269850.5 ns | 271537.5 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 537583 ns | 497750 ns | 1.08 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 497334 ns | 494833 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 499583 ns | 497084 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 526833 ns | 509000 ns | 1.04 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1430878.5 ns | 1365399 ns | 1.05 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 330125 ns | 315729 ns | 1.05 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 332834 ns | 333917 ns | 1.00 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 435458.5 ns | 375125 ns | 1.16 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 315917 ns | 322083 ns | 0.98 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16581 ns | 16846 ns | 0.98 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 717084 ns | 710041 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 728166.5 ns | 725063 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 1021104 ns | 1022417 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 662729.5 ns | 663021 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 195479.5 ns | 196884 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17875 ns | 17625 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17167 ns | 16708 ns | 1.03 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20250 ns | 18792 ns | 1.08 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17208 ns | 17625 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 145639 ns | 144721 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223750 ns | 220104.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212417 ns | 212792 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214041 ns | 212750 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 221917 ns | 217250 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1035551.5 ns | 955774 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6708 ns | 6042 ns | 1.11 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6333 ns | 4250 ns | 1.49 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7208 ns | 6958 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6625 ns | 6541 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 240542 ns | 245177 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10584 ns | 10583.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9917 ns | 10250 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11166.5 ns | 10708 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10917 ns | 10084 ns | 1.08 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 1097401.5 ns | 1099715 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3500 ns | 4542 ns | 0.77 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3208 ns | 3208 ns | 1 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6333.5 ns | 4834 ns | 1.31 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6750 ns | 2875 ns | 2.35 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 250006 ns | 250616.5 ns | 1.00 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7625 ns | 7125 ns | 1.07 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7084 ns | 7375 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8125 ns | 7750 ns | 1.05 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7500 ns | 7375 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1102649 ns | 1110249 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23315625 ns | 24293729.5 ns | 0.96 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 34529125 ns | 34647499.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 41513333.5 ns | 38065167 ns | 1.09 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34929834 ns | 34799687.5 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1838602 ns | 1834951 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 184421875 ns | 187799375 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 159459792 ns | 159175458 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 151225083 ns | 146555271 ns | 1.03 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 413223958 ns | 415008291 ns | 1.00 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16387494 ns | 16504056.5 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 428743125 ns | 437855250 ns | 0.98 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 252439020.5 ns | 254443000 ns | 0.99 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 233017396 ns | 231693624.5 ns | 1.01 |
| Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 484197291 ns | 485497958 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 183584 ns | 184229.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182750 ns | 181916 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 186625 ns | 184084 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 183146 ns | 182167 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 228677.5 ns | 230730 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 596083 ns | 637084 ns | 0.94 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 586292 ns | 586270.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 589770.5 ns | 586583 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 631958 ns | 631542 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1119701 ns | 1097701 ns | 1.02 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3838833 ns | 3894562.5 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3643375.5 ns | 3827292 ns | 0.95 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3563521 ns | 3469958 ns | 1.03 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 5359750 ns | 5353020.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 537722 ns | 535365 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17412417 ns | 18146250 ns | 0.96 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17190667 ns | 17166041.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 17100375 ns | 16601417 ns | 1.03 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 22144083 ns | 22202083 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2612799 ns | 2616593 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 458 ns | 458 ns | 1 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 500 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32035 ns | 32123 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9208 ns | 9458 ns | 0.97 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8542 ns | 8667 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10208 ns | 9167 ns | 1.11 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9459 ns | 9208 ns | 1.03 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 264327.5 ns | 267754 ns | 0.99 |
| vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 504274209 ns | 580762562.5 ns | 0.87 |
| vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 430218396 ns | 427173312.5 ns | 1.01 |
| vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 471374500 ns | 376948624.5 ns | 1.25 |
| vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 672994208.5 ns | 671986666.5 ns | 1.00 |
| vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12486595 ns | 12479261 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 2049529562.5 ns | 2061821458.5 ns | 0.99 |
| vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1632649709 ns | 1626836125 ns | 1.00 |
| vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1536417708 ns | 1500724875 ns | 1.02 |
| vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2205666041.5 ns | 2217147562.5 ns | 0.99 |
| vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49389302 ns | 48947892 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1657645.5 ns | 1651250 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1189208.5 ns | 1196959 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1382000 ns | 1346187.5 ns | 1.03 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2334125 ns | 2356042 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214982 ns | 218070 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12688500 ns | 12822417 ns | 0.99 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9942000 ns | 9953541.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9748312.5 ns | 9605000 ns | 1.01 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18407312 ns | 18408062.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2050613 ns | 2047696.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17691583.5 ns | 17771104.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14746041.5 ns | 14762729 ns | 1.00 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14804417 ns | 14473917 ns | 1.02 |
| Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21386084 ns | 21336042 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26167 ns | 26250 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26292 ns | 26209 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26291 ns | 26583 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26291 ns | 26209 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24125 ns | 24922 ns | 0.97 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66875 ns | 66792 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66917 ns | 67000 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67083 ns | 66791 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67209 ns | 66916 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 398847.5 ns | 410676.5 ns | 0.97 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 202667 ns | 203542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 209000 ns | 210583 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209167 ns | 210500 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199583 ns | 199958 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26392 ns | 26405 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 612416.5 ns | 602333 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 627416.5 ns | 621292 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 667979 ns | 621250 ns | 1.08 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 631250 ns | 630584 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 353043.5 ns | 355627 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 645542 ns | 657646 ns | 0.98 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 643375 ns | 638729 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 664187.5 ns | 544125 ns | 1.22 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 540834 ns | 677396 ns | 0.80 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132126 ns | 132242 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2247375 ns | 2305542 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2239958 ns | 2254292 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2302917 ns | 1426250 ns | 1.61 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2219000 ns | 2248542 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1328726 ns | 1182706 ns | 1.12 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17667 ns | 17937.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 16979.5 ns | 17042 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20792 ns | 19500 ns | 1.07 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18500 ns | 16895.5 ns | 1.09 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 146392.5 ns | 144900 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 229708 ns | 220000 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 225333 ns | 218416.5 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229292 ns | 219458 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 259083 ns | 261708 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1081671 ns | 1051792 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 459 ns | 1.09 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 459 ns | 459 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 458 ns | 1.18 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23645 ns | 23475 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9833.5 ns | 9520.5 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9542 ns | 9541 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10708 ns | 10166 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9916 ns | 9375 ns | 1.06 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 262941 ns | 261505 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7291 ns | 6542 ns | 1.11 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5833 ns | 5292 ns | 1.10 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9625 ns | 6625 ns | 1.45 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7250 ns | 7416 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 234003 ns | 235631 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7333 ns | 7000 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7000 ns | 7291 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7833 ns | 7250 ns | 1.08 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7250 ns | 7208 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 810029.5 ns | 803793 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2042 ns | 2334 ns | 0.87 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2000 ns | 2041 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2375 ns | 2292 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2208 ns | 2333 ns | 0.95 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18218 ns | 18245.5 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6542 ns | 6750 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6500 ns | 6459 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6708 ns | 6667 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6750 ns | 6625 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 335368 ns | 333087.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 750166 ns | 748458 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 746604.5 ns | 746645.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 751041 ns | 746833 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 761417 ns | 749417 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21856 ns | 21817 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 775334 ns | 789125.5 ns | 0.98 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 775042 ns | 772625 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 804792 ns | 775145.5 ns | 1.04 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 791625 ns | 787875 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 299022 ns | 298327 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7291 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5959 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5208 ns | 5750 ns | 0.91 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 10792 ns | 0.94 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 32492 ns | 32858 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 233188 ns | 221541 ns | 1.05 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227750 ns | 226958 ns | 1.00 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 254458 ns | 226625 ns | 1.12 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 255583 ns | 220292 ns | 1.16 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 359227 ns | 360131.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11042 ns | 10250 ns | 1.08 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12458 ns | 9917 ns | 1.26 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12959 ns | 12459 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12000 ns | 10583.5 ns | 1.13 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 245075.5 ns | 243730.5 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24875 ns | 24834 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24458 ns | 24833.5 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25458 ns | 24750 ns | 1.03 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24583.5 ns | 24666 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1120608 ns | 1133764 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106980458 ns | 107061375 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 118006979.5 ns | 116928479.5 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 123940208 ns | 121136000 ns | 1.02 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 118407959 ns | 117635875 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2661574 ns | 2659433 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 394378313 ns | 396814083.5 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 368164500 ns | 366591458 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 358657167 ns | 425794499.5 ns | 0.84 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 482282708 ns | 482285959 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15138278 ns | 15258375 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 759267583 ns | 769963270.5 ns | 0.99 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 577881125 ns | 576371708 ns | 1.00 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 749378833 ns | 745582312 ns | 1.01 |
| Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 945671312.5 ns | 765495854.5 ns | 1.24 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7458 ns | 7333 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7958 ns | 6334 ns | 1.26 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8750 ns | 7750 ns | 1.13 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7333 ns | 8333 ns | 0.88 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 235620 ns | 237972 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14500 ns | 14125 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13333 ns | 13209 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15041 ns | 13417 ns | 1.12 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14292 ns | 13459 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 1078273.5 ns | 1080162 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8542 ns | 7667 ns | 1.11 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7792 ns | 5583 ns | 1.40 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9187.5 ns | 8167 ns | 1.12 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 7833.5 ns | 8291 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 235827.5 ns | 233794.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13167 ns | 12542 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12084 ns | 11875 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13084 ns | 12645.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12833 ns | 11875 ns | 1.08 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 787391.5 ns | 787815 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 347250 ns | 332667 ns | 1.04 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 344875 ns | 344396 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 409896 ns | 395770.5 ns | 1.04 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 310562 ns | 312500 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16566 ns | 16497 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 713833.5 ns | 706958.5 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 727291 ns | 725208 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 1023416 ns | 1019750 ns | 1.00 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 654959 ns | 658292 ns | 0.99 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 197250.5 ns | 198046.5 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 375 ns | 0.78 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 291 ns | 292 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23066 ns | 22951 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6250 ns | 6542 ns | 0.96 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6334 ns | 6208 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6750 ns | 6792 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6791 ns | 6208 ns | 1.09 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 238420 ns | 237567.5 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5750 ns | 5709 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5750 ns | 5667 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5875 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5834 ns | 5667 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 23863 ns | 24038 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21750 ns | 21958 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21000 ns | 20875 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21958 ns | 21625 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21708 ns | 21125 ns | 1.03 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 261085 ns | 260574.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 152146 ns | 146812.5 ns | 1.04 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 145250 ns | 143875 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 149541 ns | 145917 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 145937 ns | 178146 ns | 0.82 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166536.5 ns | 166659.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1328792 ns | 1355917 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1319083.5 ns | 1329374.5 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1350812.5 ns | 861416.5 ns | 1.57 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1317084 ns | 1325916 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1336276 ns | 1338261 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24917 ns | 23084 ns | 1.08 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24208 ns | 21458 ns | 1.13 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25708 ns | 24042 ns | 1.07 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24208.5 ns | 23958 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 351114.5 ns | 350919.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 131125 ns | 179500 ns | 0.73 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 117791 ns | 120541 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 172917 ns | 118167 ns | 1.46 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 177334 ns | 151208 ns | 1.17 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1465398.5 ns | 1454020.5 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 291 ns | 1.29 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 22926 ns | 22580 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6417 ns | 6291 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6458 ns | 6334 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6917 ns | 6791 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6542 ns | 6208 ns | 1.05 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 254551 ns | 253799.5 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7625 ns | 5042 ns | 1.51 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4167 ns | 4250 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7708.5 ns | 5833.5 ns | 1.32 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7375 ns | 4666 ns | 1.58 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 250274.5 ns | 254794.5 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 10042 ns | 1 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9708 ns | 10042 ns | 0.97 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10333 ns | 10417 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10250 ns | 10125 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1345295 ns | 1352736 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1584 ns | 1625 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1583 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1584 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1625 ns | 1542 ns | 1.05 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22897 ns | 23495 ns | 0.97 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5625 ns | 5708 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5584 ns | 5667 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5959 ns | 5750 ns | 1.04 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5958 ns | 5625 ns | 1.06 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 271438.5 ns | 273637.5 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6886125 ns | 6842458 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6378229 ns | 6343020.5 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6526875 ns | 6507417 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7602250 ns | 7623042 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 213111 ns | 213659 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24073062 ns | 24131500 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21283625 ns | 21298104 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21045584 ns | 21004749.5 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29677875 ns | 29792896 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2108165 ns | 2117701 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 37353145.5 ns | 37668083 ns | 0.99 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 34386667 ns | 34323688 ns | 1.00 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45930020.5 ns | 45641000 ns | 1.01 |
| Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 49322334 ns | 38230313 ns | 1.29 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7708.5 ns | 6459 ns | 1.19 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5250 ns | 1.12 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8333 ns | 7500 ns | 1.11 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7062.5 ns | 7458 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 238522.5 ns | 235380.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8458 ns | 8541 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8042 ns | 7792 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8583 ns | 8292 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8292 ns | 9208 ns | 0.90 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1070850 ns | 1057995 ns | 1.01 |
| lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1544374.5 ns | 1525083 ns | 1.01 |
| lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1259666.5 ns | 1258604.5 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1632771 ns | 1613917 ns | 1.01 |
| lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2150667 ns | 2159167 ns | 1.00 |
| lenet(28, 28, 1, 128)/forward/GPU/CUDA | 278945 ns | 273469.5 ns | 1.02 |
| lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7908937.5 ns | 7971979 ns | 0.99 |
| lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6609937 ns | 6561833.5 ns | 1.01 |
| lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7237750.5 ns | 7004875 ns | 1.03 |
| lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10434334 ns | 10476458 ns | 1.00 |
| lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1889956 ns | 1860749 ns | 1.02 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 340979 ns | 326083.5 ns | 1.05 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 345792 ns | 347292 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 417125 ns | 379020.5 ns | 1.10 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 345833 ns | 343562.5 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 42448 ns | 46613.5 ns | 0.91 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 746500.5 ns | 745458 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 784542 ns | 781417 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1073250 ns | 1067437.5 ns | 1.01 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 761062.5 ns | 751125 ns | 1.01 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 303720.5 ns | 306721.5 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397500 ns | 396333 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288250 ns | 287916 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 212666 ns | 288062.5 ns | 0.74 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756084 ns | 751542 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43887 ns | 43483 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 671083 ns | 646375 ns | 1.04 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 530083 ns | 531834 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 470667 ns | 530042 ns | 0.89 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 974750 ns | 973417 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 188388.5 ns | 188389 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 679250 ns | 653542 ns | 1.04 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 645333.5 ns | 639041.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 642458 ns | 545542 ns | 1.18 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 638562.5 ns | 655584 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131530 ns | 131455.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2409292 ns | 2529917 ns | 0.95 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2456416.5 ns | 2399708 ns | 1.02 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2514583 ns | 2436833 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2456292 ns | 2460520.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1277300 ns | 1513461 ns | 0.84 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 345146 ns | 323146 ns | 1.07 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 343583 ns | 343771 ns | 1.00 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 403708.5 ns | 394750 ns | 1.02 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 312208 ns | 310562 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16009 ns | 15996 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 709667 ns | 699000 ns | 1.02 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 724500 ns | 717792 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 1022687.5 ns | 1016334 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 650417 ns | 649937 ns | 1.00 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 195917 ns | 196510 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1460417 ns | 1458958 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1500812.5 ns | 1506167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1496375 ns | 1503458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1438708 ns | 1442834 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40600 ns | 39862 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5128791 ns | 5157334 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5302375 ns | 5010437.5 ns | 1.06 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5313000 ns | 4993104 ns | 1.06 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4970208.5 ns | 4988542 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 196206.5 ns | 197580.5 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3667 ns | 3709 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3667 ns | 3667 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3667 ns | 1.01 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3708 ns | 1 |
| dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 32895 ns | 32748 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15167 ns | 14833 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15083 ns | 15125 ns | 1.00 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15083 ns | 15292 ns | 0.99 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15375 ns | 15041 ns | 1.02 |
| dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 376729 ns | 374855 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71459 ns | 71625 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71250 ns | 71333 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71375 ns | 71333 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 70708 ns | 71333 ns | 0.99 |
| dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113177.5 ns | 113422 ns | 1.00 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 317917 ns | 326208 ns | 0.97 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 320417 ns | 318250 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 325333 ns | 319375 ns | 1.02 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 320916 ns | 317917 ns | 1.01 |
| dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 193043 ns | 192316 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 958 ns | 1000 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 958 ns | 959 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1000 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 23363 ns | 23450 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8083 ns | 8042 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7792 ns | 7895.5 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8750 ns | 8333 ns | 1.05 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8750 ns | 7792 ns | 1.12 |
| batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 260535.5 ns | 258455 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 475499.5 ns | 465250 ns | 1.02 |
| batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 470520.5 ns | 472750 ns | 1.00 |
| batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 557125 ns | 547875 ns | 1.02 |
| batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 557959 ns | 554667 ns | 1.01 |
| batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129404 ns | 130091 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1399270.5 ns | 1420208 ns | 0.99 |
| batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1382375 ns | 1378895.5 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1611125 ns | 1600250 ns | 1.01 |
| batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 1582104.5 ns | 1587791 ns | 1.00 |
| batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 274924 ns | 274988 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 334 ns | 1.12 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 250 ns | 292 ns | 0.86 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 292 ns | 1.28 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31647 ns | 31336 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6375 ns | 6625 ns | 0.96 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6042 ns | 5959 ns | 1.01 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6666 ns | 6354.5 ns | 1.05 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6625 ns | 6166 ns | 1.07 |
| batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 262541.5 ns | 261129.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1761833 ns | 1730708 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1723396 ns | 1721229.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1733812.5 ns | 1723750 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1730625 ns | 1730229 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 169477.5 ns | 168441.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4358625 ns | 4400167 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4358708 ns | 4366354 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4403062.5 ns | 3903958 ns | 1.13 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4373875 ns | 4358458 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1208123 ns | 1240708 ns | 0.97 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 7167 ns | 6792 ns | 1.06 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6875 ns | 6584 ns | 1.04 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 6916 ns | 6833 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6750 ns | 14542 ns | 0.46 |
| bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 20662 ns | 20531 ns | 1.01 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 51625 ns | 32708 ns | 1.58 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 32917 ns | 67708 ns | 0.49 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 48208.5 ns | 32833 ns | 1.47 |
| bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 51417 ns | 51667 ns | 1.00 |
| bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 292106.5 ns | 291979.5 ns | 1.00 |
| batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 354562.5 ns | 336292 ns | 1.05 |
| batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 348666.5 ns | 347187.5 ns | 1.00 |
| batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 433333 ns | 415021 ns | 1.04 |
| batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 322041.5 ns | 324666.5 ns | 0.99 |
| batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18353 ns | 18102.5 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 724625 ns | 718416.5 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 730583 ns | 727250 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 1038687.5 ns | 1030292 ns | 1.01 |
| batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 675333 ns | 672709 ns | 1.00 |
| batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 335730.5 ns | 346719.5 ns | 0.97 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75458 ns | 75667 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 75333 ns | 75208 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75375 ns | 75375 ns | 1 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 74584 ns | 75000 ns | 0.99 |
| dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46864.5 ns | 46739 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 325166 ns | 333209 ns | 0.98 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 324250 ns | 331291 ns | 0.98 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 336875 ns | 332729.5 ns | 1.01 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 325125 ns | 324292 ns | 1.00 |
| dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 209059.5 ns | 208913 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1485709 ns | 1483875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1526833 ns | 1531875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1522792 ns | 1529458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1462625 ns | 1467834 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51397 ns | 51266 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5113395.5 ns | 5149875 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5295292 ns | 5290166.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5300812.5 ns | 5287000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5001042 ns | 4982583 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 202971.5 ns | 202737.5 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28250 ns | 28291 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28208 ns | 28167 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28208 ns | 28291 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28209 ns | 28167 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24514.5 ns | 24497 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66417 ns | 66625 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66458 ns | 66542 ns | 1.00 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66500 ns | 66500 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66500 ns | 66500 ns | 1 |
| dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 505942 ns | 532969 ns | 0.95 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1502084 ns | 1260875 ns | 1.19 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 1124250 ns | 1118417 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 944270.5 ns | 1056541 ns | 0.89 |
| mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2255250 ns | 2256375 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 566674 ns | 573252 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3090791 ns | 3028208 ns | 1.02 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 2751542 ns | 2726937.5 ns | 1.01 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2628896 ns | 2733875 ns | 0.96 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3819709 ns | 3818500 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 1979936 ns | 1997088 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 8847333 ns | 8958062.5 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 8768375 ns | 8813834 ns | 0.99 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 8750250 ns | 8742917 ns | 1.00 |
| mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 6340375 ns | 6350021 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 85125 ns | 82895.5 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 83021 ns | 80270.5 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85708.5 ns | 82875 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83562.5 ns | 80167 ns | 1.04 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192703 ns | 192999 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2012875 ns | 2045708.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2024062.5 ns | 2026499.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2038542 ns | 2015875 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2008812 ns | 2005042 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 791664.5 ns | 797613 ns | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
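
For readers checking the figures, the third value in each entry appears to be the current timing divided by the previous one, so a value above 1 means the current run was slower. The following is a minimal sketch of that arithmetic, not the tool's own code: the helper names `ratio` and `flag_regressions` and the 1.5 cutoff are hypothetical choices made only for illustration, and the three sample rows are copied from the results above.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current time divided by previous time; above 1.0 means slower."""
    return current_ns / previous_ns


def flag_regressions(rows, threshold=1.5):
    """Return names of rows whose ratio is at or above `threshold`.

    `threshold` is a hypothetical cutoff chosen for this example only.
    """
    return [name for name, cur, prev in rows if ratio(cur, prev) >= threshold]


# (name, current ns, previous ns), copied from the results above.
rows = [
    ("layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)",
     1350812.5, 861416.5),
    ("bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)",
     6750, 14542),
    ("dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)",
     66500, 66500),
]

for name, cur, prev in rows:
    print(f"{name}: {ratio(cur, prev):.2f}")

print(flag_regressions(rows))
```

Many of the largest swings are on sub-microsecond or few-microsecond CPU benchmarks, where run-to-run noise is plausible, so a single ratio is better read as a prompt to re-run than as proof of a regression.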