chore: bump crate-ci/typos from 1.26.8 to 1.27.0 (#1022)
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.26.8 to 1.27.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.26.8...v1.27.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
dependabot[bot] authored Nov 4, 2024
1 parent 89a7083 commit 8bfa628
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion .github/workflows/QualityCheck.yml
@@ -16,4 +16,4 @@ jobs:
 - name: Checkout Actions Repository
   uses: actions/checkout@v4
 - name: Check spelling
-  uses: crate-ci/typos@v1.26.8
+  uses: crate-ci/typos@v1.27.0
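
The commit metadata labels this change `version-update:semver-minor`. As a minimal sketch of what that label means for the two version strings in this commit, the hypothetical helper `classify_bump` below compares `MAJOR.MINOR.PATCH` components left to right. It is an illustration only, not Dependabot's actual logic, and it assumes plain three-part numeric versions with an optional leading `v`.

```python
def classify_bump(old: str, new: str) -> str:
    """Name the first semver component that differs between two versions.

    Hypothetical helper for illustration; assumes 'MAJOR.MINOR.PATCH'
    strings with an optional leading 'v' and no pre-release suffix.
    """
    old_parts = [int(p) for p in old.lstrip("v").split(".")]
    new_parts = [int(p) for p in new.lstrip("v").split(".")]
    for label, o, n in zip(("major", "minor", "patch"), old_parts, new_parts):
        if o != n:
            return label
    return "none"


# The bump in this commit: the minor component changes (26 to 27).
bump = classify_bump("v1.26.8", "v1.27.0")
print(bump)
```

A minor bump is expected to stay backward compatible under semantic versioning, which is consistent with this change being a one-line edit to the workflow file.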

1 comment on commit 8bfa628

@github-actions

Lux Benchmarks
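
In the table that follows, the Ratio column matches the current timing divided by the previous one, rounded to two decimals (for example, 3542 ns / 1000 ns gives 3.54), so values above 1 mean the current commit was slower. The sketch below parses rows in that flattened form and lists the ones whose ratio falls outside a band. The names `parse_row` and `flag_outliers` and the 1.2 threshold are assumptions made for illustration; they are not part of the benchmark tooling.

```python
import re

# A row ends with "<current> ns <previous> ns <ratio>"; everything before is the name.
ROW = re.compile(
    r"^(?P<name>.+) (?P<cur>\d+(?:\.\d+)?) ns "
    r"(?P<prev>\d+(?:\.\d+)?) ns (?P<ratio>\d+(?:\.\d+)?)$"
)


def parse_row(line):
    """Return (name, current_ns, previous_ns, reported_ratio), or None if no match."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    return m["name"], float(m["cur"]), float(m["prev"]), float(m["ratio"])


def flag_outliers(lines, threshold=1.2):
    """Return (name, ratio) for rows slower than threshold or faster than 1/threshold.

    The ratio is recomputed as current / previous rather than read from the row.
    """
    flagged = []
    for line in lines:
        row = parse_row(line)
        if row is None:
            continue  # header line or other non-data text
        name, cur, prev, _reported = row
        ratio = round(cur / prev, 2)
        if ratio > threshold or ratio < 1 / threshold:
            flagged.append((name, ratio))
    return flagged


rows = [
    "Benchmark suite Current: 8bfa628 Previous: 409eda2 Ratio",
    "bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1250 ns 3042 ns 0.41",
    "bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3542 ns 1000 ns 3.54",
    "dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3875 ns 3875 ns 1",
]
outliers = flag_outliers(rows)
for name, ratio in outliers:
    print(f"{ratio:.2f}  {name}")
```

Large swings on sub-microsecond CPU benchmarks, like the two `bias_activation` rows used here, may well be timing noise rather than real changes, which is plausible for a commit that only touches a spell-check workflow.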

Benchmark suite Current: 8bfa628 Previous: 409eda2 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4625 ns 4334 ns 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4084 ns 4125 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5791 ns 5417 ns 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4292 ns 4167 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 60959 ns 59978 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10125 ns 10333 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9959 ns 10167 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10375 ns 10500 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10666 ns 10167 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 427044 ns 416390 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1167 ns 1166.5 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1250 ns 3042 ns 0.41
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1458 ns 1208 ns 1.21
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3542 ns 1000 ns 3.54
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18260 ns 18063 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4125 ns 4084 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 3833 ns 3958 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4125 ns 4250 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4000 ns 4125 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 111381 ns 109325.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57709 ns 56041 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47250 ns 46084 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38250 ns 46375 ns 0.82
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80333 ns 81834 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37655 ns 36229 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2026167 ns 2056625 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2092708.5 ns 2082416.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2059625.5 ns 2056666.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1993416 ns 1995458 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 197377 ns 192802 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 152958 ns 172458 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 148250 ns 144854.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 146417 ns 148125 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 150375 ns 146125 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167595 ns 166789 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1098542 ns 1157666 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1124250 ns 1110395.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1116146 ns 1128416.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1107229.5 ns 1120208 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 523151 ns 516061 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3584 ns 3583 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3625 ns 3583.5 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5708.5 ns 4229.5 ns 1.35
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3417 ns 3292 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 70157 ns 69748 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8834 ns 8792 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8667 ns 9125 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9291 ns 9000 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9042 ns 9209 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 492826.5 ns 470533 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17000 ns 15083 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16375 ns 14875 ns 1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18667 ns 16583 ns 1.13
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17083 ns 14917 ns 1.15
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 54850 ns 53475 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213146 ns 222375 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 216104 ns 213084 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214167 ns 213250 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 225333 ns 213520.5 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 272672.5 ns 267675 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 459 ns 500 ns 0.92
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 709 ns 584 ns 1.21
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 583 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17542 ns 17384 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1708 ns 1500 ns 1.14
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1458 ns 1500 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1625 ns 1750 ns 0.93
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1750 ns 1583 ns 1.11
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 104205 ns 103376 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7041 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5833 ns 5625 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5209 ns 5709 ns 0.91
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 4000 ns 9916 ns 0.40
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23961 ns 23093 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228750.5 ns 227583.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228333 ns 230417 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228500 ns 228000 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226334 ns 215542 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 170956 ns 166208.5 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 3916 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3916 ns 3834 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3834 ns 3834 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23832 ns 23533 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16833 ns 16708 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16708 ns 16750 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16708 ns 16791 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16958 ns 16625 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 165501.5 ns 160718 ns 1.03
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 579042 ns 577333 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 574375 ns 573417 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 575083 ns 579000 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 576292 ns 574042 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113664 ns 113474 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1417708 ns 1432312.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1429333 ns 1426250 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1425729.5 ns 1425917 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1422208 ns 1418000 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 214791 ns 211622 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1082104 ns 1046541 ns 1.03
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 959958.5 ns 965500 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1341792 ns 1347458 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1294792 ns 1290542 ns 1.00
lenet(28, 28, 1, 64)/forward/GPU/CUDA 281583.5 ns 267857 ns 1.05
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5777875 ns 5895833.5 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4456083 ns 4588042 ns 0.97
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4934792 ns 4928187 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5627500 ns 5737167 ns 0.98
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1106964 ns 1066176 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23988 ns 23460 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2084 ns 2084 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2083 ns 2125 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2125 ns 2292 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 179026 ns 169490.5 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6084 ns 5458 ns 1.11
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6167 ns 4000 ns 1.54
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7041 ns 5687.5 ns 1.24
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6375 ns 6250 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 66163.5 ns 64594 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11291 ns 11083 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10791 ns 11333 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12125 ns 12041 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11354.5 ns 11083.5 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 456626.5 ns 444224 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7000 ns 6708 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7042 ns 6416 ns 1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8375 ns 7875 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7042 ns 6500 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 52652 ns 51136 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17375 ns 17583 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17167 ns 16958 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17770.5 ns 18145.5 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18708 ns 16916 ns 1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 306093.5 ns 297812 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 459 ns 500 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 459 ns 500 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 583 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 33004 ns 31896 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8583 ns 8916 ns 0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8208 ns 8667 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9583 ns 9250 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9042 ns 8645.5 ns 1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 162492.5 ns 155805 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64542 ns 64937.5 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64417 ns 62625 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64625 ns 64500 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64750 ns 64667 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112347.5 ns 110478.5 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 277542 ns 294791 ns 0.94
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 281625 ns 279125 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 288750 ns 275479.5 ns 1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 275500 ns 280854.5 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 189809 ns 185224.5 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3285583 ns 3152041.5 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3022333.5 ns 3026187 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 2780375 ns 3022520.5 ns 0.92
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4038625 ns 3964167 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 573967 ns 573818.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7586208.5 ns 7551166.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7415437 ns 7449979 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7333375 ns 7447000 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8220958 ns 8208396 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1351752.5 ns 1327975 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18835167 ns 18867458 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19044834 ns 19142541 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19135125 ns 19088834 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15633417 ns 15711167 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23661916.5 ns 24315583.5 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33965500 ns 33983500 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 41107417 ns 37046583.5 ns 1.11
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34858709 ns 34841833 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1862815 ns 2130242 ns 0.87
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 189289541 ns 192387270.5 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 164224708 ns 163943875 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 157847979 ns 152577625 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 438904833 ns 437847333 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13913764 ns 14119852 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 289733584 ns 294725229.5 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 338173667 ns 338344395.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 307489541.5 ns 300590083.5 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 393585937.5 ns 396800708.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21708.5 ns 23687.5 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24458 ns 23083 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25937 ns 24791 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24229 ns 23708 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 96907 ns 95862 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103750 ns 103250 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 105292 ns 103458 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104208 ns 103667 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 151250 ns 102750 ns 1.47
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 504189 ns 494978 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6583 ns 7083 ns 0.93
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7292 ns 5750 ns 1.27
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7959 ns 6875 ns 1.16
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6958 ns 7000 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68581 ns 67128 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14916.5 ns 15375 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14709 ns 15395.5 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16666 ns 16000 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14292 ns 14791.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 483895 ns 467877 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3017937 ns 3009166.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2022458 ns 2067250 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2307959 ns 2279667 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4846645.5 ns 4832667 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 585796 ns 581800.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23617917 ns 23921708.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17975417 ns 18037292 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18323812.5 ns 16963187.5 ns 1.08
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35597209 ns 34623770.5 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3109235 ns 3105602 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33405687.5 ns 33780291 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27693604 ns 27715666.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27860958 ns 27451041 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42002937.5 ns 41640208 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72375 ns 80479 ns 0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 84624.5 ns 72416 ns 1.17
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 83250 ns 78354 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73750 ns 74645.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 102852 ns 100885 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 218167 ns 311542 ns 0.70
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 309979 ns 224520.5 ns 1.38
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 317479 ns 209667 ns 1.51
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 288875 ns 257021 ns 1.12
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 550996 ns 539235 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12041 ns 12500 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12729.5 ns 11708 ns 1.09
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13833 ns 12542 ns 1.10
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11666.5 ns 12833.5 ns 0.91
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 71604 ns 70648 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26625 ns 26667 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26959 ns 26958.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 28292 ns 27333.5 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26458 ns 26625 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 484486.5 ns 470896 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12417 ns 12791 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12542 ns 12333 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14584 ns 13500 ns 1.08
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 13041.5 ns 12875 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 53694 ns 52214 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26312.5 ns 25959 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26270.5 ns 25750 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26667 ns 26500 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26333 ns 26500 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 309291.5 ns 300818.5 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 178770.5 ns 180750 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182334 ns 179583 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184895.5 ns 183146 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179750 ns 179250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 57908 ns 56380 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 587125 ns 593542 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 596500 ns 582459 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 593770.5 ns 585042 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 583166 ns 594562 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 290369.5 ns 284588 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7354.5 ns 6770.5 ns 1.09
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7167 ns 5958 ns 1.20
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7875 ns 7084 ns 1.11
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6833 ns 7125 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70829 ns 70103 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14375 ns 14709 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14708 ns 14500 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15625 ns 15291.5 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14083 ns 13958 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 471312.5 ns 460969.5 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1235042 ns 1217750 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1283583 ns 1209125 ns 1.06
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1282875 ns 1249750 ns 1.03
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1325208 ns 1326625 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301270 ns 302841 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4111125 ns 4351270.5 ns 0.94
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4361625 ns 4353042 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4786395.5 ns 4630333 ns 1.03
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 4453229.5 ns 4466479 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1047552 ns 1039570 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1750 ns 1792 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1834 ns 1833 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1834 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23328 ns 23644 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4833 ns 4875 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4792 ns 4875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4917 ns 5042 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4917 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 186698 ns 189061.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7208.5 ns 6021 ns 1.20
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5584 ns 5708 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8667 ns 7042 ns 1.23
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7312.5 ns 7416 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 54539 ns 54998.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10833 ns 11437.5 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10834 ns 11084 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12375 ns 11666 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11916 ns 12333 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 329099 ns 332242 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 334 ns 333 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22753 ns 22998 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2708 ns 2667 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2667 ns 2750 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2959 ns 2750 ns 1.08
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 3000 ns 2709 ns 1.11
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 157496 ns 158762.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 13167 ns 13687.5 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 13166 ns 11208 ns 1.17
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 15000 ns 13958 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 13792 ns 14125 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 55218 ns 57325 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24833 ns 24625 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24542 ns 24250 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25375 ns 25500 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24709 ns 24875 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 289966 ns 295945 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4083 ns 4167 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4166 ns 4166 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4167 ns 4167 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4125 ns 4125 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24660 ns 24912 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 15958 ns 16084 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16417 ns 16209 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16042 ns 16333.5 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16125 ns 16208 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 194045.5 ns 199034.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5667 ns 5708 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5625 ns 5584 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5750 ns 5708 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5791 ns 5708 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 32989 ns 33099 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 21125 ns 21166 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20459 ns 20458 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21542 ns 21333.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20875 ns 20875 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 174273 ns 174613 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 403209 ns 383042 ns 1.05
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 371125 ns 373541 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 474292 ns 485896 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 539604.5 ns 532854.5 ns 1.01
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66734 ns 66578.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 1011917 ns 938166 ns 1.08
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 884896 ns 847083 ns 1.04
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1220125 ns 1235042 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 1400208 ns 1418833 ns 0.99
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 190566.5 ns 191164 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 82917 ns 81020.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 82791 ns 80354.5 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 88958.5 ns 82250 ns 1.08
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83187.5 ns 132458 ns 0.63
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192556.5 ns 192525 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1921500 ns 1945166 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1696166 ns 1909584 ns 0.89
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1938083 ns 1920333 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1915875 ns 1914354.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 393732 ns 402795 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21580 ns 21790 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1792 ns 1791 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1916 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 165924 ns 172681 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6708 ns 8000 ns 0.84
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6250 ns 6833 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9750 ns 8334 ns 1.17
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8125 ns 7999.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 56950.5 ns 62227.5 ns 0.92
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8916.5 ns 9375 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8958 ns 8875 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9625 ns 9625 ns 1
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9542 ns 9250 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 299584.5 ns 315550.5 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120035854.5 ns 159022167 ns 0.75
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174382959 ns 174256125 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 154831333 ns 147914021 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 103109500 ns 102407958 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5474606 ns 5468366 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 617124000 ns 678096083 ns 0.91
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 555612167 ns 555598625 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 468382792 ns 453528479 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 756087750 ns 754205958.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38213656 ns 34940005 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 651747459 ns 703546875 ns 0.93
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 666674583.5 ns 666832020.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 602170708.5 ns 585927312.5 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 734251875 ns 742692916 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57208 ns 57542 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 48167 ns 47583 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 39167 ns 47291 ns 0.83
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83958 ns 82208 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37250 ns 37135 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1929792 ns 1947333 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1973292 ns 1971042 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1984249.5 ns 1976458 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1881417 ns 1893520.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 171491 ns 171380.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 273354 ns 272291 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 267959 ns 265834 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 270687.5 ns 289417 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 268834 ns 267167 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 124192.5 ns 135867.5 ns 0.91
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 658333 ns 671917 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 674854.5 ns 596708 ns 1.13
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 665333 ns 696292 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 670500 ns 692687.5 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 664813 ns 737698 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2190167 ns 2231188 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2214354.5 ns 2215042 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2216958.5 ns 2207229 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2099979 ns 2243770.5 ns 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133238 ns 133226 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5505354.5 ns 5572500 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5504750 ns 5486875 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5565292 ns 5511083 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5499708 ns 5495666.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 740235 ns 759202.5 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 650417 ns 652833.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 649020.5 ns 657229 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 640625 ns 639500 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 648292 ns 639791 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47265 ns 46976 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1821708 ns 1799583 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1720959 ns 1724792 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1675729.5 ns 1722792 ns 0.97
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2108500 ns 2103895.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 224014 ns 221178.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58583 ns 56541 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46645.5 ns 46833 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38750 ns 46041 ns 0.84
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83834 ns 83792 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28947 ns 28073 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2024916 ns 2058250 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2086188 ns 2078709 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2100521 ns 2093000 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1993416.5 ns 1996646 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 191815.5 ns 187152 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13473875 ns 13406125 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12547041.5 ns 12455458 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12559604 ns 12584792 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15213416.5 ns 14882959 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 517805 ns 517201.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47353458 ns 47687000 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41833334 ns 41754625 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41118750 ns 40922625 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58300041 ns 58112708 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3203904 ns 3212087 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74077042 ns 74213479 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 68022250 ns 68010000 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90906749.5 ns 90988625 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 99115937.5 ns 76809750 ns 1.29
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58958 ns 56917 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47375 ns 47042 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38729.5 ns 47041 ns 0.82
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83500 ns 83375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 47777 ns 46301 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1923375 ns 1939854 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1961541 ns 1973333 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1980229 ns 1974729.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1890354 ns 1884375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 194350.5 ns 189579 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 291 ns 291 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 291 ns 333 ns 0.87
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32617.5 ns 31617 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6208.5 ns 6229.5 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 5958 ns 6167 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6708 ns 6458 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6437.5 ns 6167 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 173722.5 ns 171396 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32110 ns 31328 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2583 ns 2583 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2542 ns 2625 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2833 ns 2792 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2833 ns 2625 ns 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 161891 ns 161410 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 286335145.5 ns 324182500 ns 0.88
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339870250 ns 339536042 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 320445937.5 ns 314625854 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 272825875 ns 273060250 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7113314 ns 7093070 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 990386709 ns 1051455583 ns 0.94
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 938484666 ns 941830875 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 868613416.5 ns 858538271 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1158749666 ns 1153691292 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33903874 ns 34020243.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1310266104.5 ns 1359481562.5 ns 0.96
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1325766333.5 ns 1360673729 ns 0.97
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1623996500 ns 1640965792 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1663239334 ns 1309802292 ns 1.27
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1461479 ns 1414416.5 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1415750 ns 1409541 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1429167 ns 1408500 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1414437.5 ns 1453875 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 128213 ns 127358 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5019792 ns 5056229 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5022458 ns 5013583 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5050000 ns 4954291 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5006541.5 ns 5017021 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 557532 ns 601067 ns 0.93
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 175263520.5 ns 170719208 ns 1.03
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 129816208.5 ns 132607979.5 ns 0.98
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 145953208.5 ns 124493437.5 ns 1.17
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 164619104.5 ns 162230500 ns 1.01
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4883992 ns 4886055.5 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 831528333 ns 854987208 ns 0.97
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 497840084 ns 644456708 ns 0.77
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 556789916 ns 532057834 ns 1.05
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 679969833 ns 687805708 ns 0.99
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16195623 ns 16138006 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8914083 ns 9114041.5 ns 0.98
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8769917 ns 8770313 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 8216313 ns 7860292 ns 1.05
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 10158000 ns 10147292 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1595526 ns 1612586 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 35894250 ns 37546375 ns 0.96
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 36843625 ns 36886146 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 34476562 ns 33451021 ns 1.03
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 38802729 ns 38875771 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6454567.5 ns 6459090.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47396 ns 47458.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 49334 ns 49333 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47542 ns 49583 ns 0.96
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47417 ns 47250 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 19457 ns 18585 ns 1.05
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50292 ns 50584 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50520.5 ns 50416 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50584 ns 50708.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50250 ns 50500 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 189575 ns 216293 ns 0.88
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8104 ns 7979.5 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6791 ns 6791 ns 1
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9125 ns 8875 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7333 ns 8583 ns 0.85
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 86829.5 ns 106035 ns 0.82
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9875 ns 10333 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9583 ns 9958 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10375 ns 10500 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10208 ns 10167 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 537525 ns 612658 ns 0.88
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8208 ns 8750 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8250 ns 6438 ns 1.28
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9812.5 ns 8667 ns 1.13
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6375 ns 5875 ns 1.09
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 113788.5 ns 119844.5 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13333.5 ns 13375 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12625 ns 13000 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13584 ns 13416 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13208 ns 12791 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 479705.5 ns 517417.5 ns 0.93
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 958 ns 1042 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 958 ns 958 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1042 ns 1042 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32580 ns 31817 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7750 ns 8041 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 7750 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8542 ns 8333 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8208 ns 8292 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 201701.5 ns 203048 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23250 ns 23145.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23042 ns 24541 ns 0.94
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23500 ns 24167 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23167 ns 23334 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18765.5 ns 18371 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52875 ns 52542 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52292 ns 52416 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52792 ns 52500 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52459 ns 52334 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 260844.5 ns 295739.5 ns 0.88
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1400229 ns 1440625 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1398666.5 ns 1400291 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1400708 ns 1400875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1398917 ns 1406313 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196521.5 ns 194620 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5018604 ns 5047479.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5004729.5 ns 5003458.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5044229.5 ns 4836292 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5001271 ns 4996708 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 595122 ns 628014 ns 0.95
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3043083 ns 3062438 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2094042 ns 2084417 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2287146 ns 2227208.5 ns 1.03
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4530875 ns 4812250 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 582703 ns 579246 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24366625 ns 24741125 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18829583 ns 18811521 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19120291 ns 18691437 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36653000 ns 36587416 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3189516.5 ns 3196070 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33943229 ns 34435312 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28373417 ns 28306583.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28357208 ns 28069750 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41659750 ns 41958375 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144299750 ns 145325041 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 142248375 ns 141848041.5 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 126632146 ns 123758375 ns 1.02
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 173840291.5 ns 173196604 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22781482 ns 22560824 ns 1.01
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1307941437.5 ns 942531917 ns 1.39
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1133574500.5 ns 871530625 ns 1.30
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 711240125 ns 1498315250 ns 0.47
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 670828250 ns 674150833 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118499942 ns 118289465 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74542 ns 76208 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 73917 ns 75041 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 83125 ns 77875 ns 1.07
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72916.5 ns 75417 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 225032.5 ns 273038.5 ns 0.82
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 202979.5 ns 299708 ns 0.68
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 282792 ns 284646 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 253479.5 ns 191687.5 ns 1.32
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 244146 ns 202979.5 ns 1.20
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1201754 ns 1439967 ns 0.83
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35408938 ns 36345458 ns 0.97
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 35449645.5 ns 35416645.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32512083 ns 32239562.5 ns 1.01
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 41003541.5 ns 40930312.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5848198 ns 5849412 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 146608875 ns 151966416 ns 0.96
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 151542938 ns 152232437.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 138849083 ns 136165208.5 ns 1.02
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 287439584 ns 287396625 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34913824 ns 34914778 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121086291.5 ns 158627833 ns 0.76
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174190000 ns 174511667 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 155717667 ns 148215771.5 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106488666.5 ns 108212479 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5478422 ns 5459784 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 611208666 ns 524328229.5 ns 1.17
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 466441167 ns 467038291 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 453562937.5 ns 441190000 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 741621625 ns 741818542 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35157227 ns 32279915 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 648662584 ns 692549750 ns 0.94
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 657411208 ns 656203708.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 585962375 ns 573625208 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 845072208 ns 853537834 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1304708 ns 1226937.5 ns 1.06
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 965666 ns 992979 ns 0.97
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 744354 ns 904625 ns 0.82
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 1944604 ns 2085917 ns 0.93
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 572387 ns 566912.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2974271 ns 2909667 ns 1.02
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2531646 ns 2628208 ns 0.96
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2512854 ns 2006333.5 ns 1.25
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3691334 ns 3693750.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1817474 ns 1796011.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6642416 ns 6757875 ns 0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6630792 ns 6503250 ns 1.02
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6466375 ns 6239125 ns 1.04
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4443145.5 ns 4454771 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7334 ns 7250 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6208 ns 6167 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5458 ns 6208 ns 0.88
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10167 ns 10250 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25916 ns 24809.5 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212104 ns 213666 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219562.5 ns 220313 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220667 ns 220125 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206291 ns 209542 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 257490 ns 276995.5 ns 0.93
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 301772791.5 ns 315354292 ns 0.96
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 222879750 ns 221860750 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 222700312.5 ns 197740833.5 ns 1.13
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 311773125 ns 312004542 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7676597.5 ns 7676221 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1082870459 ns 1085627020.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 892532250 ns 891084375.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 883941208.5 ns 865730125 ns 1.02
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1154293562 ns 1163266979.5 ns 0.99
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26959026 ns 26544800.5 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6459 ns 6083 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5209 ns 5583 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10000 ns 7375 ns 1.36
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5708.5 ns 5270.5 ns 1.08
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 168546.5 ns 178949 ns 0.94
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7458 ns 7708 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6792 ns 7292 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7542 ns 7500 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7792 ns 6792 ns 1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 639812.5 ns 667282.5 ns 0.96
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 458 ns 542 ns 0.85
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 458 ns 459 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 542 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 459 ns 1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 24361 ns 23245 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9000 ns 9583.5 ns 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9000 ns 9167 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9583 ns 9458.5 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9708 ns 8792 ns 1.10
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 234125.5 ns 227149 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 351500 ns 352521.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351500 ns 352709 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 351916 ns 352958.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 356625 ns 352708 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 21502 ns 21007 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 811270.5 ns 828104 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 774958.5 ns 820292 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 776584 ns 773500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 821875 ns 828312 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 315795.5 ns 289596 ns 1.09
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 335896 ns 312083.5 ns 1.08
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 338208.5 ns 340166.5 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 441167 ns 445354 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 331375 ns 333520.5 ns 0.99
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18761.5 ns 17918 ns 1.05
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 695166 ns 691583 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 738208 ns 732334 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1036458 ns 1026459 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 692396 ns 691042 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 292461.5 ns 273557 ns 1.07
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 354166.5 ns 332396 ns 1.07
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 346771 ns 348875 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 433791 ns 409541 ns 1.06
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 370250 ns 375250 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA 23121 ns 22378 ns 1.03
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 757417 ns 755875 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 749625 ns 743000 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1070562.5 ns 1068417 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 828458 ns 822124.5 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 257074.5 ns 239682 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3292 ns 3625 ns 0.91
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3458 ns 3417 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3750 ns 3583 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3417 ns 3583 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 18586 ns 17823 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4167 ns 4208 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4375 ns 4167 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4417 ns 4375 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4250 ns 4292 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 296700.5 ns 271995 ns 1.09
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3625 ns 4792 ns 0.76
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3750 ns 3834 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6541 ns 5250 ns 1.25
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6354.5 ns 3625 ns 1.75
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 232189.5 ns 214003.5 ns 1.08
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8187.5 ns 8354.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8000 ns 8334 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8458 ns 8667 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8417 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1227082 ns 1200425 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203417 ns 204209 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209541.5 ns 210000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 208250 ns 211875 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 198709 ns 199417 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 35300 ns 34086 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 612417 ns 608520.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 623292 ns 620750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 623250 ns 620416 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 630166 ns 628625 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 347973 ns 347622 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 977646 ns 980000 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 935437.5 ns 929916.5 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 970083 ns 954250 ns 1.02
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 1286374.5 ns 1278542 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/CUDA 209031 ns 206777 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4514333 ns 4651729 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4466146 ns 4500083 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4452875 ns 4296645.5 ns 1.04
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 6260416.5 ns 6216979.5 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 947144.5 ns 942518 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3542 ns 3916 ns 0.90
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3417 ns 3375 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5896 ns 4667 ns 1.26
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6667 ns 3354.5 ns 1.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 219336.5 ns 231395.5 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6917 ns 7375 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6958 ns 7292 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7708 ns 7667 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7291 ns 7000 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 1020167.5 ns 1002762 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1635042 ns 1644583 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1200395.5 ns 1174458 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1363584 ns 1323125 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2345187.5 ns 2461333.5 ns 0.95
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215784.5 ns 213304.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12316854.5 ns 12444729.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9564000 ns 9564709 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9378437.5 ns 9234833 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 17989542 ns 18020417 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1948181 ns 1940786 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17368125 ns 17431792 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14382958 ns 14392958.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14502250 ns 14240000 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21085917 ns 21049562.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 90917 ns 90625 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 89500 ns 88041 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 91833 ns 92333 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 113437.5 ns 136917 ns 0.83
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126891 ns 125618 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2009625 ns 2061125 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2030000 ns 2018458 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2039270.5 ns 1720042 ns 1.19
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1871125 ns 2024104 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1032563 ns 1024038 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 342166.5 ns 331312 ns 1.03
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 343375 ns 343500 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 406458 ns 395083 ns 1.03
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 311729 ns 310458.5 ns 1.00
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16465.5 ns 15733 ns 1.05
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 706208 ns 699959 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 728542 ns 722062.5 ns 1.01
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 1018584 ns 1018209 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 650375 ns 646375 ns 1.01
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 195366.5 ns 189475.5 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7167 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5875 ns 5958 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5416 ns 5875 ns 0.92
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34591 ns 33239 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 243791 ns 221625 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220125 ns 219959 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221083 ns 219750 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 239167 ns 218375 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 327793 ns 314279 ns 1.04
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3750 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns 3667 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3709 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22616 ns 22722 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14292 ns 14167 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14416 ns 14334 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14208 ns 14291 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14417 ns 14375 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 480334.5 ns 475447 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 94458 ns 95166.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 92625 ns 91833 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 96875 ns 96125 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 96229.5 ns 139167 ns 0.69
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126007 ns 125450 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1714792 ns 1948250 ns 0.88
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1926792 ns 1921104.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1913291.5 ns 1669729.5 ns 1.15
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1711417 ns 1920708.5 ns 0.89
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1034230 ns 954893.5 ns 1.08
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 876916.5 ns 854375 ns 1.03
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 817791 ns 817542 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1169438 ns 1213833.5 ns 0.96
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 966187.5 ns 958895.5 ns 1.01
lenet(28, 28, 1, 32)/forward/GPU/CUDA 275657.5 ns 276078 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2828583 ns 2843334 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2474833 ns 2456145.5 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3335750 ns 3332000 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3304292 ns 3419792 ns 0.97
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1618381.5 ns 1629171 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16709 ns 15333 ns 1.09
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15625 ns 14709 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18667 ns 17041 ns 1.10
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15583 ns 14333 ns 1.09
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 142594 ns 142609.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228750 ns 262125 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215750 ns 215416.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217625 ns 215250 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255500 ns 221958 ns 1.15
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 641543.5 ns 641081.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 222458 ns 221583.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 221500 ns 218625 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 223458.5 ns 222833 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 222604.5 ns 221750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 269850.5 ns 271537.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 537583 ns 497750 ns 1.08
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 497334 ns 494833 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 499583 ns 497084 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 526833 ns 509000 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1430878.5 ns 1365399 ns 1.05
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 330125 ns 315729 ns 1.05
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 332834 ns 333917 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 435458.5 ns 375125 ns 1.16
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 315917 ns 322083 ns 0.98
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16581 ns 16846 ns 0.98
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 717084 ns 710041 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 728166.5 ns 725063 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 1021104 ns 1022417 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 662729.5 ns 663021 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 195479.5 ns 196884 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17875 ns 17625 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17167 ns 16708 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20250 ns 18792 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17208 ns 17625 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 145639 ns 144721 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 223750 ns 220104.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212417 ns 212792 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214041 ns 212750 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221917 ns 217250 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1035551.5 ns 955774 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6708 ns 6042 ns 1.11
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6333 ns 4250 ns 1.49
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7208 ns 6958 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6625 ns 6541 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 240542 ns 245177 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10584 ns 10583.5 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9917 ns 10250 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11166.5 ns 10708 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10917 ns 10084 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 1097401.5 ns 1099715 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3500 ns 4542 ns 0.77
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3208 ns 3208 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6333.5 ns 4834 ns 1.31
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6750 ns 2875 ns 2.35
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 250006 ns 250616.5 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7625 ns 7125 ns 1.07
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7084 ns 7375 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 7750 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7500 ns 7375 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1102649 ns 1110249 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23315625 ns 24293729.5 ns 0.96
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34529125 ns 34647499.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 41513333.5 ns 38065167 ns 1.09
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34929834 ns 34799687.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1838602 ns 1834951 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184421875 ns 187799375 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159459792 ns 159175458 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 151225083 ns 146555271 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 413223958 ns 415008291 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16387494 ns 16504056.5 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 428743125 ns 437855250 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 252439020.5 ns 254443000 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 233017396 ns 231693624.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 484197291 ns 485497958 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 183584 ns 184229.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182750 ns 181916 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 186625 ns 184084 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 183146 ns 182167 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 228677.5 ns 230730 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 596083 ns 637084 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 586292 ns 586270.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 589770.5 ns 586583 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 631958 ns 631542 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1119701 ns 1097701 ns 1.02
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3838833 ns 3894562.5 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3643375.5 ns 3827292 ns 0.95
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3563521 ns 3469958 ns 1.03
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 5359750 ns 5353020.5 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 537722 ns 535365 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17412417 ns 18146250 ns 0.96
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17190667 ns 17166041.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 17100375 ns 16601417 ns 1.03
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 22144083 ns 22202083 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2612799 ns 2616593 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 458 ns 458 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32035 ns 32123 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9208 ns 9458 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8542 ns 8667 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10208 ns 9167 ns 1.11
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9459 ns 9208 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 264327.5 ns 267754 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 504274209 ns 580762562.5 ns 0.87
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 430218396 ns 427173312.5 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 471374500 ns 376948624.5 ns 1.25
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 672994208.5 ns 671986666.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12486595 ns 12479261 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2049529562.5 ns 2061821458.5 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1632649709 ns 1626836125 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1536417708 ns 1500724875 ns 1.02
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2205666041.5 ns 2217147562.5 ns 0.99
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49389302 ns 48947892 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1657645.5 ns 1651250 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1189208.5 ns 1196959 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1382000 ns 1346187.5 ns 1.03
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2334125 ns 2356042 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214982 ns 218070 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12688500 ns 12822417 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9942000 ns 9953541.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9748312.5 ns 9605000 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18407312 ns 18408062.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2050613 ns 2047696.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17691583.5 ns 17771104.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14746041.5 ns 14762729 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14804417 ns 14473917 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21386084 ns 21336042 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26167 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26292 ns 26209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26291 ns 26583 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26291 ns 26209 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 24125 ns 24922 ns 0.97
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66875 ns 66792 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66917 ns 67000 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67083 ns 66791 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67209 ns 66916 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 398847.5 ns 410676.5 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 202667 ns 203542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209000 ns 210583 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209167 ns 210500 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199583 ns 199958 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26392 ns 26405 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 612416.5 ns 602333 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 627416.5 ns 621292 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 667979 ns 621250 ns 1.08
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 631250 ns 630584 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 353043.5 ns 355627 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 645542 ns 657646 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 643375 ns 638729 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 664187.5 ns 544125 ns 1.22
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 540834 ns 677396 ns 0.80
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132126 ns 132242 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2247375 ns 2305542 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2239958 ns 2254292 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2302917 ns 1426250 ns 1.61
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2219000 ns 2248542 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1328726 ns 1182706 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17667 ns 17937.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 16979.5 ns 17042 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20792 ns 19500 ns 1.07
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18500 ns 16895.5 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 146392.5 ns 144900 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 229708 ns 220000 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 225333 ns 218416.5 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229292 ns 219458 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 259083 ns 261708 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1081671 ns 1051792 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 459 ns 459 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 542 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 458 ns 1.18
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23645 ns 23475 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9833.5 ns 9520.5 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9542 ns 9541 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10708 ns 10166 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9916 ns 9375 ns 1.06
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 262941 ns 261505 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7291 ns 6542 ns 1.11
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5833 ns 5292 ns 1.10
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9625 ns 6625 ns 1.45
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7250 ns 7416 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 234003 ns 235631 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7333 ns 7000 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7000 ns 7291 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7833 ns 7250 ns 1.08
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7208 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 810029.5 ns 803793 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2042 ns 2334 ns 0.87
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2000 ns 2041 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2375 ns 2292 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2208 ns 2333 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 18218 ns 18245.5 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6542 ns 6750 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6500 ns 6459 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6708 ns 6667 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6750 ns 6625 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 335368 ns 333087.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 750166 ns 748458 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746604.5 ns 746645.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 751041 ns 746833 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 761417 ns 749417 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 21856 ns 21817 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 775334 ns 789125.5 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 775042 ns 772625 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 804792 ns 775145.5 ns 1.04
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 791625 ns 787875 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 299022 ns 298327 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7291 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5875 ns 5959 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5208 ns 5750 ns 0.91
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10792 ns 0.94
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 32492 ns 32858 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 233188 ns 221541 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227750 ns 226958 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 254458 ns 226625 ns 1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 255583 ns 220292 ns 1.16
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 359227 ns 360131.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11042 ns 10250 ns 1.08
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12458 ns 9917 ns 1.26
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12959 ns 12459 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 12000 ns 10583.5 ns 1.13
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 245075.5 ns 243730.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24875 ns 24834 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24458 ns 24833.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25458 ns 24750 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24583.5 ns 24666 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1120608 ns 1133764 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106980458 ns 107061375 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 118006979.5 ns 116928479.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 123940208 ns 121136000 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 118407959 ns 117635875 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2661574 ns 2659433 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 394378313 ns 396814083.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 368164500 ns 366591458 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 358657167 ns 425794499.5 ns 0.84
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 482282708 ns 482285959 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15138278 ns 15258375 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 759267583 ns 769963270.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 577881125 ns 576371708 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 749378833 ns 745582312 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 945671312.5 ns 765495854.5 ns 1.24
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7458 ns 7333 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7958 ns 6334 ns 1.26
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8750 ns 7750 ns 1.13
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7333 ns 8333 ns 0.88
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 235620 ns 237972 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14500 ns 14125 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13333 ns 13209 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15041 ns 13417 ns 1.12
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14292 ns 13459 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 1078273.5 ns 1080162 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8542 ns 7667 ns 1.11
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7792 ns 5583 ns 1.40
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9187.5 ns 8167 ns 1.12
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 7833.5 ns 8291 ns 0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 235827.5 ns 233794.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13167 ns 12542 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12084 ns 11875 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13084 ns 12645.5 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12833 ns 11875 ns 1.08
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 787391.5 ns 787815 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 347250 ns 332667 ns 1.04
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 344875 ns 344396 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 409896 ns 395770.5 ns 1.04
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 310562 ns 312500 ns 0.99
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16566 ns 16497 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 713833.5 ns 706958.5 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 727291 ns 725208 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 1023416 ns 1019750 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 654959 ns 658292 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 197250.5 ns 198046.5 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 291 ns 292 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23066 ns 22951 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6250 ns 6542 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6334 ns 6208 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6750 ns 6792 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6791 ns 6208 ns 1.09
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 238420 ns 237567.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5709 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5750 ns 5667 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5875 ns 5875 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5834 ns 5667 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 23863 ns 24038 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21750 ns 21958 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21000 ns 20875 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21958 ns 21625 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21708 ns 21125 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 261085 ns 260574.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 152146 ns 146812.5 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 145250 ns 143875 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 149541 ns 145917 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 145937 ns 178146 ns 0.82
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166536.5 ns 166659.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1328792 ns 1355917 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1319083.5 ns 1329374.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1350812.5 ns 861416.5 ns 1.57
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1317084 ns 1325916 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1336276 ns 1338261 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24917 ns 23084 ns 1.08
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24208 ns 21458 ns 1.13
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25708 ns 24042 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24208.5 ns 23958 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 351114.5 ns 350919.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 131125 ns 179500 ns 0.73
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 117791 ns 120541 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 172917 ns 118167 ns 1.46
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 177334 ns 151208 ns 1.17
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1465398.5 ns 1454020.5 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 292 ns 292 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 22926 ns 22580 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6417 ns 6291 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6458 ns 6334 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6917 ns 6791 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6542 ns 6208 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 254551 ns 253799.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7625 ns 5042 ns 1.51
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4167 ns 4250 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7708.5 ns 5833.5 ns 1.32
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7375 ns 4666 ns 1.58
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 250274.5 ns 254794.5 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10042 ns 10042 ns 1
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9708 ns 10042 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10333 ns 10417 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10250 ns 10125 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1345295 ns 1352736 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1584 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1583 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1584 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1542 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22897 ns 23495 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5625 ns 5708 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5584 ns 5667 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5959 ns 5750 ns 1.04
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5958 ns 5625 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 271438.5 ns 273637.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6886125 ns 6842458 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6378229 ns 6343020.5 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6526875 ns 6507417 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7602250 ns 7623042 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 213111 ns 213659 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24073062 ns 24131500 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21283625 ns 21298104 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21045584 ns 21004749.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29677875 ns 29792896 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2108165 ns 2117701 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37353145.5 ns 37668083 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 34386667 ns 34323688 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45930020.5 ns 45641000 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49322334 ns 38230313 ns 1.29
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7708.5 ns 6459 ns 1.19
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5875 ns 5250 ns 1.12
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8333 ns 7500 ns 1.11
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7062.5 ns 7458 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 238522.5 ns 235380.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 8541 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8042 ns 7792 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8583 ns 8292 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8292 ns 9208 ns 0.90
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1070850 ns 1057995 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1544374.5 ns 1525083 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1259666.5 ns 1258604.5 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1632771 ns 1613917 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2150667 ns 2159167 ns 1.00
lenet(28, 28, 1, 128)/forward/GPU/CUDA 278945 ns 273469.5 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7908937.5 ns 7971979 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6609937 ns 6561833.5 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7237750.5 ns 7004875 ns 1.03
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10434334 ns 10476458 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1889956 ns 1860749 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 340979 ns 326083.5 ns 1.05
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 345792 ns 347292 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 417125 ns 379020.5 ns 1.10
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 345833 ns 343562.5 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/CUDA 42448 ns 46613.5 ns 0.91
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 746500.5 ns 745458 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 784542 ns 781417 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1073250 ns 1067437.5 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 761062.5 ns 751125 ns 1.01
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 303720.5 ns 306721.5 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397500 ns 396333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288250 ns 287916 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 212666 ns 288062.5 ns 0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756084 ns 751542 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43887 ns 43483 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 671083 ns 646375 ns 1.04
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 530083 ns 531834 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 470667 ns 530042 ns 0.89
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 974750 ns 973417 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 188388.5 ns 188389 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 679250 ns 653542 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 645333.5 ns 639041.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 642458 ns 545542 ns 1.18
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 638562.5 ns 655584 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131530 ns 131455.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2409292 ns 2529917 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2456416.5 ns 2399708 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2514583 ns 2436833 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2456292 ns 2460520.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1277300 ns 1513461 ns 0.84
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 345146 ns 323146 ns 1.07
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 343583 ns 343771 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 403708.5 ns 394750 ns 1.02
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 312208 ns 310562 ns 1.01
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16009 ns 15996 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 709667 ns 699000 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 724500 ns 717792 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 1022687.5 ns 1016334 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 650417 ns 649937 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 195917 ns 196510 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1460417 ns 1458958 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1500812.5 ns 1506167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1496375 ns 1503458 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1438708 ns 1442834 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40600 ns 39862 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5128791 ns 5157334 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5302375 ns 5010437.5 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5313000 ns 4993104 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4970208.5 ns 4988542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 196206.5 ns 197580.5 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3667 ns 3709 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3667 ns 3667 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3709 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3708 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 32895 ns 32748 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15167 ns 14833 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15083 ns 15125 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15083 ns 15292 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15375 ns 15041 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 376729 ns 374855 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71459 ns 71625 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71250 ns 71333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71375 ns 71333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 70708 ns 71333 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113177.5 ns 113422 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 317917 ns 326208 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 320417 ns 318250 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 325333 ns 319375 ns 1.02
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 320916 ns 317917 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 193043 ns 192316 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 958 ns 1000 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 958 ns 959 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1000 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23363 ns 23450 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8083 ns 8042 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7792 ns 7895.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8750 ns 8333 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8750 ns 7792 ns 1.12
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 260535.5 ns 258455 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 475499.5 ns 465250 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 470520.5 ns 472750 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 557125 ns 547875 ns 1.02
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 557959 ns 554667 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129404 ns 130091 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1399270.5 ns 1420208 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1382375 ns 1378895.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1611125 ns 1600250 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 1582104.5 ns 1587791 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 274924 ns 274988 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 250 ns 292 ns 0.86
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31647 ns 31336 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6375 ns 6625 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6042 ns 5959 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6666 ns 6354.5 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6625 ns 6166 ns 1.07
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 262541.5 ns 261129.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1761833 ns 1730708 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1723396 ns 1721229.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1733812.5 ns 1723750 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1730625 ns 1730229 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 169477.5 ns 168441.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4358625 ns 4400167 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4358708 ns 4366354 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4403062.5 ns 3903958 ns 1.13
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4373875 ns 4358458 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1208123 ns 1240708 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7167 ns 6792 ns 1.06
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6875 ns 6584 ns 1.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6916 ns 6833 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6750 ns 14542 ns 0.46
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 20662 ns 20531 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 51625 ns 32708 ns 1.58
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 32917 ns 67708 ns 0.49
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 48208.5 ns 32833 ns 1.47
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 51417 ns 51667 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 292106.5 ns 291979.5 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 354562.5 ns 336292 ns 1.05
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 348666.5 ns 347187.5 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 433333 ns 415021 ns 1.04
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 322041.5 ns 324666.5 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18353 ns 18102.5 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 724625 ns 718416.5 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 730583 ns 727250 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 1038687.5 ns 1030292 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 675333 ns 672709 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 335730.5 ns 346719.5 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75458 ns 75667 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75333 ns 75208 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75375 ns 75375 ns 1
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 74584 ns 75000 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46864.5 ns 46739 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 325166 ns 333209 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 324250 ns 331291 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 336875 ns 332729.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 325125 ns 324292 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 209059.5 ns 208913 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1485709 ns 1483875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1526833 ns 1531875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1522792 ns 1529458 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1462625 ns 1467834 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 51397 ns 51266 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5113395.5 ns 5149875 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5295292 ns 5290166.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5300812.5 ns 5287000 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5001042 ns 4982583 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 202971.5 ns 202737.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28291 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28208 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28208 ns 28291 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28209 ns 28167 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24514.5 ns 24497 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66417 ns 66625 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66458 ns 66542 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66500 ns 66500 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66500 ns 66500 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 505942 ns 532969 ns 0.95
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1502084 ns 1260875 ns 1.19
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1124250 ns 1118417 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 944270.5 ns 1056541 ns 0.89
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2255250 ns 2256375 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 566674 ns 573252 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3090791 ns 3028208 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2751542 ns 2726937.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2628896 ns 2733875 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3819709 ns 3818500 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1979936 ns 1997088 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8847333 ns 8958062.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8768375 ns 8813834 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8750250 ns 8742917 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6340375 ns 6350021 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 85125 ns 82895.5 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 83021 ns 80270.5 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 85708.5 ns 82875 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83562.5 ns 80167 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192703 ns 192999 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2012875 ns 2045708.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2024062.5 ns 2026499.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2038542 ns 2015875 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2008812 ns 2005042 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 791664.5 ns 797613 ns 0.99
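
The Ratio column appears to be the Current time divided by the Previous time, rounded to two decimals, so values above 1 mean the current commit was slower. The sketch below re-derives that ratio for three rows copied from the table. It is an illustration only, not part of github-action-benchmark: the helper names `parse_row`, `recomputed_ratio`, and `slower_than`, and the 1.10 cutoff, are hypothetical.

```python
# Three rows copied verbatim from the table above.
ROWS = [
    "mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1502084 ns 1260875 ns 1.19",
    "mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 944270.5 ns 1056541 ns 0.89",
    "dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 505942 ns 532969 ns 0.95",
]


def parse_row(line):
    """Split one flattened row into (name, current_ns, previous_ns, ratio)."""
    # Benchmark names contain spaces, so split from the right:
    # name | current | "ns" | previous | "ns" | ratio
    name, current, _, previous, _, ratio = line.rsplit(maxsplit=5)
    return name, float(current), float(previous), float(ratio)


def recomputed_ratio(current_ns, previous_ns):
    """Current over previous, rounded to two decimals like the table."""
    return round(current_ns / previous_ns, 2)


def slower_than(rows, threshold):
    """Return the parsed rows whose recomputed ratio exceeds `threshold`."""
    return [r for r in rows if recomputed_ratio(r[1], r[2]) > threshold]


parsed = [parse_row(line) for line in ROWS]
for name, current_ns, previous_ns, reported in parsed:
    # Print the reported ratio next to the recomputed one for comparison.
    print(name, reported, recomputed_ratio(current_ns, previous_ns))

# Hypothetical cutoff: list rows more than 10% slower than the previous run.
for name, *_ in slower_than(parsed, 1.10):
    print("slower:", name)
```

With a 1.10 cutoff, only the 2-thread `mlp7layer_bn` forward row is listed. A single run like this can be noisy, so a ratio on its own is weak evidence of a real regression.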

This comment was automatically generated by workflow using github-action-benchmark.
