This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

refactor: move JuliaSIMD deps to extensions #175

Merged: 11 commits, Oct 18, 2024

avik-pal
Member

No description provided.

Contributor

github-actions bot left a comment

LuxLib Benchmarks
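
Each row in the table below appears to follow the layout `<benchmark name> <current> ns <previous> ns <ratio>`, where the ratio is the current time divided by the previous time, so values above 1 mean the current commit is slower. The following is a minimal Python sketch for scanning such rows for slowdowns. The row layout is inferred from the table, and the helper names (`parse_row`, `regressions`) and the 1.5 threshold are hypothetical choices for illustration, not part of the benchmark tooling.

```python
import re

# Assumed row layout, inferred from the table:
#   "<name> <current> ns <previous> ns <ratio>"
ROW = re.compile(
    r"^(?P<name>.+?) (?P<cur>[\d.]+) ns (?P<prev>[\d.]+) ns (?P<ratio>[\d.]+)$"
)

def parse_row(line):
    """Split one table row into (name, current_ns, previous_ns, reported_ratio)."""
    m = ROW.match(line.strip())
    if m is None:
        return None  # header or malformed line
    return m["name"], float(m["cur"]), float(m["prev"]), float(m["ratio"])

def regressions(lines, threshold=1.5):
    """Return (name, ratio) for rows where current / previous exceeds threshold.

    The threshold is an illustrative assumption, not a project setting.
    """
    flagged = []
    for line in lines:
        row = parse_row(line)
        if row is None:
            continue
        name, cur, prev, _ = row
        ratio = cur / prev  # recomputed rather than trusting the rounded column
        if ratio > threshold:
            flagged.append((name, round(ratio, 2)))
    return flagged

# Three rows copied from the table.
rows = [
    "dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 835667 ns 385625 ns 2.17",
    "batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6152167 ns 6150354 ns 1.00",
    "groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 181959 ns 62750 ns 2.90",
]
flagged = regressions(rows)
for name, ratio in flagged:
    print(f"{ratio:.2f}x slower: {name}")
```

Recomputing the ratio from the two timings avoids relying on the rounded last column, which is sometimes printed as `1` and sometimes as `1.00`.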

Columns: benchmark suite, current (d2f76dd), previous (604783f), ratio (current / previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5500 ns 5375 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5084 ns 5250 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5625.5 ns 7708.5 ns 0.73
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5708 ns 5416 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 103933 ns 113361 ns 0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2722154 ns 2795172 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 413664 ns 601544 ns 0.69
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10083 ns 9729.5 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10208.5 ns 9938 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10021 ns 10167 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10104.5 ns 11063 ns 0.91
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 536028 ns 544547 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19483150 ns 17852957 ns 1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 683066 ns 629346 ns 1.09
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1458 ns 1500 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1667 ns 1458 ns 1.14
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1771 ns 1771 ns 1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 2292 ns 1583 ns 1.45
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 20317 ns 20770 ns 0.98
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1310053 ns 1342503 ns 0.98
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 31070.5 ns 30997 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4020.5 ns 4104 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4042 ns 4500 ns 0.90
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4604 ns 4500 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4125 ns 4333 ns 0.95
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 132014.5 ns 134970 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 8955699 ns 8677498 ns 1.03
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 145462 ns 138579 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57750 ns 57666.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46917 ns 46875 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46791 ns 47125 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82542 ns 81458 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36903 ns 36587 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 570577 ns 582336 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 79861 ns 69420 ns 1.15
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2033166 ns 2030375 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2090875 ns 2088625 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2087542 ns 2086625 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1999667 ns 1998562 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 217109 ns 217216 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 7863243 ns 8077777 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 956020 ns 930850 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 148333 ns 175083 ns 0.85
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144145.5 ns 147291 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 151208 ns 150021 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 177166 ns 151750 ns 1.17
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166653.5 ns 166825 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7677064 ns 7358467.5 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 177102 ns 262570 ns 0.67
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1118667 ns 1115103.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1120208 ns 1110771 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1117479.5 ns 1113771 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1119312 ns 1136250 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 616442 ns 639845.5 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31939588 ns 33057102 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1022810.5 ns 864075 ns 1.18
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6417 ns 3792 ns 1.69
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4209 ns 4479 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5083 ns 6583 ns 0.77
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3917 ns 6375 ns 0.61
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 79406.5 ns 85209.5 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5085855.5 ns 5875726.5 ns 0.87
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 69810 ns 59531 ns 1.17
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8500 ns 8417 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8458 ns 8750 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8541 ns 9042 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8292 ns 8958 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 532320 ns 557500.5 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 35070703.5 ns 34838164 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 371294 ns 370833 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17458 ns 17958 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18000 ns 16458 ns 1.09
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19687.5 ns 21125 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16979.5 ns 17292 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 62297 ns 63776.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3097027 ns 2927491.5 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78650 ns 82870 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220625 ns 212625 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 213209 ns 213042 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 218833 ns 212771 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213708 ns 212291 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 324224 ns 329859 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 14318744 ns 12611094 ns 1.14
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 466149 ns 405232 ns 1.15
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 667 ns 0.94
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 625 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 833 ns 875 ns 0.95
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 709 ns 0.88
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 18901 ns 19101 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1163594 ns 1145778 ns 1.02
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 33020 ns 26409 ns 1.25
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1417 ns 1458 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1416 ns 1334 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1584 ns 1583 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1375 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 114703 ns 117126.5 ns 0.98
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 9022001 ns 8850213 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 131551 ns 115676 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7375 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 6041 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6084 ns 6084 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10042 ns 9958 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23484.5 ns 23587 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1176408.5 ns 1261233 ns 0.93
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49200 ns 52723 ns 0.93
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 259375 ns 229167 ns 1.13
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 262167 ns 230667 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 266896 ns 267875 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226813 ns 257458 ns 0.88
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 187535 ns 182744 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 30816922.5 ns 32590762.5 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 601896 ns 548449.5 ns 1.10
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3958 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3916 ns 3917 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23548 ns 22860 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 1971367 ns 1933593 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 48141 ns 39504 ns 1.22
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 17375 ns 17042 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16792 ns 16875 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17084 ns 17083 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16917 ns 16875 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 185511 ns 185787.5 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 12296122.5 ns 10029430 ns 1.23
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 174551.5 ns 162052 ns 1.08
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 934166.5 ns 491583 ns 1.90
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 835667 ns 385625 ns 2.17
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 846958 ns 386458 ns 2.19
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 1262937.5 ns 844083 ns 1.50
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113311 ns 113763 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 400768 ns 418213 ns 0.96
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 243532 ns 388657 ns 0.63
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2602833.5 ns 2155583 ns 1.21
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2321062.5 ns 1863374.5 ns 1.25
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2334542 ns 1865167 ns 1.25
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3549145.5 ns 3377520.5 ns 1.05
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 227551 ns 229580 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 12523677 ns 9922983 ns 1.26
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 741438 ns 610962 ns 1.21
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6416.5 ns 6500 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5917 ns 5500 ns 1.08
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7270.5 ns 7667 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6229 ns 5167 ns 1.21
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 84061.5 ns 84720.5 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5499782 ns 5300415 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 57281 ns 59932 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11333.5 ns 11229 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11667 ns 11395.5 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11625 ns 12334 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11271 ns 10667 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 597623.5 ns 602168 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 39133306 ns 38613143.5 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 397753.5 ns 383917 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23645.5 ns 23328 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2233296 ns 2178076 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 46720 ns 41367 ns 1.13
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2084 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2209 ns 2166 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2167 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2084 ns 2084 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 220079.5 ns 228927.5 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 11462778.5 ns 11774524 ns 0.97
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 170107 ns 165900 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8500 ns 9584 ns 0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8708 ns 8333 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10791.5 ns 9895.5 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8542 ns 8542 ns 1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 94772.5 ns 105241 ns 0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3385215 ns 3103348.5 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 73640 ns 71955 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17625 ns 17688 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18000 ns 16666.5 ns 1.08
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18875 ns 18708 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17375 ns 17562 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 540650 ns 595171 ns 0.91
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 17058901 ns 16252508 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 377674 ns 358129 ns 1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 541 ns 542 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 458 ns 1.36
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 34202 ns 34578 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1244726 ns 1237584 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 46001 ns 41387 ns 1.11
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9479 ns 9229 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9542 ns 8958.5 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10166 ns 9750 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8666.5 ns 8104 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 244744 ns 257823 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18753058.5 ns 18331589 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 364583 ns 349944 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397209 ns 397270.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288125 ns 288083 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288083 ns 288666.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 745000 ns 751792 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111800.5 ns 112022 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 333937 ns 349915 ns 0.95
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 74491 ns 74609 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1461604 ns 1454270.5 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1130625 ns 1130500 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1130250 ns 1131583 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2406792 ns 2437959 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 197620.5 ns 200057 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 10419288 ns 7687949 ns 1.36
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 322733 ns 302285 ns 1.07
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6875 ns 7750 ns 0.89
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6979 ns 7083.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8791.5 ns 8312.5 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6708.5 ns 6687.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 132998 ns 139766 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5954726 ns 5685169 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 58080 ns 60383 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14625 ns 13479.5 ns 1.08
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13958.5 ns 12750 ns 1.09
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15833.5 ns 15125 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14854.5 ns 14625.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 894545 ns 923489 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 44910299 ns 42519536.5 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 416034 ns 407432 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 26042 ns 25625 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 27625 ns 23666 ns 1.17
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 26166 ns 29417 ns 0.89
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24584 ns 24041 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 184672 ns 186240.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7678081.5 ns 7554376 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 115321 ns 120505 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 152750 ns 152187 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 149312.5 ns 145250 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104396 ns 146917 ns 0.71
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 146333 ns 103958 ns 1.41
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1006389 ns 1013659 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42603061 ns 44493070 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 580066 ns 535240 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 84333 ns 74583 ns 1.13
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 79333.5 ns 79584 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76625 ns 76791.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 75208 ns 76083 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 189509.5 ns 190594.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7761115 ns 7364811 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 124941 ns 121316.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 280666.5 ns 273562.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 290479 ns 304084 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 269708 ns 303333 ns 0.89
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 282833.5 ns 307583 ns 0.92
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1016176 ns 1045024 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42923658 ns 39473308 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 694507 ns 624192 ns 1.11
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 13041 ns 12417 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12875 ns 12896 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14229 ns 14000 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12667 ns 12500 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 135169.5 ns 138416 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5673569 ns 5479910 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 231712.5 ns 226152 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27062.5 ns 27792 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26854 ns 26458 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27917 ns 28437.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27416.5 ns 33937.5 ns 0.81
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 904285 ns 924126.5 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41716960 ns 42086872 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 692027 ns 610976 ns 1.13
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 14041 ns 11124.5 ns 1.26
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 13792 ns 10333 ns 1.33
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 16604 ns 12479.5 ns 1.33
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 14000 ns 11125 ns 1.26
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 117799 ns 118543.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3485511 ns 3443799.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 239113 ns 233176 ns 1.03
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25542 ns 22291.5 ns 1.15
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26979 ns 22417 ns 1.20
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27104.5 ns 24167 ns 1.12
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 25875 ns 28562.5 ns 0.91
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 660016 ns 668341 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 21973748 ns 21034051 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 670591.5 ns 569113 ns 1.18
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 186729 ns 68709 ns 2.72
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 181959 ns 62750 ns 2.90
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185145.5 ns 67520.5 ns 2.74
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 185000 ns 64417 ns 2.87
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 101640.5 ns 102389 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3356748 ns 3441143 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 233872 ns 230751 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 584292 ns 506375 ns 1.15
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 590792 ns 510167 ns 1.16
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 590750.5 ns 475209 ns 1.24
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 590667 ns 647896 ns 0.91
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 489282 ns 492781 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 20715079 ns 20664230 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 710517 ns 593680 ns 1.20
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7375 ns 7958 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7271 ns 6750 ns 1.08
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8375 ns 8208 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7374.5 ns 7562.5 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 134121.5 ns 137965 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5728586.5 ns 5508177.5 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 57391 ns 62687 ns 0.92
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13125 ns 16125 ns 0.81
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14417 ns 16250 ns 0.89
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14333 ns 16250 ns 0.88
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15625 ns 14833 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 874264.5 ns 900927 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 38435171 ns 39349971 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 395809 ns 388286 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6152167 ns 6150354 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6375812.5 ns 6368167 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6376958 ns 6373937.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11905750 ns 11915167 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 349172 ns 345749 ns 1.01
batchedmm(512, Bsize=4)/forward/GPU/oneAPI 52345310.5 ns 49052559 ns 1.07
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 303913 ns 388426 ns 0.78
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19121687.5 ns 19083437.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19964166 ns 19960479.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19966167 ns 19966834 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36940917 ns 37142104 ns 0.99
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1051854 ns 1072087 ns 0.98
batchedmm(512, Bsize=4)/zygote/GPU/oneAPI 77616747 ns 78467188 ns 0.99
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1165782 ns 1035750.5 ns 1.13
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 958 ns 1.91
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1833 ns 1000 ns 1.83
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1875 ns 1042 ns 1.80
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1792 ns 958 ns 1.87
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23638 ns 23415 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 1954095 ns 2079171 ns 0.94
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 209392 ns 200906 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4792 ns 3917 ns 1.22
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4917 ns 4000 ns 1.23
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5000 ns 4041 ns 1.24
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4916 ns 5458 ns 0.90
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 272861 ns 270573.5 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10678638 ns 10484095 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 619377 ns 486775 ns 1.27
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7771 ns 8687 ns 0.89
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7666 ns 7459 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9187 ns 9334 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7604.5 ns 7834 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 114805.5 ns 116220 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3374772.5 ns 3435001.5 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 67791 ns 71133 ns 0.95
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11625 ns 12125 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11959 ns 11958 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12416 ns 13000 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11750 ns 11750 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 596992 ns 609643.5 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22147198 ns 21784602 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 351603 ns 341729 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22715 ns 22413 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2095326 ns 2035110 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 47561 ns 44053 ns 1.08
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3333 ns 3000 ns 1.11
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2917 ns 2917 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3250 ns 3208 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 3042 ns 2916 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 193310 ns 194923.5 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 10025999 ns 9225861.5 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 158052 ns 154488.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 14625.5 ns 11625 ns 1.26
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 14167 ns 10500 ns 1.35
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 16542 ns 12875 ns 1.28
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 14000 ns 11875 ns 1.18
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 114348.5 ns 115370 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3449562.5 ns 3433218 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 236323 ns 231793 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25542 ns 22667 ns 1.13
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25709 ns 22104.5 ns 1.16
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 26500 ns 23625 ns 1.12
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25646 ns 26729 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 549754 ns 555861 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20207967.5 ns 20482208 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 649557 ns 545740 ns 1.19
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4167 ns 4334 ns 0.96
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4208 ns 4333 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4209 ns 4208 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4209 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24817 ns 23923 ns 1.04
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2004466.5 ns 2205811 ns 0.91
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 48320 ns 44864 ns 1.08
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16167 ns 16500 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16625 ns 16333 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16333 ns 16166 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16208 ns 16292 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 315841 ns 319806 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12092472 ns 10190777 ns 1.19
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 205562.5 ns 186077 ns 1.10
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5708 ns 2125 ns 2.69
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6000 ns 2084 ns 2.88
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5916 ns 2209 ns 2.68
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5958 ns 2000 ns 2.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 34686 ns 35327 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1229247 ns 1213779 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 206283 ns 199242 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 21291 ns 17104 ns 1.24
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 22459 ns 20167 ns 1.11
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21979.5 ns 19000 ns 1.16
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 19750 ns 23083.5 ns 0.86
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 280754 ns 284984 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19480504 ns 18211018 ns 1.07
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 682467 ns 583431 ns 1.17
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 59375 ns 59458 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 65187.5 ns 65666 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 65875 ns 66125 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51292 ns 52833 ns 0.97
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66425.5 ns 66304 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/oneAPI 87165834 ns 87707222.5 ns 0.99
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 97421 ns 110241 ns 0.88
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 192313 ns 153041 ns 1.26
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 164333.5 ns 155229 ns 1.06
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 159791.5 ns 130209 ns 1.23
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 316000 ns 286334 ns 1.10
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 207817.5 ns 210129.5 ns 0.99
batchedmm(16, Bsize=512)/zygote/GPU/oneAPI 150011863.5 ns 149924497 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 568915 ns 511145 ns 1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 83750 ns 106521 ns 0.79
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 123375 ns 78958 ns 1.56
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 85916 ns 84042 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 91500 ns 115521 ns 0.79
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192420 ns 191513.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5523456 ns 5334020 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 169002 ns 267630 ns 0.63
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1886125 ns 1894896 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1898541.5 ns 1902375 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1924000 ns 1878334 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1918083 ns 1895250 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 504156 ns 507442 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 26851109 ns 28152566.5 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 921499.5 ns 825763 ns 1.12
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21670 ns 21516 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2049467 ns 2100524 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 40101 ns 35507 ns 1.13
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1834 ns 1875 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1834 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1834 ns 1833 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 240558 ns 245735 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 10170156.5 ns 9780504 ns 1.04
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 177582 ns 164548 ns 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8958.5 ns 10916 ns 0.82
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8666.5 ns 8291 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11458 ns 11146 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8354.5 ns 9500 ns 0.88
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 113347 ns 114788 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3369620 ns 3351587 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 234943 ns 232004 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10000 ns 8916 ns 1.12
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9958 ns 8854.5 ns 1.12
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10541 ns 10917 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9833 ns 9583 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 486676.5 ns 491693 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19572305 ns 19969043 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 630451.5 ns 536332 ns 1.18
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57750 ns 57958 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47417 ns 46625 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46708 ns 46750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81042 ns 83166 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38230 ns 38476.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1397410 ns 1460287 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 77240 ns 71814 ns 1.08
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1885750 ns 1905145.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1962125 ns 1949542 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1974167 ns 1958500 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1846708.5 ns 1874958 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 209291 ns 212675 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32347400 ns 33332615 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1006711 ns 968925.5 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 269437.5 ns 267500 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 267604 ns 271479.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 267792 ns 271209 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267125 ns 268209 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 192943 ns 194219.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7708822 ns 7638787 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 283363 ns 271267 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 585041 ns 585333.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 596916.5 ns 600292 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 613750 ns 671042 ns 0.91
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 587542 ns 845604.5 ns 0.69
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 984053 ns 991966 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45660970.5 ns 42952243 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 897330 ns 831153 ns 1.08
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2242875 ns 2211666 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2212687.5 ns 2203958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2178500 ns 2229083 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2203687 ns 2173792 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 159048 ns 161646 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7948607 ns 8668502.5 ns 0.92
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 410544 ns 470965 ns 0.87
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5304459 ns 5493104.5 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5589291.5 ns 5515875 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5525312.5 ns 5526542 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5503750 ns 6852458 ns 0.80
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 923591 ns 959137 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50777266 ns 49532486 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1712757.5 ns 1437405 ns 1.19
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 988917 ns 478292 ns 2.07
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 915208 ns 345625 ns 2.65
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 913708 ns 346750 ns 2.64
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 1332562.5 ns 908542 ns 1.47
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46427 ns 46909 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 877905.5 ns 871386 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 242262 ns 393175 ns 0.62
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2614104.5 ns 2137500 ns 1.22
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2325354 ns 1869334 ns 1.24
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2332000 ns 1859271 ns 1.25
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3535541.5 ns 3380209 ns 1.05
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 243101.5 ns 264095.5 ns 0.92
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 15142156 ns 13390420 ns 1.13
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 767673 ns 632907.5 ns 1.21
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57667 ns 57458 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46667 ns 46166 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46375 ns 46250 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82541 ns 78667 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28222 ns 28560 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1376753 ns 1394875.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 74791 ns 73147 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027458.5 ns 2029292 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2099250.5 ns 2078187.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2083875 ns 2063250 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2005208 ns 1963958 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 222327 ns 230846.5 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 35859255 ns 36347331 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1037291 ns 980522 ns 1.06
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58041 ns 58083.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47292 ns 46584 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46833 ns 46917 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 79458 ns 79958 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 48075 ns 48944 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 799306 ns 829446 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 68641 ns 71428.5 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1892042 ns 1871729 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1961875 ns 1973604 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1967292 ns 1944167 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1849541.5 ns 1876792 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 229908 ns 238010 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 17840443.5 ns 18705710.5 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 909929 ns 881607.5 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 291 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 33977 ns 34878 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1467230.5 ns 1190778.5 ns 1.23
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 47440 ns 47028 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6834 ns 6270.5 ns 1.09
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7000 ns 6187.5 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7583.5 ns 7375 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6375 ns 6125 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 199504 ns 211705.5 ns 0.94
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 19180704 ns 20119098 ns 0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 364834 ns 332741 ns 1.10
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32545 ns 32902 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1210755.5 ns 1224139 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 37950 ns 36327 ns 1.04
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 3083 ns 2667 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3292 ns 2667 ns 1.23
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3042 ns 4292 ns 0.71
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2917 ns 3167 ns 0.92
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 181917.5 ns 187662.5 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 7985622 ns 5673429 ns 1.41
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 144651.5 ns 136635 ns 1.06
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1411437.5 ns 467208 ns 3.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1417562.5 ns 469417 ns 3.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1418583 ns 466875 ns 3.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1409854.5 ns 464979.5 ns 3.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 135301 ns 137312 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5882962.5 ns 5812904.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 323274 ns 361475 ns 0.89
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5017125 ns 4027749.5 ns 1.25
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5053042 ns 4071500 ns 1.24
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5029562.5 ns 4067417 ns 1.24
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4977854 ns 5516750 ns 0.90
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 669461.5 ns 690445 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31806387 ns 32063716 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1465840.5 ns 1091915 ns 1.34
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49837750 ns 49879250 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35538000 ns 35487583 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35513667 ns 35512833.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 98403875 ns 96974083 ns 1.01
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1620153 ns 1622377 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/oneAPI 56126123 ns 55868634.5 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1047301 ns 1579230 ns 0.66
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154556249.5 ns 154423062.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112408270.5 ns 112364750 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112220083 ns 112377416 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 299786083 ns 299989812 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6499320.5 ns 6468945 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/oneAPI 126387633 ns 126761495 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5555408 ns 7230228 ns 0.77
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47417 ns 19104.5 ns 2.48
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 48541 ns 18375 ns 2.64
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 48041.5 ns 17375.5 ns 2.76
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47500 ns 15083 ns 3.15
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 20075 ns 19621 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1215909.5 ns 1223248 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 25930 ns 28854 ns 0.90
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50917 ns 11062.5 ns 4.60
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50521 ns 8833 ns 5.72
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50854.5 ns 9291 ns 5.47
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 51020.5 ns 17667 ns 2.89
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 249429.5 ns 252067.5 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 10392543.5 ns 9844493 ns 1.06
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 145712 ns 138484 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8416 ns 7937.5 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8125 ns 8125 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10333 ns 10375 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7959 ns 8708 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 119433.5 ns 120230.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3587986.5 ns 3557828.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 234342 ns 235119 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9916 ns 9708 ns 1.02
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11000 ns 9084 ns 1.21
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11084 ns 9792 ns 1.13
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9959 ns 10667 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 588933 ns 599437 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 21671588 ns 22720103 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 652456 ns 557070 ns 1.17
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8875 ns 9291.5 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8833 ns 8812.5 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11375 ns 9917 ns 1.15
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8750 ns 8958.5 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 116580 ns 118821 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3383763 ns 3465548.5 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 70650 ns 71593 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 15499.5 ns 13687.5 ns 1.13
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 14833 ns 13604.5 ns 1.09
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 15125.5 ns 14395.5 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 14083.5 ns 14750 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 561017 ns 570663 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19949150 ns 20121784.5 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 344553 ns 323504 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 958 ns 542 ns 1.77
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 625 ns 1.73
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 584 ns 1.85
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 500 ns 2.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 34592 ns 35088 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1144490 ns 1218149.5 ns 0.94
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 206327.5 ns 203871 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 7562.5 ns 1.12
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9500 ns 7667 ns 1.24
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8917 ns 7875 ns 1.13
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8167 ns 8520.5 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 228959.5 ns 227876 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 23249864.5 ns 22566032 ns 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 656487 ns 569945 ns 1.15
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23250 ns 16458 ns 1.41
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23666 ns 17041 ns 1.39
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 24167 ns 16209 ns 1.49
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23542 ns 10979 ns 2.14
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 20583 ns 20941 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1139709 ns 1150830 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 188752 ns 182992 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 54437 ns 35666 ns 1.53
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52500 ns 35167 ns 1.49
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 53667 ns 36000 ns 1.49
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 53125 ns 57833 ns 0.92
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 261115 ns 265749 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 10843834 ns 12188303 ns 0.89
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 590356 ns 534293 ns 1.10
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1410938 ns 447500 ns 3.15
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1407250 ns 488042 ns 2.88
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1404958.5 ns 455709 ns 3.08
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1440583 ns 496916 ns 2.90
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 195363 ns 195513 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5637745 ns 5997948.5 ns 0.94
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 345038.5 ns 328714 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5017125 ns 4024209 ns 1.25
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5068833 ns 4055021 ns 1.25
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5019833.5 ns 4053917 ns 1.24
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4892041.5 ns 5501562.5 ns 0.89
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 516054 ns 521631.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28647251.5 ns 27256015 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1200248 ns 1059038 ns 1.13
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 828633875 ns 836727208 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 550214334 ns 553913292 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 540750875 ns 540736625 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1588891625.5 ns 1517196875 ns 1.05
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22557910 ns 22767789 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/oneAPI 176100339 ns 174930068 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14645884 ns 10331681 ns 1.42
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 3850981209 ns 3773348667 ns 1.02
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1775008333 ns 1782084291 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 2246599000 ns 1780399750 ns 1.26
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 6357143209 ns 4786718666 ns 1.33
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118713618 ns 118657187 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/oneAPI 987258982 ns 1332561794 ns 0.74
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 87244864 ns 67063298 ns 1.30
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 78500 ns 76542 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76708 ns 76584 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79542 ns 79583 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 77541 ns 76708.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 195507.5 ns 195943.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 5538798 ns 5455658.5 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 107411.5 ns 123300.5 ns 0.87
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 279395.5 ns 191292 ns 1.46
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 194208.5 ns 252042 ns 0.77
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 266709 ns 199562.5 ns 1.34
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 194416.5 ns 225542 ns 0.86
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1001402 ns 1004442 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42684248 ns 43458500 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 631486 ns 590764 ns 1.07
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199442208.5 ns 199694520.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 138679541 ns 138856500 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139099334 ns 139241166 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 394971583 ns 393790959 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5839594 ns 5842492 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/oneAPI 78661896 ns 78913006.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3608983 ns 4746717.5 ns 0.76
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 617352375.5 ns 617676375.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 438426875 ns 439446917 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 439509500 ns 439765166.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1199223417 ns 1174222000 ns 1.02
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26592101 ns 26723523 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/oneAPI 285808418.5 ns 276392509 ns 1.03
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21811069 ns 15854720 ns 1.38
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7334 ns 7292 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6167 ns 6125 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 5959 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9917 ns 9834 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26841 ns 26896.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1234360 ns 1173091 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 46450 ns 55173 ns 0.84
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213895.5 ns 213041.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 266937.5 ns 227729 ns 1.17
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222916 ns 220416.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 208083 ns 206125 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 218249 ns 219868 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19255466 ns 20153337 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 528005 ns 541982 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8250 ns 8521 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7708 ns 7458 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9500 ns 11167 ns 0.85
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6834 ns 9250 ns 0.74
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 114813.5 ns 115361 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3509204 ns 3392154.5 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 70165.5 ns 74069 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9042 ns 7562.5 ns 1.20
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8500 ns 7958 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8875 ns 8167 ns 1.09
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 7395.5 ns 1.13
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 492349 ns 495697 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19405923 ns 20965461 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 317353 ns 309298 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 417 ns 417 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 542 ns 459 ns 1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 500 ns 500 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 375 ns 1.45
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 25659.5 ns 26124 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1207404 ns 1243719 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 46461 ns 45334 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9729.5 ns 9584 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9917 ns 9062.5 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9999.5 ns 9792 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9291.5 ns 9542 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 247915 ns 247606 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 22484624 ns 24899790.5 ns 0.90
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 383749 ns 382304 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 351458 ns 112312.5 ns 3.13
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351708 ns 103229 ns 3.41
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 353417 ns 104104.5 ns 3.39
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 353583 ns 155083 ns 2.28
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 23647 ns 23501 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 817717.5 ns 811475 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 188532 ns 192539 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 779374.5 ns 536562 ns 1.45
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 778292 ns 554250 ns 1.40
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 785770.5 ns 535291.5 ns 1.47
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 816333.5 ns 910854 ns 0.90
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 219037 ns 221242 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 12364678.5 ns 11751092 ns 1.05
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 605051.5 ns 560216.5 ns 1.08
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 4937.5 ns 5416.5 ns 0.91
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 5708 ns 6208.5 ns 0.92
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 6458 ns 6021 ns 1.07
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 6417 ns 4000 ns 1.60
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17947.5 ns 17520 ns 1.02
batchedmm(16, Bsize=32)/forward/GPU/oneAPI 73120248 ns 72849606 ns 1.00
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 77421 ns 73648 ns 1.05
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 12313 ns 11562.5 ns 1.06
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 11646 ns 11062 ns 1.05
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11542 ns 11000 ns 1.05
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 17125 ns 16666 ns 1.03
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 206798 ns 207455.5 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/oneAPI 99646801 ns 97442684 ns 1.02
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 364324 ns 330387 ns 1.10
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39729 ns 39667 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 51375 ns 51291 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52437.5 ns 52958.5 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 14000 ns 13625 ns 1.03
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22811 ns 20356 ns 1.12
batchedmm(16, Bsize=128)/forward/GPU/oneAPI 76357271 ns 76663129 ns 1.00
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 85726 ns 98364 ns 0.87
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 42979 ns 36375.5 ns 1.18
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 31833.5 ns 31417 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 32125 ns 31229.5 ns 1.03
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 65271 ns 57000 ns 1.15
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 184581 ns 184178 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/oneAPI 112866310 ns 111708023 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 405654 ns 355254 ns 1.14
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3500 ns 1750 ns 2.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3625 ns 2042 ns 1.78
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 4042 ns 2208 ns 1.83
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3584 ns 1875 ns 1.91
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 19709 ns 19575 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1207802 ns 1219758.5 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 29141 ns 29099.5 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4208 ns 2208 ns 1.91
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4333 ns 2167 ns 2.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4583 ns 2375 ns 1.93
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4250 ns 2208 ns 1.92
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 197529 ns 198996.5 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 10465817 ns 8766738.5 ns 1.19
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 136751 ns 128571 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6250 ns 4583 ns 1.36
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4542 ns 4417 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6229 ns 6729 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4833 ns 3958 ns 1.22
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 141517 ns 143699.5 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5697543 ns 5704411.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 61831 ns 61955.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8375 ns 8334 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8583 ns 8083.5 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8666.5 ns 8709 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8791 ns 8583 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 825195 ns 836045.5 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 39430730.5 ns 39725172 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 382689 ns 364891 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204959 ns 54833 ns 3.74
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 211312.5 ns 55833 ns 3.78
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 210042 ns 55583 ns 3.78
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200833 ns 56000 ns 3.59
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36707 ns 36570 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1193144 ns 1345223 ns 0.89
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 208072 ns 202568 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 649875 ns 476729 ns 1.36
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 622959 ns 494500 ns 1.26
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 627750 ns 494208 ns 1.27
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 626583 ns 641625 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 260696 ns 259886 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27024723 ns 28017517.5 ns 0.96
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 800078 ns 705894 ns 1.13
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3314520.5 ns 3310333 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2333042 ns 2334062.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2334667 ns 2333375 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6298459 ns 6300479 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 205748 ns 204581.5 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/oneAPI 76861698.5 ns 77398976 ns 0.99
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 216863 ns 373097 ns 0.58
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11451687 ns 11459729 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8308625 ns 8305729.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8341250 ns 8342854 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21350500 ns 21088292 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 733789 ns 744676 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/oneAPI 121292322.5 ns 121497637 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1069846 ns 1994797.5 ns 0.54
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6250.5 ns 4833 ns 1.29
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4625 ns 4646 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6562 ns 7520.5 ns 0.87
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5604.5 ns 4917 ns 1.14
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 131336.5 ns 133339 ns 0.98
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5403092 ns 5450569.5 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 55485.5 ns 61520 ns 0.90
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 7083 ns 1.19
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10417 ns 7291.5 ns 1.43
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 7500 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8167 ns 7416.5 ns 1.10
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 720749 ns 725863 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 38254432 ns 33872141 ns 1.13
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 369044 ns 353680 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 146937.5 ns 100459 ns 1.46
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 119354 ns 123042 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 99458 ns 102417 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 119750 ns 121458.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 151150 ns 151940.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 6157451.5 ns 5695179 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 182732 ns 233346 ns 0.78
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2012395.5 ns 2033271 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2034875 ns 2026417 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2027292 ns 1997458.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2040354 ns 2041833 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 677835 ns 678763 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31280000.5 ns 31810809 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1115061 ns 931831 ns 1.20
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 34166 ns 32666 ns 1.05
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 36583 ns 36562.5 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 36125 ns 36167 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 583 ns 667 ns 0.87
batchedmm(2, Bsize=4)/forward/GPU/CUDA 16380 ns 15627 ns 1.05
batchedmm(2, Bsize=4)/forward/GPU/oneAPI 72098869 ns 72187220 ns 1.00
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 78340 ns 70121 ns 1.12
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2959 ns 2604.5 ns 1.14
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 3500 ns 2958 ns 1.18
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3083 ns 2937.5 ns 1.05
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2500 ns 2167 ns 1.15
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 139410.5 ns 139744 ns 1.00
batchedmm(2, Bsize=4)/zygote/GPU/oneAPI 92975756 ns 92749943 ns 1.00
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 338763.5 ns 289641 ns 1.17
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7208 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 6000 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 5916 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9875 ns 9917 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36097 ns 35855 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1203254 ns 1252207 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47460 ns 53911 ns 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 244875 ns 212958.5 ns 1.15
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221041.5 ns 222708 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221708 ns 219917 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 211396 ns 206209 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 243135 ns 243430 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25810257 ns 27468024.5 ns 0.94
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 502405 ns 513269 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3791 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22535 ns 21959 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 1978273 ns 2194149 ns 0.90
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 43281 ns 35557 ns 1.22
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14459 ns 14500 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14709 ns 14500 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14667 ns 14500 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14500 ns 14459 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 303497 ns 302419 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11149007.5 ns 11036089 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 195222 ns 179841 ns 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 128875 ns 128041 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 127875 ns 144417 ns 0.89
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 103500 ns 106917 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 118729 ns 151959 ns 0.78
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 135839 ns 140874 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5908414 ns 5963081 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 168882 ns 236762 ns 0.71
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1884000 ns 1924583 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1930708 ns 1920500 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1926583.5 ns 1914229.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1718687.5 ns 1928875 ns 0.89
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 666777 ns 673452 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 30322056 ns 29935915 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1214247.5 ns 899671 ns 1.35
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18000 ns 17333 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18166.5 ns 17354.5 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20292 ns 21208 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18167 ns 17375 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107411 ns 108833.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3376295 ns 3415955 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78171 ns 91100 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 239666.5 ns 216917 ns 1.10
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218395.5 ns 252646 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223333 ns 222166 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 231958.5 ns 229125 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 503307.5 ns 508535.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 21773470.5 ns 19323488.5 ns 1.13
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 479765 ns 419764 ns 1.14
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 27875 ns 24271 ns 1.15
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 32604.5 ns 30791.5 ns 1.06
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 29749.5 ns 29437.5 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1209 ns 1584 ns 0.76
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16715.5 ns 16398 ns 1.02
batchedmm(16, Bsize=4)/forward/GPU/oneAPI 71649972 ns 72518390 ns 0.99
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 86706 ns 76093 ns 1.14
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 5167 ns 4500 ns 1.15
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5750 ns 4916 ns 1.17
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5208 ns 5125 ns 1.02
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 5145.5 ns 4625 ns 1.11
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 203064.5 ns 204364 ns 0.99
batchedmm(16, Bsize=4)/zygote/GPU/oneAPI 93125518 ns 94073985 ns 0.99
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 389114 ns 331675 ns 1.17
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 226729.5 ns 222666 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 221083 ns 220666.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 224687.5 ns 225667 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 222958 ns 220583 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 220347 ns 222506.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7691389.5 ns 7881934.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 273573 ns 267871 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 546375 ns 495084 ns 1.10
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 532104.5 ns 511812.5 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 551562.5 ns 500854 ns 1.10
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 560666.5 ns 675750 ns 0.83
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1047561 ns 1053634 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 45073460 ns 42862742 ns 1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 850834 ns 780999 ns 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19687.5 ns 20375 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19792 ns 20000 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21500 ns 23875 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20854.5 ns 18792 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 114522.5 ns 114286 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3434961.5 ns 3510843 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79215.5 ns 89858 ns 0.88
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214500.5 ns 212375 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218000 ns 213041 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220833.5 ns 214458 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226166.5 ns 212541 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 722949 ns 727333.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25437401 ns 24570511 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 539225.5 ns 469036 ns 1.15
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6604 ns 6666 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6333.5 ns 6604.5 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8334 ns 8750.5 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5937 ns 6208 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 134733 ns 137142 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5747111 ns 5605207 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 65771 ns 60974 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11834 ns 9791 ns 1.21
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 14209 ns 10084 ns 1.41
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10583 ns 10750 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11375 ns 10750 ns 1.06
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 785256 ns 794651.5 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 33719661 ns 37034174 ns 0.91
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 375714 ns 370101.5 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4562 ns 4666 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4874.5 ns 4708 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7042 ns 7437.5 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6084 ns 4917 ns 1.24
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 137237 ns 138544.5 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5382212 ns 5520602 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 56651 ns 59692 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8229.5 ns 7458 ns 1.10
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7709 ns 7166 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7833 ns 7791 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7937.5 ns 7708 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 748170 ns 755761 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 39081859 ns 37179182 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 390474 ns 376523 ns 1.04
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14481645.5 ns 14498417 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10092792 ns 10124125 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10114250 ns 10094833 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27708083 ns 27748583.5 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 532624 ns 532665 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/oneAPI 94993046.5 ns 94795139 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 395044 ns 866850 ns 0.46
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46261583.5 ns 46333437 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33410959 ns 33447541.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33486333 ns 33510458 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 86587583 ns 85445667 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2654436 ns 2636151 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/oneAPI 194923650.5 ns 192783631 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3295955 ns 5189385.5 ns 0.64
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 190708.5 ns 66458 ns 2.87
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 186083.5 ns 65687.5 ns 2.83
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 188146 ns 70500 ns 2.67
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 185917 ns 66500 ns 2.80
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 118475.5 ns 118172.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3648822.5 ns 3662360 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 231833 ns 237313 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 639375 ns 467958 ns 1.37
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 598729.5 ns 480333.5 ns 1.25
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 616416.5 ns 474916.5 ns 1.30
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 595166.5 ns 686583.5 ns 0.87
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 713561 ns 715446 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25853973 ns 26609747 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 789453.5 ns 655875 ns 1.20
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 667 ns 542 ns 1.23
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 708 ns 625 ns 1.13
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 666 ns 500 ns 1.33
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 33227.5 ns 32877 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1196192 ns 1227269 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 49141 ns 47579 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9208 ns 8750 ns 1.05
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10375 ns 9208 ns 1.13
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9916 ns 9104.5 ns 1.09
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12625 ns 9750 ns 1.29
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 282925 ns 280778.5 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21246808 ns 21881943 ns 0.97
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 372549 ns 355484 ns 1.05
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26291 ns 9500 ns 2.77
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26291 ns 9500 ns 2.77
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26667 ns 9500 ns 2.81
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26292 ns 9500 ns 2.77
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23969 ns 23273 ns 1.03
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2066696 ns 1862112.5 ns 1.11
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 211022 ns 200655 ns 1.05
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67167 ns 50209 ns 1.34
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67459 ns 50250 ns 1.34
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 68250 ns 50500 ns 1.35
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67604.5 ns 72375 ns 0.93
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 278435 ns 278469.5 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11499929 ns 13204061 ns 0.87
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 607747 ns 491037 ns 1.24
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203667 ns 54917 ns 3.71
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 211000 ns 55667 ns 3.79
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209209 ns 55584 ns 3.76
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199875 ns 56000 ns 3.57
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27769 ns 28169 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1427804.5 ns 1174691 ns 1.22
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 204902 ns 203240 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 647833.5 ns 518854 ns 1.25
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 672374.5 ns 500625 ns 1.34
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 627625 ns 497750 ns 1.26
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 591021 ns 643417 ns 0.92
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 238384 ns 238777 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 32020482 ns 31628121.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 835558 ns 758938 ns 1.10
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 677042 ns 655042 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 644084 ns 613083 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 624417 ns 652541 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 652000 ns 678416.5 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 191709.5 ns 192069 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8119814 ns 8140636 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 250313 ns 269704 ns 0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2245166.5 ns 2167104.5 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2263083.5 ns 2233125 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2243937 ns 2241292 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1633354 ns 2230208.5 ns 0.73
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 919212 ns 929752.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 49553095.5 ns 55073105 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1363434 ns 1217770.5 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24313 ns 19500 ns 1.25
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19166.5 ns 19208.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22000 ns 23542 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20124.5 ns 20000 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 110904.5 ns 111306 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3666568 ns 3589059.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78541 ns 91551 ns 0.86
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 261875 ns 220459 ns 1.19
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 240917 ns 226458 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 231187 ns 223104.5 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 233562.5 ns 219708 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 712476 ns 714110 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 24685366 ns 26626181 ns 0.93
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 557241 ns 487481 ns 1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 708 ns 583 ns 1.21
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 584 ns 1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23419 ns 23491 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1200557 ns 1232519 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 47620 ns 43771 ns 1.09
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9917 ns 9417 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 11250 ns 9291.5 ns 1.21
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10937.5 ns 9708 ns 1.13
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10666 ns 9646 ns 1.11
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 262591 ns 261581 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 23985377 ns 23734390 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 399964 ns 381618 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7916.5 ns 8917 ns 0.89
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7958.5 ns 7583 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10041.5 ns 11854.5 ns 0.85
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8041.5 ns 9042 ns 0.89
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 115917 ns 115935.5 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3281620 ns 3441325 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 66841 ns 70456.5 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7916 ns 8125 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8917 ns 7542 ns 1.18
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7875 ns 8000 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9709 ns 7292 ns 1.33
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 480674 ns 484010 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17996514 ns 17813154.5 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 321543 ns 302215 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2104.5 ns 1417 ns 1.49
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2375 ns 1667 ns 1.42
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2542 ns 1959 ns 1.30
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2396 ns 1500 ns 1.60
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 20098 ns 20030 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1066690 ns 1146657 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 190702 ns 184144 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6292 ns 3708 ns 1.70
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6833 ns 3625 ns 1.88
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6750 ns 3833 ns 1.76
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6834 ns 4917 ns 1.39
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 211335.5 ns 213101.5 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 9983042 ns 10511562.5 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 579151 ns 524324.5 ns 1.10
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 747417 ns 148729 ns 5.03
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 750542 ns 128917 ns 5.82
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 747271 ns 129917 ns 5.75
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 748709 ns 235541 ns 3.18
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 23157 ns 22778 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1175098.5 ns 1179919.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 36460.5 ns 46868 ns 0.78
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 793625 ns 143645.5 ns 5.52
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 774979 ns 130875 ns 5.92
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 776479 ns 138417 ns 5.61
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 811000 ns 290021 ns 2.80
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 209522 ns 211960 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10334508 ns 10741797 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 233752.5 ns 223578 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7167 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 5958 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6125 ns 5958.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10000 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33425 ns 33236 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1255061.5 ns 1203805 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50400 ns 57207 ns 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 261396.5 ns 221249.5 ns 1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 261479.5 ns 238542 ns 1.10
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229333 ns 264500 ns 0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 236062 ns 213250 ns 1.11
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 259516 ns 259447 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 24445196 ns 27707385 ns 0.88
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 521036 ns 530542 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 13417 ns 13209 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12271.5 ns 12166 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13875 ns 13584 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11749.5 ns 12667 ns 0.93
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 133835.5 ns 135078 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5374770 ns 5685986 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 234562 ns 227730.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23687.5 ns 23917 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24417 ns 24083.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25208 ns 24750 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24542 ns 30146 ns 0.81
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 827187.5 ns 833527 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 39569351 ns 39963084.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 678787.5 ns 615374.5 ns 1.10
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9834 ns 9271 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8979 ns 9541 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10083 ns 10375 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9146 ns 9250 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 119836.5 ns 119628 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3554576 ns 3356719.5 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 71830.5 ns 74940 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13583.5 ns 14041 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15375 ns 13958 ns 1.10
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14625 ns 14750 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14521 ns 13459 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 637852 ns 638262 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 21976253 ns 22466836 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 364954 ns 344824 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9166 ns 9666.5 ns 0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8292 ns 9208 ns 0.90
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10625 ns 10959 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9042 ns 9083.5 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 118116 ns 118521 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3346671 ns 3571671.5 ns 0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 70810.5 ns 79399 ns 0.89
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12958 ns 13416 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 14541.5 ns 12416 ns 1.17
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13583.5 ns 13479.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 15312 ns 12708 ns 1.20
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 525394.5 ns 530027 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 20373782 ns 19360325 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 338973.5 ns 317163 ns 1.07
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 30520.5 ns 30896 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 33917 ns 33813 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 32250 ns 32249.5 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1833 ns 1875 ns 0.98
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16956 ns 16425 ns 1.03
batchedmm(2, Bsize=128)/forward/GPU/oneAPI 76609280 ns 76985679 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 77851 ns 76663 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5291 ns 5417 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5709 ns 5000 ns 1.14
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5792 ns 5479.5 ns 1.06
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6917 ns 6270.5 ns 1.10
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 137577 ns 138278 ns 0.99
batchedmm(2, Bsize=128)/zygote/GPU/oneAPI 110674438 ns 109824422.5 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 381429.5 ns 340566 ns 1.12
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 25257 ns 25574 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1228478 ns 1142450 ns 1.08
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48831 ns 45666 ns 1.07
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6334 ns 6458 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7479.5 ns 6375 ns 1.17
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6959 ns 6791.5 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7209 ns 6458.5 ns 1.12
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 184635.5 ns 185923.5 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 23837580 ns 22900684.5 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 378815 ns 365402.5 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5834 ns 2084 ns 2.80
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6000 ns 2084 ns 2.88
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5958 ns 2083 ns 2.86
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5958 ns 2000 ns 2.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 25900 ns 26453 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1250594 ns 1207656 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 206832 ns 203645.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 20458 ns 18041 ns 1.13
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 23792 ns 17166.5 ns 1.39
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 22334 ns 17750 ns 1.26
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 22875 ns 23458.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 266845.5 ns 268326 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 25451299.5 ns 24994377.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 692177 ns 600702.5 ns 1.15
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 166604 ns 147875 ns 1.13
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 148104.5 ns 155437.5 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 154125 ns 155125 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 178166 ns 151708 ns 1.17
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190719 ns 190890.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7757870 ns 7974634 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 193662 ns 271146.5 ns 0.71
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1331209 ns 1321937.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1339083 ns 1330625 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1319166 ns 1308375 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1332625 ns 1285166 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 860379 ns 867140 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 45780672 ns 45331705.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1115822 ns 1006962 ns 1.11
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 26708 ns 25500 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 23334 ns 23542 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27729 ns 28708.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24333 ns 24416.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 225932.5 ns 226899 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8143765 ns 7680667 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 103591 ns 128029 ns 0.81
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 131042 ns 125062.5 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 178667 ns 165729.5 ns 1.08
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 127000 ns 125854.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 171834 ns 180062 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 990219.5 ns 998018.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 46480369 ns 44411227 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 605656 ns 568743 ns 1.06
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23167 ns 23453 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1211323.5 ns 1190116 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 48770 ns 44533 ns 1.10
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6167 ns 6895.5 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8459 ns 6458 ns 1.31
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7041 ns 6958 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6833 ns 6520.5 ns 1.05
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 201408.5 ns 201834 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 23826399 ns 23542895 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 383144 ns 372536 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6292 ns 5645.5 ns 1.11
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5291 ns 5375 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6478.5 ns 7979 ns 0.81
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5625 ns 5166 ns 1.09
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 137210.5 ns 139838.5 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5528693 ns 5619575.5 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 234353 ns 229750 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9792 ns 9958 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10500 ns 10042 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10250 ns 10417 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9833 ns 10854.5 ns 0.91
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 854499 ns 866511 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 41863490 ns 43130156 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 668967 ns 603858 ns 1.11
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 708 ns 2.30
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 708 ns 2.30
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1584 ns 750 ns 2.11
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1584 ns 667 ns 2.37
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23312 ns 22827 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2038595.5 ns 2079377 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 208892 ns 202368 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5750 ns 4834 ns 1.19
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6125 ns 4833 ns 1.27
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6208 ns 5125 ns 1.21
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5750 ns 6291 ns 0.91
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 221857.5 ns 222098 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 9884849 ns 9952955 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 583581 ns 471721 ns 1.24
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 9083 ns 8750 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7562.5 ns 7834 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9917 ns 9375 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8104 ns 7646 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 118095 ns 117939.5 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3469707 ns 3568146 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 70621 ns 74409 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8334 ns 8792 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10208 ns 8583 ns 1.19
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8667 ns 8875 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8542 ns 8083 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 565017 ns 568724.5 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 22916388 ns 20842961 ns 1.10
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 345694 ns 335106 ns 1.03
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 126854.5 ns 126042 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 129334 ns 129208 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 130000 ns 129542 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 185917 ns 180792 ns 1.03
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46980 ns 46423 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/oneAPI 72077654 ns 72616088 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 99391 ns 101850 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 334333 ns 315875 ns 1.06
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 342437 ns 334166.5 ns 1.02
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 345333.5 ns 323291.5 ns 1.07
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 611000 ns 609395.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 186967 ns 187684 ns 1.00
batchedmm(128, Bsize=4)/zygote/GPU/oneAPI 93626369 ns 93899553 ns 1.00
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 501325.5 ns 405833.5 ns 1.24
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397417 ns 397500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288084 ns 287979.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288083 ns 288375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756125 ns 756000 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44378.5 ns 43964 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1435945 ns 1424885 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 80171 ns 79439 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1455833 ns 1461000 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1133250 ns 1133834 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1127875 ns 1129645.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2439437.5 ns 2449292 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 253386 ns 254140 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 10085376 ns 11042616 ns 0.91
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 348033 ns 254646 ns 1.37
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 660041.5 ns 626500 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 640166 ns 657208.5 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 639708.5 ns 649750.5 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 651937.5 ns 642417 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 187406 ns 185720.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8211515 ns 8332264.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 232427.5 ns 264649 ns 0.88
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2450084 ns 2452625 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2484375 ns 2465208.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2446834 ns 2459375 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2445125.5 ns 2376375 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 943950 ns 949649 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 51038800.5 ns 53455476.5 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1448370.5 ns 1323598 ns 1.09
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 32750 ns 32458 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 35459 ns 36521 ns 0.97
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34937 ns 34833 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 834 ns 959 ns 0.87
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15874 ns 15902 ns 1.00
batchedmm(2, Bsize=32)/forward/GPU/oneAPI 73383109 ns 73782106 ns 0.99
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 77231 ns 74499.5 ns 1.04
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3104.5 ns 3125 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 4125 ns 3250 ns 1.27
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3437.5 ns 3375 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3292 ns 3062.5 ns 1.07
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 136649 ns 137187.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/oneAPI 98346436.5 ns 98822060.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 353034 ns 314258 ns 1.12
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1458833 ns 436500 ns 3.34
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1500667 ns 438625 ns 3.42
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1501812.5 ns 438791 ns 3.42
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1437625 ns 445917 ns 3.22
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 43141 ns 42826 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1467999 ns 1503651 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 241463 ns 374379.5 ns 0.64
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5133250 ns 4140000 ns 1.24
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5305083 ns 4271375 ns 1.24
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5266666.5 ns 4270687.5 ns 1.23
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5012583 ns 5468750 ns 0.92
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 234347 ns 236201.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36702447 ns 36248116 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1231083 ns 1135862 ns 1.08
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3792 ns 3750 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3791 ns 3791 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 35135 ns 34158 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1216141 ns 1274307 ns 0.95
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 39590 ns 41117 ns 0.96
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15333 ns 15375 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 16125 ns 15334 ns 1.05
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15584 ns 15500 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15334 ns 15250 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 257091.5 ns 255579 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 8614070 ns 8309435 ns 1.04
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 171612 ns 158606 ns 1.08
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404458 ns 404792 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 295917 ns 295917 ns 1
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296334 ns 295958 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760750 ns 759750 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 114125 ns 113245 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1028506.5 ns 1043498 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 89181 ns 91962 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1475687.5 ns 1482854 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1156625 ns 1158625 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1152792 ns 1150334 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2466666.5 ns 2466708 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 233821 ns 236768.5 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 12310672 ns 9725420.5 ns 1.27
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 353244 ns 298578 ns 1.18
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1042 ns 584 ns 1.78
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 625 ns 1.73
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1042 ns 584 ns 1.78
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 542 ns 2.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 25063 ns 25569 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1050546 ns 1198679 ns 0.88
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 208912 ns 202679 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8334 ns 8083 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10250 ns 7792 ns 1.32
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8625 ns 8375 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8458 ns 8437.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 206386 ns 207068.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 26827043 ns 25228707 ns 1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 684852.5 ns 593474 ns 1.15
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 833625 ns 829375 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 616583.5 ns 617667 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 618792 ns 618667 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1443834 ns 1544417 ns 0.93
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130567 ns 130866 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/oneAPI 74402337 ns 74874331.5 ns 0.99
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 165932 ns 211214 ns 0.79
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2682542 ns 2686104.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1999375 ns 1994542 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1997459 ns 1998375 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4921208 ns 4960479 ns 0.99
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 234980 ns 234509 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/oneAPI 102325987.5 ns 102181218 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 857239 ns 831293.5 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 334 ns 250 ns 1.34
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32616 ns 32562 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1250784 ns 1276503 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 49801 ns 48691 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6292 ns 6333 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8208 ns 6375 ns 1.29
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6583 ns 6667 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6625 ns 6104.5 ns 1.09
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 221633 ns 227701 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19957066 ns 21756022 ns 0.92
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 361473 ns 346728 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1788479 ns 1760625 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1743999.5 ns 1749875 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1729500 ns 1744292 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1754416.5 ns 1755166 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 188571.5 ns 189332 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7643974 ns 7765672 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 354044 ns 413433 ns 0.86
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4362854 ns 4360416 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4448958 ns 4366917 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4373500 ns 4349104 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4374708 ns 5705104 ns 0.77
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 845544 ns 849205 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48180808 ns 48802559 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1251573.5 ns 1205562.5 ns 1.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7084 ns 9604 ns 0.74
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6959 ns 6916 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7250 ns 8208 ns 0.88
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6875 ns 6854 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 22661 ns 22924.5 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1150024 ns 1184238.5 ns 0.97
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 36771 ns 46437 ns 0.79
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 48958 ns 50604.5 ns 0.97
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 71583 ns 52166 ns 1.37
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 32959 ns 45458.5 ns 0.73
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 45416 ns 33312.5 ns 1.36
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 209801 ns 211538 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10775850 ns 10576796.5 ns 1.02
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 220142 ns 226508 ns 0.97
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 22145.5 ns 21646 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 26041 ns 26083.5 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 24917 ns 24958.5 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5417 ns 5291.5 ns 1.02
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18543 ns 18121 ns 1.02
batchedmm(2, Bsize=512)/forward/GPU/oneAPI 87577603.5 ns 88732630 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 89291 ns 73668 ns 1.21
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 11896 ns 12125 ns 0.98
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10875 ns 10667 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10895.5 ns 10833 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18166 ns 18042 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 221059 ns 221707 ns 1.00
batchedmm(2, Bsize=512)/zygote/GPU/oneAPI 150298637 ns 148404121 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 380694 ns 322703 ns 1.18
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406375 ns 405917 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 297166 ns 296791.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296458 ns 297167 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 757291.5 ns 756709 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 47570 ns 46696 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1357729 ns 1393570.5 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 89771 ns 90770 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1484417 ns 1487375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1165583 ns 1163500 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1160604 ns 1157209 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2473104 ns 2472417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 284069.5 ns 283340.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 13947031.5 ns 11947586 ns 1.17
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 374114 ns 269032 ns 1.39
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1485042 ns 436458 ns 3.40
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1528500 ns 443270.5 ns 3.45
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1528542 ns 440750 ns 3.47
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1464167 ns 449000 ns 3.26
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 54497 ns 53940 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1041173 ns 1027722 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 236502 ns 323133 ns 0.73
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5136792 ns 4138541 ns 1.24
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5286271 ns 4268354.5 ns 1.24
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5287958 ns 4258750 ns 1.24
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4972333.5 ns 5475229.5 ns 0.91
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 257449 ns 255597 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31450700.5 ns 31502698.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1207368 ns 1132896.5 ns 1.07
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28292 ns 9333 ns 3.03
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28875 ns 8000 ns 3.61
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28375 ns 8000 ns 3.55
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28375 ns 13250 ns 2.14
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 25060 ns 23885 ns 1.05
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2164878.5 ns 1973050 ns 1.10
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 212672 ns 202528 ns 1.05
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66709 ns 49625 ns 1.34
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66542 ns 49667 ns 1.34
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67875 ns 49583 ns 1.37
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66625 ns 71667 ns 0.93
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 339582 ns 336641 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 12629760 ns 13058534 ns 0.97
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 650592 ns 508895.5 ns 1.28
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 111167 ns 108270.5 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 90500 ns 86167 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 86500 ns 86500 ns 1
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 122542 ns 146083 ns 0.84
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192161 ns 192063 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5883194 ns 5750624 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 185492 ns 267851 ns 0.69
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2014792 ns 2018917 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2027520.5 ns 2016937.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2013916 ns 2011375 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1895917 ns 2024000.5 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 515387 ns 511598 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 26786371 ns 30563079 ns 0.88
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 948350 ns 860237 ns 1.10

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal changed the title from "fix: remove certain LV usage" to "refactor: move JuliaSIMD deps to extensions" Oct 17, 2024
@avik-pal avik-pal force-pushed the ap/segfault branch 2 times, most recently from 68fc1b3 to 1c7ac61 Compare October 17, 2024 20:22
@avik-pal avik-pal linked an issue Oct 17, 2024 that may be closed by this pull request
@avik-pal avik-pal force-pushed the ap/segfault branch 4 times, most recently from 6634a80 to 27fa286 Compare October 17, 2024 23:08
@avik-pal
Member Author

xref EnzymeAD/Enzyme.jl#1983

@avik-pal avik-pal changed the title refactor: move JuliaSIMD deps to extensions refactor: move JuliaSIMD deps to extensions Oct 18, 2024
@avik-pal avik-pal merged commit 98a2d7a into main Oct 18, 2024
56 of 61 checks passed
@avik-pal avik-pal deleted the ap/segfault branch October 18, 2024 18:04

Successfully merging this pull request may close these issues.

Segfault for simple Zygote pullback