This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
refactor: move JuliaSIMD deps to extensions #175
Merged
LuxLib Benchmarks
Benchmark suite | Current: d2f76dd | Previous: 604783f | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5500 ns | 5375 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5084 ns | 5250 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5625.5 ns | 7708.5 ns | 0.73 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5708 ns | 5416 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 103933 ns | 113361 ns | 0.92 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2722154 ns | 2795172 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 413664 ns | 601544 ns | 0.69 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10083 ns | 9729.5 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10208.5 ns | 9938 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10021 ns | 10167 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10104.5 ns | 11063 ns | 0.91 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 536028 ns | 544547 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19483150 ns | 17852957 ns | 1.09 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 683066 ns | 629346 ns | 1.09 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1458 ns | 1500 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1667 ns | 1458 ns | 1.14 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1771 ns | 1771 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 2292 ns | 1583 ns | 1.45 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 20317 ns | 20770 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1310053 ns | 1342503 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 31070.5 ns | 30997 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4020.5 ns | 4104 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4042 ns | 4500 ns | 0.90 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4604 ns | 4500 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4125 ns | 4333 ns | 0.95 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 132014.5 ns | 134970 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 8955699 ns | 8677498 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 145462 ns | 138579 ns | 1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57750 ns | 57666.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46917 ns | 46875 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46791 ns | 47125 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82542 ns | 81458 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36903 ns | 36587 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 570577 ns | 582336 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 79861 ns | 69420 ns | 1.15 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2033166 ns | 2030375 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2090875 ns | 2088625 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2087542 ns | 2086625 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1999667 ns | 1998562 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 217109 ns | 217216 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 7863243 ns | 8077777 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 956020 ns | 930850 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 148333 ns | 175083 ns | 0.85 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144145.5 ns | 147291 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 151208 ns | 150021 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 177166 ns | 151750 ns | 1.17 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166653.5 ns | 166825 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7677064 ns | 7358467.5 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 177102 ns | 262570 ns | 0.67 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1118667 ns | 1115103.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1120208 ns | 1110771 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1117479.5 ns | 1113771 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1119312 ns | 1136250 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 616442 ns | 639845.5 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31939588 ns | 33057102 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1022810.5 ns | 864075 ns | 1.18 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6417 ns | 3792 ns | 1.69 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4209 ns | 4479 ns | 0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5083 ns | 6583 ns | 0.77 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3917 ns | 6375 ns | 0.61 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 79406.5 ns | 85209.5 ns | 0.93 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5085855.5 ns | 5875726.5 ns | 0.87 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 69810 ns | 59531 ns | 1.17 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8500 ns | 8417 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8458 ns | 8750 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8541 ns | 9042 ns | 0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8292 ns | 8958 ns | 0.93 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 532320 ns | 557500.5 ns | 0.95 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 35070703.5 ns | 34838164 ns | 1.01 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 371294 ns | 370833 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17458 ns | 17958 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18000 ns | 16458 ns | 1.09 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19687.5 ns | 21125 ns | 0.93 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16979.5 ns | 17292 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 62297 ns | 63776.5 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3097027 ns | 2927491.5 ns | 1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78650 ns | 82870 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220625 ns | 212625 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213209 ns | 213042 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 218833 ns | 212771 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213708 ns | 212291 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 324224 ns | 329859 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 14318744 ns | 12611094 ns | 1.14 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 466149 ns | 405232 ns | 1.15 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 667 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 833 ns | 875 ns | 0.95 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 625 ns | 709 ns | 0.88 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 18901 ns | 19101 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1163594 ns | 1145778 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 33020 ns | 26409 ns | 1.25 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1417 ns | 1458 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1416 ns | 1334 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1584 ns | 1583 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1375 ns | 1375 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 114703 ns | 117126.5 ns | 0.98 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9022001 ns | 8850213 ns | 1.02 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 131551 ns | 115676 ns | 1.14 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7375 ns | 1 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 6041 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6084 ns | 6084 ns | 1 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10042 ns | 9958 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23484.5 ns | 23587 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1176408.5 ns | 1261233 ns | 0.93 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49200 ns | 52723 ns | 0.93 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 259375 ns | 229167 ns | 1.13 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 262167 ns | 230667 ns | 1.14 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 266896 ns | 267875 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 226813 ns | 257458 ns | 0.88 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 187535 ns | 182744 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 30816922.5 ns | 32590762.5 ns | 0.95 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 601896 ns | 548449.5 ns | 1.10 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3958 ns | 3917 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3958 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3916 ns | 3917 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23548 ns | 22860 ns | 1.03 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 1971367 ns | 1933593 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 48141 ns | 39504 ns | 1.22 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 17375 ns | 17042 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16792 ns | 16875 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17084 ns | 17083 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16917 ns | 16875 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 185511 ns | 185787.5 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 12296122.5 ns | 10029430 ns | 1.23 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 174551.5 ns | 162052 ns | 1.08 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 934166.5 ns | 491583 ns | 1.90 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 835667 ns | 385625 ns | 2.17 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 846958 ns | 386458 ns | 2.19 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 1262937.5 ns | 844083 ns | 1.50 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113311 ns | 113763 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 400768 ns | 418213 ns | 0.96 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 243532 ns | 388657 ns | 0.63 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2602833.5 ns | 2155583 ns | 1.21 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2321062.5 ns | 1863374.5 ns | 1.25 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2334542 ns | 1865167 ns | 1.25 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3549145.5 ns | 3377520.5 ns | 1.05 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 227551 ns | 229580 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 12523677 ns | 9922983 ns | 1.26 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 741438 ns | 610962 ns | 1.21 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6416.5 ns | 6500 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5500 ns | 1.08 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7270.5 ns | 7667 ns | 0.95 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6229 ns | 5167 ns | 1.21 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 84061.5 ns | 84720.5 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5499782 ns | 5300415 ns | 1.04 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 57281 ns | 59932 ns | 0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11333.5 ns | 11229 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11667 ns | 11395.5 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11625 ns | 12334 ns | 0.94 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11271 ns | 10667 ns | 1.06 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 597623.5 ns | 602168 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 39133306 ns | 38613143.5 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 397753.5 ns | 383917 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23645.5 ns | 23328 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2233296 ns | 2178076 ns | 1.03 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 46720 ns | 41367 ns | 1.13 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2084 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2209 ns | 2166 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2167 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2084 ns | 2084 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 220079.5 ns | 228927.5 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 11462778.5 ns | 11774524 ns | 0.97 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 170107 ns | 165900 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8500 ns | 9584 ns | 0.89 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8708 ns | 8333 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10791.5 ns | 9895.5 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8542 ns | 8542 ns | 1 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 94772.5 ns | 105241 ns | 0.90 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3385215 ns | 3103348.5 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 73640 ns | 71955 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17625 ns | 17688 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 18000 ns | 16666.5 ns | 1.08 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18875 ns | 18708 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17375 ns | 17562 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 540650 ns | 595171 ns | 0.91 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 17058901 ns | 16252508 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 377674 ns | 358129 ns | 1.05 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 541 ns | 542 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 458 ns | 1.36 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 583 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 458 ns | 1.09 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 34202 ns | 34578 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1244726 ns | 1237584 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 46001 ns | 41387 ns | 1.11 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9479 ns | 9229 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9542 ns | 8958.5 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10166 ns | 9750 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8666.5 ns | 8104 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 244744 ns | 257823 ns | 0.95 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18753058.5 ns | 18331589 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 364583 ns | 349944 ns | 1.04 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397209 ns | 397270.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288125 ns | 288083 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288083 ns | 288666.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 745000 ns | 751792 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111800.5 ns | 112022 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 333937 ns | 349915 ns | 0.95 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 74491 ns | 74609 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1461604 ns | 1454270.5 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1130625 ns | 1130500 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1130250 ns | 1131583 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2406792 ns | 2437959 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 197620.5 ns | 200057 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10419288 ns | 7687949 ns | 1.36 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 322733 ns | 302285 ns | 1.07 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6875 ns | 7750 ns | 0.89 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6979 ns | 7083.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8791.5 ns | 8312.5 ns | 1.06 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6708.5 ns | 6687.5 ns | 1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 132998 ns | 139766 ns | 0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5954726 ns | 5685169 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 58080 ns | 60383 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14625 ns | 13479.5 ns | 1.08 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13958.5 ns | 12750 ns | 1.09 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15833.5 ns | 15125 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14854.5 ns | 14625.5 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 894545 ns | 923489 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 44910299 ns | 42519536.5 ns | 1.06 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 416034 ns | 407432 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 26042 ns | 25625 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 27625 ns | 23666 ns | 1.17 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 26166 ns | 29417 ns | 0.89 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24584 ns | 24041 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 184672 ns | 186240.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7678081.5 ns | 7554376 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 115321 ns | 120505 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 152750 ns | 152187 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 149312.5 ns | 145250 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 104396 ns | 146917 ns | 0.71 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 146333 ns | 103958 ns | 1.41 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1006389 ns | 1013659 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42603061 ns | 44493070 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 580066 ns | 535240 ns | 1.08 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 84333 ns | 74583 ns | 1.13 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 79333.5 ns | 79584 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 76625 ns | 76791.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 75208 ns | 76083 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 189509.5 ns | 190594.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7761115 ns | 7364811 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 124941 ns | 121316.5 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 280666.5 ns | 273562.5 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 290479 ns | 304084 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 269708 ns | 303333 ns | 0.89 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 282833.5 ns | 307583 ns | 0.92 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1016176 ns | 1045024 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42923658 ns | 39473308 ns | 1.09 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 694507 ns | 624192 ns | 1.11 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 13041 ns | 12417 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 12875 ns | 12896 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14229 ns | 14000 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12667 ns | 12500 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 135169.5 ns | 138416 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5673569 ns | 5479910 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 231712.5 ns | 226152 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 27062.5 ns | 27792 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26854 ns | 26458 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27917 ns | 28437.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 27416.5 ns | 33937.5 ns | 0.81 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 904285 ns | 924126.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41716960 ns | 42086872 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 692027 ns | 610976 ns | 1.13 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 14041 ns | 11124.5 ns | 1.26 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 13792 ns | 10333 ns | 1.33 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 16604 ns | 12479.5 ns | 1.33 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 14000 ns | 11125 ns | 1.26 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 117799 ns | 118543.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3485511 ns | 3443799.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 239113 ns | 233176 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25542 ns | 22291.5 ns | 1.15 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26979 ns | 22417 ns | 1.20 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27104.5 ns | 24167 ns | 1.12 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 25875 ns | 28562.5 ns | 0.91 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 660016 ns | 668341 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21973748 ns | 21034051 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 670591.5 ns | 569113 ns | 1.18 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 186729 ns | 68709 ns | 2.72 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 181959 ns | 62750 ns | 2.90 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 185145.5 ns | 67520.5 ns | 2.74 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 185000 ns | 64417 ns | 2.87 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 101640.5 ns | 102389 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3356748 ns | 3441143 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 233872 ns | 230751 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 584292 ns | 506375 ns | 1.15 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 590792 ns | 510167 ns | 1.16 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 590750.5 ns | 475209 ns | 1.24 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 590667 ns | 647896 ns | 0.91 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 489282 ns | 492781 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20715079 ns | 20664230 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 710517 ns | 593680 ns | 1.20 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7958 ns | 0.93 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7271 ns | 6750 ns | 1.08 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8375 ns | 8208 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7374.5 ns | 7562.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 134121.5 ns | 137965 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5728586.5 ns | 5508177.5 ns | 1.04 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 57391 ns | 62687 ns | 0.92 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13125 ns | 16125 ns | 0.81 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14417 ns | 16250 ns | 0.89 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14333 ns | 16250 ns | 0.88 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 15625 ns | 14833 ns | 1.05 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 874264.5 ns | 900927 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 38435171 ns | 39349971 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 395809 ns | 388286 ns | 1.02 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6152167 ns | 6150354 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 6375812.5 ns | 6368167 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6376958 ns | 6373937.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11905750 ns | 11915167 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 349172 ns | 345749 ns | 1.01 |
batchedmm(512, Bsize=4)/forward/GPU/oneAPI | 52345310.5 ns | 49052559 ns | 1.07 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 303913 ns | 388426 ns | 0.78 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19121687.5 ns | 19083437.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 19964166 ns | 19960479.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 19966167 ns | 19966834 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36940917 ns | 37142104 ns | 0.99 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1051854 ns | 1072087 ns | 0.98 |
batchedmm(512, Bsize=4)/zygote/GPU/oneAPI | 77616747 ns | 78467188 ns | 0.99 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1165782 ns | 1035750.5 ns | 1.13 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1833 ns | 958 ns | 1.91 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1833 ns | 1000 ns | 1.83 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1875 ns | 1042 ns | 1.80 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1792 ns | 958 ns | 1.87 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23638 ns | 23415 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1954095 ns | 2079171 ns | 0.94 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 209392 ns | 200906 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4792 ns | 3917 ns | 1.22 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4917 ns | 4000 ns | 1.23 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5000 ns | 4041 ns | 1.24 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4916 ns | 5458 ns | 0.90 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 272861 ns | 270573.5 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10678638 ns | 10484095 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 619377 ns | 486775 ns | 1.27 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7771 ns | 8687 ns | 0.89 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7666 ns | 7459 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9187 ns | 9334 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7604.5 ns | 7834 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 114805.5 ns | 116220 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3374772.5 ns | 3435001.5 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 67791 ns | 71133 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11625 ns | 12125 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11959 ns | 11958 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12416 ns | 13000 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11750 ns | 11750 ns | 1 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 596992 ns | 609643.5 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22147198 ns | 21784602 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 351603 ns | 341729 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22715 ns | 22413 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2095326 ns | 2035110 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 47561 ns | 44053 ns | 1.08 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 3333 ns | 3000 ns | 1.11 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2917 ns | 2917 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3250 ns | 3208 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 3042 ns | 2916 ns | 1.04 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 193310 ns | 194923.5 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 10025999 ns | 9225861.5 ns | 1.09 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 158052 ns | 154488.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 14625.5 ns | 11625 ns | 1.26 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 14167 ns | 10500 ns | 1.35 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 16542 ns | 12875 ns | 1.28 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 14000 ns | 11875 ns | 1.18 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 114348.5 ns | 115370 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3449562.5 ns | 3433218 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 236323 ns | 231793 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25542 ns | 22667 ns | 1.13 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25709 ns | 22104.5 ns | 1.16 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 26500 ns | 23625 ns | 1.12 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25646 ns | 26729 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 549754 ns | 555861 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20207967.5 ns | 20482208 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 649557 ns | 545740 ns | 1.19 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4167 ns | 4334 ns | 0.96 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4208 ns | 4333 ns | 0.97 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4209 ns | 4208 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4209 ns | 4250 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24817 ns | 23923 ns | 1.04 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2004466.5 ns | 2205811 ns | 0.91 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 48320 ns | 44864 ns | 1.08 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16167 ns | 16500 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16625 ns | 16333 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16333 ns | 16166 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16208 ns | 16292 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 315841 ns | 319806 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 12092472 ns | 10190777 ns | 1.19 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 205562.5 ns | 186077 ns | 1.10 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5708 ns | 2125 ns | 2.69 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6000 ns | 2084 ns | 2.88 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5916 ns | 2209 ns | 2.68 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5958 ns | 2000 ns | 2.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 34686 ns | 35327 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1229247 ns | 1213779 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 206283 ns | 199242 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 21291 ns | 17104 ns | 1.24 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 22459 ns | 20167 ns | 1.11 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21979.5 ns | 19000 ns | 1.16 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 19750 ns | 23083.5 ns | 0.86 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 280754 ns | 284984 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19480504 ns | 18211018 ns | 1.07 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 682467 ns | 583431 ns | 1.17 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 59375 ns | 59458 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 65187.5 ns | 65666 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 65875 ns | 66125 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51292 ns | 52833 ns | 0.97 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66425.5 ns | 66304 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/oneAPI | 87165834 ns | 87707222.5 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 97421 ns | 110241 ns | 0.88 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 192313 ns | 153041 ns | 1.26 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 164333.5 ns | 155229 ns | 1.06 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 159791.5 ns | 130209 ns | 1.23 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 316000 ns | 286334 ns | 1.10 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 207817.5 ns | 210129.5 ns | 0.99 |
batchedmm(16, Bsize=512)/zygote/GPU/oneAPI | 150011863.5 ns | 149924497 ns | 1.00 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 568915 ns | 511145 ns | 1.11 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 83750 ns | 106521 ns | 0.79 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 123375 ns | 78958 ns | 1.56 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 85916 ns | 84042 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 91500 ns | 115521 ns | 0.79 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192420 ns | 191513.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5523456 ns | 5334020 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 169002 ns | 267630 ns | 0.63 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1886125 ns | 1894896 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1898541.5 ns | 1902375 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1924000 ns | 1878334 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1918083 ns | 1895250 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 504156 ns | 507442 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 26851109 ns | 28152566.5 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 921499.5 ns | 825763 ns | 1.12 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21670 ns | 21516 ns | 1.01 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2049467 ns | 2100524 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 40101 ns | 35507 ns | 1.13 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1834 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1834 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1834 ns | 1833 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 240558 ns | 245735 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 10170156.5 ns | 9780504 ns | 1.04 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 177582 ns | 164548 ns | 1.08 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8958.5 ns | 10916 ns | 0.82 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8666.5 ns | 8291 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11458 ns | 11146 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8354.5 ns | 9500 ns | 0.88 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 113347 ns | 114788 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3369620 ns | 3351587 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 234943 ns | 232004 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10000 ns | 8916 ns | 1.12 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9958 ns | 8854.5 ns | 1.12 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10541 ns | 10917 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9833 ns | 9583 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 486676.5 ns | 491693 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19572305 ns | 19969043 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 630451.5 ns | 536332 ns | 1.18 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57750 ns | 57958 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47417 ns | 46625 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46708 ns | 46750 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81042 ns | 83166 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38230 ns | 38476.5 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1397410 ns | 1460287 ns | 0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 77240 ns | 71814 ns | 1.08 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1885750 ns | 1905145.5 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1962125 ns | 1949542 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1974167 ns | 1958500 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1846708.5 ns | 1874958 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 209291 ns | 212675 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 32347400 ns | 33332615 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1006711 ns | 968925.5 ns | 1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 269437.5 ns | 267500 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 267604 ns | 271479.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 267792 ns | 271209 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 267125 ns | 268209 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 192943 ns | 194219.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7708822 ns | 7638787 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 283363 ns | 271267 ns | 1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 585041 ns | 585333.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 596916.5 ns | 600292 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 613750 ns | 671042 ns | 0.91 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 587542 ns | 845604.5 ns | 0.69 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 984053 ns | 991966 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45660970.5 ns | 42952243 ns | 1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 897330 ns | 831153 ns | 1.08 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2242875 ns | 2211666 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2212687.5 ns | 2203958 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2178500 ns | 2229083 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2203687 ns | 2173792 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 159048 ns | 161646 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7948607 ns | 8668502.5 ns | 0.92 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 410544 ns | 470965 ns | 0.87 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5304459 ns | 5493104.5 ns | 0.97 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5589291.5 ns | 5515875 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5525312.5 ns | 5526542 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5503750 ns | 6852458 ns | 0.80 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 923591 ns | 959137 ns | 0.96 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50777266 ns | 49532486 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1712757.5 ns | 1437405 ns | 1.19 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 988917 ns | 478292 ns | 2.07 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 915208 ns | 345625 ns | 2.65 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 913708 ns | 346750 ns | 2.64 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 1332562.5 ns | 908542 ns | 1.47 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46427 ns | 46909 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 877905.5 ns | 871386 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 242262 ns | 393175 ns | 0.62 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2614104.5 ns | 2137500 ns | 1.22 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2325354 ns | 1869334 ns | 1.24 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2332000 ns | 1859271 ns | 1.25 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3535541.5 ns | 3380209 ns | 1.05 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 243101.5 ns | 264095.5 ns | 0.92 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 15142156 ns | 13390420 ns | 1.13 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 767673 ns | 632907.5 ns | 1.21 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57667 ns | 57458 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46667 ns | 46166 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46375 ns | 46250 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82541 ns | 78667 ns | 1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28222 ns | 28560 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1376753 ns | 1394875.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 74791 ns | 73147 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2027458.5 ns | 2029292 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2099250.5 ns | 2078187.5 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2083875 ns | 2063250 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2005208 ns | 1963958 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 222327 ns | 230846.5 ns | 0.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35859255 ns | 36347331 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1037291 ns | 980522 ns | 1.06 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58041 ns | 58083.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47292 ns | 46584 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46833 ns | 46917 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 79458 ns | 79958 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48075 ns | 48944 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 799306 ns | 829446 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 68641 ns | 71428.5 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1892042 ns | 1871729 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1961875 ns | 1973604 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1967292 ns | 1944167 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1849541.5 ns | 1876792 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 229908 ns | 238010 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 17840443.5 ns | 18705710.5 ns | 0.95 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 909929 ns | 881607.5 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 291 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 33977 ns | 34878 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1467230.5 ns | 1190778.5 ns | 1.23 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 47440 ns | 47028 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6834 ns | 6270.5 ns | 1.09 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7000 ns | 6187.5 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7583.5 ns | 7375 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6375 ns | 6125 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 199504 ns | 211705.5 ns | 0.94 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19180704 ns | 20119098 ns | 0.95 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 364834 ns | 332741 ns | 1.10 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32545 ns | 32902 ns | 0.99 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1210755.5 ns | 1224139 ns | 0.99 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 37950 ns | 36327 ns | 1.04 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 3083 ns | 2667 ns | 1.16 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3292 ns | 2667 ns | 1.23 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3042 ns | 4292 ns | 0.71 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2917 ns | 3167 ns | 0.92 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 181917.5 ns | 187662.5 ns | 0.97 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 7985622 ns | 5673429 ns | 1.41 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 144651.5 ns | 136635 ns | 1.06 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1411437.5 ns | 467208 ns | 3.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1417562.5 ns | 469417 ns | 3.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1418583 ns | 466875 ns | 3.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1409854.5 ns | 464979.5 ns | 3.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 135301 ns | 137312 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5882962.5 ns | 5812904.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 323274 ns | 361475 ns | 0.89 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5017125 ns | 4027749.5 ns | 1.25 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5053042 ns | 4071500 ns | 1.24 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5029562.5 ns | 4067417 ns | 1.24 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4977854 ns | 5516750 ns | 0.90 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 669461.5 ns | 690445 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31806387 ns | 32063716 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1465840.5 ns | 1091915 ns | 1.34 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49837750 ns | 49879250 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35538000 ns | 35487583 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35513667 ns | 35512833.5 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 98403875 ns | 96974083 ns | 1.01 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1620153 ns | 1622377 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/oneAPI | 56126123 ns | 55868634.5 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1047301 ns | 1579230 ns | 0.66 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154556249.5 ns | 154423062.5 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112408270.5 ns | 112364750 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112220083 ns | 112377416 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 299786083 ns | 299989812 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6499320.5 ns | 6468945 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/oneAPI | 126387633 ns | 126761495 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5555408 ns | 7230228 ns | 0.77 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47417 ns | 19104.5 ns | 2.48 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 48541 ns | 18375 ns | 2.64 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 48041.5 ns | 17375.5 ns | 2.76 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47500 ns | 15083 ns | 3.15 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 20075 ns | 19621 ns | 1.02 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI | 1215909.5 ns | 1223248 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 25930 ns | 28854 ns | 0.90 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50917 ns | 11062.5 ns | 4.60 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50521 ns | 8833 ns | 5.72 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50854.5 ns | 9291 ns | 5.47 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 51020.5 ns | 17667 ns | 2.89 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 249429.5 ns | 252067.5 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI | 10392543.5 ns | 9844493 ns | 1.06 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 145712 ns | 138484 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8416 ns | 7937.5 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8125 ns | 8125 ns | 1 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10333 ns | 10375 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7959 ns | 8708 ns | 0.91 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 119433.5 ns | 120230.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3587986.5 ns | 3557828.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 234342 ns | 235119 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9916 ns | 9708 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11000 ns | 9084 ns | 1.21 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11084 ns | 9792 ns | 1.13 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9959 ns | 10667 ns | 0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 588933 ns | 599437 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 21671588 ns | 22720103 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 652456 ns | 557070 ns | 1.17 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
8875 ns |
9291.5 ns |
0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
8833 ns |
8812.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
11375 ns |
9917 ns |
1.15 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
8750 ns |
8958.5 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
116580 ns |
118821 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3383763 ns |
3465548.5 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
70650 ns |
71593 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
15499.5 ns |
13687.5 ns |
1.13 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
14833 ns |
13604.5 ns |
1.09 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
15125.5 ns |
14395.5 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
14083.5 ns |
14750 ns |
0.95 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
561017 ns |
570663 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
19949150 ns |
20121784.5 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
344553 ns |
323504 ns |
1.07 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
958 ns |
542 ns |
1.77 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
1083 ns |
625 ns |
1.73 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
1083 ns |
584 ns |
1.85 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
1042 ns |
500 ns |
2.08 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
34592 ns |
35088 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1144490 ns |
1218149.5 ns |
0.94 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
206327.5 ns |
203871 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8458 ns |
7562.5 ns |
1.12 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9500 ns |
7667 ns |
1.24 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8917 ns |
7875 ns |
1.13 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8167 ns |
8520.5 ns |
0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
228959.5 ns |
227876 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
23249864.5 ns |
22566032 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
656487 ns |
569945 ns |
1.15 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
23250 ns |
16458 ns |
1.41 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
23666 ns |
17041 ns |
1.39 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
24167 ns |
16209 ns |
1.49 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
23542 ns |
10979 ns |
2.14 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
20583 ns |
20941 ns |
0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI |
1139709 ns |
1150830 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
188752 ns |
182992 ns |
1.03 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
54437 ns |
35666 ns |
1.53 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
52500 ns |
35167 ns |
1.49 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
53667 ns |
36000 ns |
1.49 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
53125 ns |
57833 ns |
0.92 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
261115 ns |
265749 ns |
0.98 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
10843834 ns |
12188303 ns |
0.89 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
590356 ns |
534293 ns |
1.10 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1410938 ns |
447500 ns |
3.15 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1407250 ns |
488042 ns |
2.88 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1404958.5 ns |
455709 ns |
3.08 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1440583 ns |
496916 ns |
2.90 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
195363 ns |
195513 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5637745 ns |
5997948.5 ns |
0.94 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
345038.5 ns |
328714 ns |
1.05 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5017125 ns |
4024209 ns |
1.25 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5068833 ns |
4055021 ns |
1.25 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5019833.5 ns |
4053917 ns |
1.24 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4892041.5 ns |
5501562.5 ns |
0.89 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
516054 ns |
521631.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
28647251.5 ns |
27256015 ns |
1.05 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1200248 ns |
1059038 ns |
1.13 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
828633875 ns |
836727208 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
550214334 ns |
553913292 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
540750875 ns |
540736625 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
1588891625.5 ns |
1517196875 ns |
1.05 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22557910 ns |
22767789 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/GPU/oneAPI |
176100339 ns |
174930068 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU |
14645884 ns |
10331681 ns |
1.42 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
3850981209 ns |
3773348667 ns |
1.02 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
1775008333 ns |
1782084291 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
2246599000 ns |
1780399750 ns |
1.26 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
6357143209 ns |
4786718666 ns |
1.33 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
118713618 ns |
118657187 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/oneAPI |
987258982 ns |
1332561794 ns |
0.74 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU |
87244864 ns |
67063298 ns |
1.30 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
78500 ns |
76542 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
76708 ns |
76584 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
79542 ns |
79583 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
77541 ns |
76708.5 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
195507.5 ns |
195943.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
5538798 ns |
5455658.5 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
107411.5 ns |
123300.5 ns |
0.87 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
279395.5 ns |
191292 ns |
1.46 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
194208.5 ns |
252042 ns |
0.77 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
266709 ns |
199562.5 ns |
1.34 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
194416.5 ns |
225542 ns |
0.86 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1001402 ns |
1004442 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
42684248 ns |
43458500 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
631486 ns |
590764 ns |
1.07 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
199442208.5 ns |
199694520.5 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
138679541 ns |
138856500 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
139099334 ns |
139241166 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
394971583 ns |
393790959 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5839594 ns |
5842492 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/oneAPI |
78661896 ns |
78913006.5 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU |
3608983 ns |
4746717.5 ns |
0.76 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
617352375.5 ns |
617676375.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
438426875 ns |
439446917 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
439509500 ns |
439765166.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
1199223417 ns |
1174222000 ns |
1.02 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
26592101 ns |
26723523 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/oneAPI |
285808418.5 ns |
276392509 ns |
1.03 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU |
21811069 ns |
15854720 ns |
1.38 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7334 ns |
7292 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6167 ns |
6125 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5958 ns |
5959 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9917 ns |
9834 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
26841 ns |
26896.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1234360 ns |
1173091 ns |
1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
46450 ns |
55173 ns |
0.84 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213895.5 ns |
213041.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
266937.5 ns |
227729 ns |
1.17 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
222916 ns |
220416.5 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
208083 ns |
206125 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
218249 ns |
219868 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
19255466 ns |
20153337 ns |
0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
528005 ns |
541982 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
8250 ns |
8521 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
7708 ns |
7458 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
9500 ns |
11167 ns |
0.85 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
6834 ns |
9250 ns |
0.74 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
114813.5 ns |
115361 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3509204 ns |
3392154.5 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
70165.5 ns |
74069 ns |
0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9042 ns |
7562.5 ns |
1.20 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8500 ns |
7958 ns |
1.07 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8875 ns |
8167 ns |
1.09 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8375 ns |
7395.5 ns |
1.13 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
492349 ns |
495697 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
19405923 ns |
20965461 ns |
0.93 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
317353 ns |
309298 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
417 ns |
417 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
542 ns |
459 ns |
1.18 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
500 ns |
500 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
375 ns |
1.45 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
25659.5 ns |
26124 ns |
0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
1207404 ns |
1243719 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
46461 ns |
45334 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9729.5 ns |
9584 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9917 ns |
9062.5 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9999.5 ns |
9792 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9291.5 ns |
9542 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
247915 ns |
247606 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
22484624 ns |
24899790.5 ns |
0.90 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
383749 ns |
382304 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
351458 ns |
112312.5 ns |
3.13 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
351708 ns |
103229 ns |
3.41 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
353417 ns |
104104.5 ns |
3.39 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
353583 ns |
155083 ns |
2.28 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
23647 ns |
23501 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI |
817717.5 ns |
811475 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
188532 ns |
192539 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
779374.5 ns |
536562 ns |
1.45 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
778292 ns |
554250 ns |
1.40 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
785770.5 ns |
535291.5 ns |
1.47 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
816333.5 ns |
910854 ns |
0.90 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
219037 ns |
221242 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
12364678.5 ns |
11751092 ns |
1.05 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
605051.5 ns |
560216.5 ns |
1.08 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
4937.5 ns |
5416.5 ns |
0.91 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
5708 ns |
6208.5 ns |
0.92 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
6458 ns |
6021 ns |
1.07 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
6417 ns |
4000 ns |
1.60 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
17947.5 ns |
17520 ns |
1.02 |
batchedmm(16, Bsize=32)/forward/GPU/oneAPI |
73120248 ns |
72849606 ns |
1.00 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU |
77421 ns |
73648 ns |
1.05 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
12313 ns |
11562.5 ns |
1.06 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
11646 ns |
11062 ns |
1.05 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
11542 ns |
11000 ns |
1.05 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
17125 ns |
16666 ns |
1.03 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
206798 ns |
207455.5 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/GPU/oneAPI |
99646801 ns |
97442684 ns |
1.02 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU |
364324 ns |
330387 ns |
1.10 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
39729 ns |
39667 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
51375 ns |
51291 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
52437.5 ns |
52958.5 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
14000 ns |
13625 ns |
1.03 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
22811 ns |
20356 ns |
1.12 |
batchedmm(16, Bsize=128)/forward/GPU/oneAPI |
76357271 ns |
76663129 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU |
85726 ns |
98364 ns |
0.87 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
42979 ns |
36375.5 ns |
1.18 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
31833.5 ns |
31417 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
32125 ns |
31229.5 ns |
1.03 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
65271 ns |
57000 ns |
1.15 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
184581 ns |
184178 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/GPU/oneAPI |
112866310 ns |
111708023 ns |
1.01 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU |
405654 ns |
355254 ns |
1.14 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
3500 ns |
1750 ns |
2.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
3625 ns |
2042 ns |
1.78 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
4042 ns |
2208 ns |
1.83 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
3584 ns |
1875 ns |
1.91 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
19709 ns |
19575 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI |
1207802 ns |
1219758.5 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU |
29141 ns |
29099.5 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
4208 ns |
2208 ns |
1.91 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
4333 ns |
2167 ns |
2.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
4583 ns |
2375 ns |
1.93 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
4250 ns |
2208 ns |
1.92 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
197529 ns |
198996.5 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI |
10465817 ns |
8766738.5 ns |
1.19 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU |
136751 ns |
128571 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6250 ns |
4583 ns |
1.36 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4542 ns |
4417 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
6229 ns |
6729 ns |
0.93 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4833 ns |
3958 ns |
1.22 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
141517 ns |
143699.5 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
5697543 ns |
5704411.5 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
61831 ns |
61955.5 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8375 ns |
8334 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8583 ns |
8083.5 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8666.5 ns |
8709 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8791 ns |
8583 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
825195 ns |
836045.5 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
39430730.5 ns |
39725172 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
382689 ns |
364891 ns |
1.05 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
204959 ns |
54833 ns |
3.74 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
211312.5 ns |
55833 ns |
3.78 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
210042 ns |
55583 ns |
3.78 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
200833 ns |
56000 ns |
3.59 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
36707 ns |
36570 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1193144 ns |
1345223 ns |
0.89 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
208072 ns |
202568 ns |
1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
649875 ns |
476729 ns |
1.36 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
622959 ns |
494500 ns |
1.26 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
627750 ns |
494208 ns |
1.27 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
626583 ns |
641625 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
260696 ns |
259886 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
27024723 ns |
28017517.5 ns |
0.96 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
800078 ns |
705894 ns |
1.13 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
3314520.5 ns |
3310333 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
2333042 ns |
2334062.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
2334667 ns |
2333375 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
6298459 ns |
6300479 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
205748 ns |
204581.5 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/GPU/oneAPI |
76861698.5 ns |
77398976 ns |
0.99 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU |
216863 ns |
373097 ns |
0.58 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
11451687 ns |
11459729 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
8308625 ns |
8305729.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
8341250 ns |
8342854 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
21350500 ns |
21088292 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
733789 ns |
744676 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/GPU/oneAPI |
121292322.5 ns |
121497637 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU |
1069846 ns |
1994797.5 ns |
0.54 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6250.5 ns |
4833 ns |
1.29 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
4625 ns |
4646 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6562 ns |
7520.5 ns |
0.87 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5604.5 ns |
4917 ns |
1.14 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
131336.5 ns |
133339 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
5403092 ns |
5450569.5 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
55485.5 ns |
61520 ns |
0.90 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8458 ns |
7083 ns |
1.19 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
10417 ns |
7291.5 ns |
1.43 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7500 ns |
7500 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
8167 ns |
7416.5 ns |
1.10 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
720749 ns |
725863 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
38254432 ns |
33872141 ns |
1.13 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
369044 ns |
353680 ns |
1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
146937.5 ns |
100459 ns |
1.46 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
119354 ns |
123042 ns |
0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
99458 ns |
102417 ns |
0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
119750 ns |
121458.5 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
151150 ns |
151940.5 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
6157451.5 ns |
5695179 ns |
1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
182732 ns |
233346 ns |
0.78 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2012395.5 ns |
2033271 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2034875 ns |
2026417 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2027292 ns |
1997458.5 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2040354 ns |
2041833 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
677835 ns |
678763 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31280000.5 ns |
31810809 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1115061 ns |
931831 ns |
1.20 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
34166 ns |
32666 ns |
1.05 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
36583 ns |
36562.5 ns |
1.00 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
36125 ns |
36167 ns |
1.00 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
583 ns |
667 ns |
0.87 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
16380 ns |
15627 ns |
1.05 |
batchedmm(2, Bsize=4)/forward/GPU/oneAPI |
72098869 ns |
72187220 ns |
1.00 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU |
78340 ns |
70121 ns |
1.12 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
2959 ns |
2604.5 ns |
1.14 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
3500 ns |
2958 ns |
1.18 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
3083 ns |
2937.5 ns |
1.05 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
2500 ns |
2167 ns |
1.15 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
139410.5 ns |
139744 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/GPU/oneAPI |
92975756 ns |
92749943 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU |
338763.5 ns |
289641 ns |
1.17 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7333 ns |
7208 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6042 ns |
6000 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5958 ns |
5916 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9875 ns |
9917 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
36097 ns |
35855 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1203254 ns |
1252207 ns |
0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
47460 ns |
53911 ns |
0.88 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
244875 ns |
212958.5 ns |
1.15 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221041.5 ns |
222708 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
221708 ns |
219917 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
211396 ns |
206209 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
243135 ns |
243430 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
25810257 ns |
27468024.5 ns |
0.94 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
502405 ns |
513269 ns |
0.98 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3750 ns |
3750 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3750 ns |
3750 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3750 ns |
3750 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3750 ns |
3791 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
22535 ns |
21959 ns |
1.03 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI |
1978273 ns |
2194149 ns |
0.90 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU |
43281 ns |
35557 ns |
1.22 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14459 ns |
14500 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14709 ns |
14500 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14667 ns |
14500 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14500 ns |
14459 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
303497 ns |
302419 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI |
11149007.5 ns |
11036089 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
195222 ns |
179841 ns |
1.09 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
128875 ns |
128041 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
127875 ns |
144417 ns |
0.89 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
103500 ns |
106917 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
118729 ns |
151959 ns |
0.78 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
135839 ns |
140874 ns |
0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5908414 ns |
5963081 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
168882 ns |
236762 ns |
0.71 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1884000 ns |
1924583 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1930708 ns |
1920500 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1926583.5 ns |
1914229.5 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1718687.5 ns |
1928875 ns |
0.89 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
666777 ns |
673452 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
30322056 ns |
29935915 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1214247.5 ns |
899671 ns |
1.35 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18000 ns | 17333 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18166.5 ns | 17354.5 ns | 1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20292 ns | 21208 ns | 0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18167 ns | 17375 ns | 1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 107411 ns | 108833.5 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3376295 ns | 3415955 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78171 ns | 91100 ns | 0.86 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 239666.5 ns | 216917 ns | 1.10 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218395.5 ns | 252646 ns | 0.86 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 223333 ns | 222166 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 231958.5 ns | 229125 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 503307.5 ns | 508535.5 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 21773470.5 ns | 19323488.5 ns | 1.13 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 479765 ns | 419764 ns | 1.14 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 27875 ns | 24271 ns | 1.15 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 32604.5 ns | 30791.5 ns | 1.06 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 29749.5 ns | 29437.5 ns | 1.01 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1209 ns | 1584 ns | 0.76 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16715.5 ns | 16398 ns | 1.02 |
batchedmm(16, Bsize=4)/forward/GPU/oneAPI | 71649972 ns | 72518390 ns | 0.99 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 86706 ns | 76093 ns | 1.14 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 5167 ns | 4500 ns | 1.15 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 5750 ns | 4916 ns | 1.17 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5208 ns | 5125 ns | 1.02 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 5145.5 ns | 4625 ns | 1.11 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 203064.5 ns | 204364 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/GPU/oneAPI | 93125518 ns | 94073985 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 389114 ns | 331675 ns | 1.17 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 226729.5 ns | 222666 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 221083 ns | 220666.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 224687.5 ns | 225667 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 222958 ns | 220583 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 220347 ns | 222506.5 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7691389.5 ns | 7881934.5 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 273573 ns | 267871 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 546375 ns | 495084 ns | 1.10 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 532104.5 ns | 511812.5 ns | 1.04 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 551562.5 ns | 500854 ns | 1.10 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 560666.5 ns | 675750 ns | 0.83 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1047561 ns | 1053634 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 45073460 ns | 42862742 ns | 1.05 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 850834 ns | 780999 ns | 1.09 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19687.5 ns | 20375 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19792 ns | 20000 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21500 ns | 23875 ns | 0.90 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20854.5 ns | 18792 ns | 1.11 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 114522.5 ns | 114286 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3434961.5 ns | 3510843 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79215.5 ns | 89858 ns | 0.88 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 214500.5 ns | 212375 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218000 ns | 213041 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220833.5 ns | 214458 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 226166.5 ns | 212541 ns | 1.06 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 722949 ns | 727333.5 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25437401 ns | 24570511 ns | 1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 539225.5 ns | 469036 ns | 1.15 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6604 ns | 6666 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6333.5 ns | 6604.5 ns | 0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8334 ns | 8750.5 ns | 0.95 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5937 ns | 6208 ns | 0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 134733 ns | 137142 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5747111 ns | 5605207 ns | 1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 65771 ns | 60974 ns | 1.08 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11834 ns | 9791 ns | 1.21 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 14209 ns | 10084 ns | 1.41 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10583 ns | 10750 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11375 ns | 10750 ns | 1.06 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 785256 ns | 794651.5 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 33719661 ns | 37034174 ns | 0.91 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 375714 ns | 370101.5 ns | 1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4562 ns | 4666 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4874.5 ns | 4708 ns | 1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7042 ns | 7437.5 ns | 0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6084 ns | 4917 ns | 1.24 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 137237 ns | 138544.5 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5382212 ns | 5520602 ns | 0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 56651 ns | 59692 ns | 0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8229.5 ns | 7458 ns | 1.10 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7709 ns | 7166 ns | 1.08 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7833 ns | 7791 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7937.5 ns | 7708 ns | 1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 748170 ns | 755761 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 39081859 ns | 37179182 ns | 1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 390474 ns | 376523 ns | 1.04 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14481645.5 ns | 14498417 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10092792 ns | 10124125 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10114250 ns | 10094833 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27708083 ns | 27748583.5 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 532624 ns | 532665 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/oneAPI | 94993046.5 ns | 94795139 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 395044 ns | 866850 ns | 0.46 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46261583.5 ns | 46333437 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33410959 ns | 33447541.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33486333 ns | 33510458 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 86587583 ns | 85445667 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2654436 ns | 2636151 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/GPU/oneAPI | 194923650.5 ns | 192783631 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3295955 ns | 5189385.5 ns | 0.64 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 190708.5 ns | 66458 ns | 2.87 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 186083.5 ns | 65687.5 ns | 2.83 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 188146 ns | 70500 ns | 2.67 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 185917 ns | 66500 ns | 2.80 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 118475.5 ns | 118172.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3648822.5 ns | 3662360 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 231833 ns | 237313 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 639375 ns | 467958 ns | 1.37 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 598729.5 ns | 480333.5 ns | 1.25 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 616416.5 ns | 474916.5 ns | 1.30 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 595166.5 ns | 686583.5 ns | 0.87 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 713561 ns | 715446 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25853973 ns | 26609747 ns | 0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 789453.5 ns | 655875 ns | 1.20 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 667 ns | 542 ns | 1.23 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 708 ns | 625 ns | 1.13 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 583 ns | 1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 666 ns | 500 ns | 1.33 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 33227.5 ns | 32877 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1196192 ns | 1227269 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 49141 ns | 47579 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9208 ns | 8750 ns | 1.05 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10375 ns | 9208 ns | 1.13 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9916 ns | 9104.5 ns | 1.09 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12625 ns | 9750 ns | 1.29 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 282925 ns | 280778.5 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21246808 ns | 21881943 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 372549 ns | 355484 ns | 1.05 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26291 ns | 9500 ns | 2.77 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26291 ns | 9500 ns | 2.77 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26667 ns | 9500 ns | 2.81 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26292 ns | 9500 ns | 2.77 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23969 ns | 23273 ns | 1.03 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2066696 ns | 1862112.5 ns | 1.11 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 211022 ns | 200655 ns | 1.05 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67167 ns | 50209 ns | 1.34 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67459 ns | 50250 ns | 1.34 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 68250 ns | 50500 ns | 1.35 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67604.5 ns | 72375 ns | 0.93 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 278435 ns | 278469.5 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11499929 ns | 13204061 ns | 0.87 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 607747 ns | 491037 ns | 1.24 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203667 ns | 54917 ns | 3.71 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 211000 ns | 55667 ns | 3.79 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209209 ns | 55584 ns | 3.76 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199875 ns | 56000 ns | 3.57 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 27769 ns | 28169 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1427804.5 ns | 1174691 ns | 1.22 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 204902 ns | 203240 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 647833.5 ns | 518854 ns | 1.25 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 672374.5 ns | 500625 ns | 1.34 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 627625 ns | 497750 ns | 1.26 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 591021 ns | 643417 ns | 0.92 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 238384 ns | 238777 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 32020482 ns | 31628121.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 835558 ns | 758938 ns | 1.10 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 677042 ns | 655042 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 644084 ns | 613083 ns | 1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 624417 ns | 652541 ns | 0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 652000 ns | 678416.5 ns | 0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 191709.5 ns | 192069 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8119814 ns | 8140636 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 250313 ns | 269704 ns | 0.93 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2245166.5 ns | 2167104.5 ns | 1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2263083.5 ns | 2233125 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2243937 ns | 2241292 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1633354 ns | 2230208.5 ns | 0.73 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 919212 ns | 929752.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 49553095.5 ns | 55073105 ns | 0.90 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1363434 ns | 1217770.5 ns | 1.12 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24313 ns | 19500 ns | 1.25 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19166.5 ns | 19208.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22000 ns | 23542 ns | 0.93 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20124.5 ns | 20000 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 110904.5 ns | 111306 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3666568 ns | 3589059.5 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78541 ns | 91551 ns | 0.86 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 261875 ns | 220459 ns | 1.19 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 240917 ns | 226458 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 231187 ns | 223104.5 ns | 1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 233562.5 ns | 219708 ns | 1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 712476 ns | 714110 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 24685366 ns | 26626181 ns | 0.93 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 557241 ns | 487481 ns | 1.14 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 625 ns | 0.80 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 708 ns | 583 ns | 1.21 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 584 ns | 1.14 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 500 ns | 1.25 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23419 ns | 23491 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1200557 ns | 1232519 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47620 ns | 43771 ns | 1.09 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9917 ns | 9417 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 11250 ns | 9291.5 ns | 1.21 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10937.5 ns | 9708 ns | 1.13 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10666 ns | 9646 ns | 1.11 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 262591 ns | 261581 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 23985377 ns | 23734390 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 399964 ns | 381618 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7916.5 ns | 8917 ns | 0.89 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7958.5 ns | 7583 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10041.5 ns | 11854.5 ns | 0.85 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8041.5 ns | 9042 ns | 0.89 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 115917 ns | 115935.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3281620 ns | 3441325 ns | 0.95 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 66841 ns | 70456.5 ns | 0.95 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7916 ns | 8125 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8917 ns | 7542 ns | 1.18 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7875 ns | 8000 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9709 ns | 7292 ns | 1.33 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 480674 ns | 484010 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17996514 ns | 17813154.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 321543 ns | 302215 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2104.5 ns | 1417 ns | 1.49 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2375 ns | 1667 ns | 1.42 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2542 ns | 1959 ns | 1.30 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2396 ns | 1500 ns | 1.60 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 20098 ns | 20030 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1066690 ns | 1146657 ns | 0.93 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 190702 ns | 184144 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6292 ns | 3708 ns | 1.70 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6833 ns | 3625 ns | 1.88 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6750 ns | 3833 ns | 1.76 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6834 ns | 4917 ns | 1.39 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 211335.5 ns | 213101.5 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9983042 ns | 10511562.5 ns | 0.95 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 579151 ns | 524324.5 ns | 1.10 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 747417 ns | 148729 ns | 5.03 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 750542 ns | 128917 ns | 5.82 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 747271 ns | 129917 ns | 5.75 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 748709 ns | 235541 ns | 3.18 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 23157 ns | 22778 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1175098.5 ns | 1179919.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 36460.5 ns | 46868 ns | 0.78 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 793625 ns | 143645.5 ns | 5.52 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 774979 ns | 130875 ns | 5.92 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 776479 ns | 138417 ns | 5.61 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 811000 ns | 290021 ns | 2.80 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 209522 ns | 211960 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10334508 ns | 10741797 ns | 0.96 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 233752.5 ns | 223578 ns | 1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7333 ns | 7167 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5958 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6125 ns | 5958.5 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 10000 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33425 ns | 33236 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1255061.5 ns | 1203805 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50400 ns | 57207 ns | 0.88 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 261396.5 ns | 221249.5 ns | 1.18 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 261479.5 ns | 238542 ns | 1.10 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229333 ns | 264500 ns | 0.87 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 236062 ns | 213250 ns | 1.11 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 259516 ns | 259447 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 24445196 ns | 27707385 ns | 0.88 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 521036 ns | 530542 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 13417 ns | 13209 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12271.5 ns | 12166 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13875 ns | 13584 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11749.5 ns | 12667 ns | 0.93 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 133835.5 ns | 135078 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5374770 ns | 5685986 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 234562 ns | 227730.5 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23687.5 ns | 23917 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24417 ns | 24083.5 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25208 ns | 24750 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24542 ns | 30146 ns | 0.81 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 827187.5 ns | 833527 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 39569351 ns | 39963084.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 678787.5 ns | 615374.5 ns | 1.10 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9834 ns | 9271 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8979 ns | 9541 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10083 ns | 10375 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9146 ns | 9250 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 119836.5 ns | 119628 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3554576 ns | 3356719.5 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 71830.5 ns | 74940 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13583.5 ns | 14041 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15375 ns | 13958 ns | 1.10 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14625 ns | 14750 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14521 ns | 13459 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 637852 ns | 638262 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21976253 ns | 22466836 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 364954 ns | 344824 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9166 ns | 9666.5 ns | 0.95 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 8292 ns | 9208 ns | 0.90 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10625 ns | 10959 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9042 ns | 9083.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 118116 ns | 118521 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3346671 ns | 3571671.5 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 70810.5 ns | 79399 ns | 0.89 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12958 ns | 13416 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 14541.5 ns | 12416 ns | 1.17 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13583.5 ns | 13479.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 15312 ns | 12708 ns | 1.20 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 525394.5 ns | 530027 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20373782 ns | 19360325 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 338973.5 ns | 317163 ns | 1.07 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 30520.5 ns | 30896 ns | 0.99 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 33917 ns | 33813 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 32250 ns | 32249.5 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 1833 ns | 1875 ns | 0.98 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16956 ns | 16425 ns | 1.03 |
batchedmm(2, Bsize=128)/forward/GPU/oneAPI | 76609280 ns | 76985679 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 77851 ns | 76663 ns | 1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5291 ns | 5417 ns | 0.98 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5709 ns | 5000 ns | 1.14 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5792 ns | 5479.5 ns | 1.06 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6917 ns | 6270.5 ns | 1.10 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 137577 ns | 138278 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/GPU/oneAPI | 110674438 ns | 109824422.5 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 381429.5 ns | 340566 ns | 1.12 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 416 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 291 ns | 1.29 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 25257 ns | 25574 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1228478 ns | 1142450 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48831 ns | 45666 ns | 1.07 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6334 ns | 6458 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 7479.5 ns | 6375 ns | 1.17 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6959 ns | 6791.5 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7209 ns | 6458.5 ns | 1.12 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 184635.5 ns | 185923.5 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23837580 ns | 22900684.5 ns | 1.04 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 378815 ns | 365402.5 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5834 ns | 2084 ns | 2.80 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6000 ns | 2084 ns | 2.88 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5958 ns | 2083 ns | 2.86 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5958 ns | 2000 ns | 2.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 25900 ns | 26453 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1250594 ns | 1207656 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 206832 ns | 203645.5 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 20458 ns | 18041 ns | 1.13 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 23792 ns | 17166.5 ns | 1.39 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 22334 ns | 17750 ns | 1.26 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 22875 ns | 23458.5 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 266845.5 ns | 268326 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 25451299.5 ns | 24994377.5 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 692177 ns | 600702.5 ns | 1.15 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 166604 ns | 147875 ns | 1.13 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 148104.5 ns | 155437.5 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 154125 ns | 155125 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 178166 ns | 151708 ns | 1.17 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190719 ns | 190890.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7757870 ns | 7974634 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 193662 ns | 271146.5 ns | 0.71 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1331209 ns | 1321937.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1339083 ns | 1330625 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1319166 ns | 1308375 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1332625 ns | 1285166 ns | 1.04 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 860379 ns | 867140 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 45780672 ns | 45331705.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1115822 ns | 1006962 ns | 1.11 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 26708 ns | 25500 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 23334 ns | 23542 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27729 ns | 28708.5 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24333 ns | 24416.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 225932.5 ns | 226899 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8143765 ns | 7680667 ns | 1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 103591 ns | 128029 ns | 0.81 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 131042 ns | 125062.5 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 178667 ns | 165729.5 ns | 1.08 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 127000 ns | 125854.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 171834 ns | 180062 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 990219.5 ns | 998018.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 46480369 ns | 44411227 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 605656 ns | 568743 ns | 1.06 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 250 ns | 1.50 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23167 ns | 23453 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1211323.5 ns | 1190116 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48770 ns | 44533 ns | 1.10 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6167 ns | 6895.5 ns | 0.89 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8459 ns | 6458 ns | 1.31 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7041 ns | 6958 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6833 ns | 6520.5 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 201408.5 ns | 201834 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 23826399 ns | 23542895 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 383144 ns | 372536 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6292 ns | 5645.5 ns | 1.11 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5291 ns | 5375 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6478.5 ns | 7979 ns | 0.81 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5625 ns | 5166 ns | 1.09 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 137210.5 ns | 139838.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5528693 ns | 5619575.5 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 234353 ns | 229750 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9792 ns | 9958 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10500 ns | 10042 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10417 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9833 ns | 10854.5 ns | 0.91 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 854499 ns | 866511 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 41863490 ns | 43130156 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 668967 ns | 603858 ns | 1.11 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 708 ns | 2.30 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 708 ns | 2.30 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1584 ns | 750 ns | 2.11 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1584 ns | 667 ns | 2.37 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23312 ns | 22827 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2038595.5 ns | 2079377 ns | 0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 208892 ns | 202368 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5750 ns | 4834 ns | 1.19 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6125 ns | 4833 ns | 1.27 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6208 ns | 5125 ns | 1.21 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5750 ns | 6291 ns | 0.91 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 221857.5 ns | 222098 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9884849 ns | 9952955 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 583581 ns | 471721 ns | 1.24 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 9083 ns | 8750 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7562.5 ns | 7834 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9917 ns | 9375 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8104 ns | 7646 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 118095 ns | 117939.5 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3469707 ns | 3568146 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 70621 ns | 74409 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8334 ns | 8792 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10208 ns | 8583 ns | 1.19 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8667 ns | 8875 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8542 ns | 8083 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 565017 ns | 568724.5 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22916388 ns | 20842961 ns | 1.10 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 345694 ns | 335106 ns | 1.03 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 126854.5 ns | 126042 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 129334 ns | 129208 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 130000 ns | 129542 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 185917 ns | 180792 ns | 1.03 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46980 ns | 46423 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/GPU/oneAPI | 72077654 ns | 72616088 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 99391 ns | 101850 ns | 0.98 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 334333 ns | 315875 ns | 1.06 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 342437 ns | 334166.5 ns | 1.02 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 345333.5 ns | 323291.5 ns | 1.07 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 611000 ns | 609395.5 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 186967 ns | 187684 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/GPU/oneAPI | 93626369 ns | 93899553 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 501325.5 ns | 405833.5 ns | 1.24 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397417 ns | 397500 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288084 ns | 287979.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288083 ns | 288375 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756125 ns | 756000 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44378.5 ns | 43964 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1435945 ns | 1424885 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 80171 ns | 79439 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1455833 ns | 1461000 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1133250 ns | 1133834 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1127875 ns | 1129645.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2439437.5 ns | 2449292 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 253386 ns | 254140 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10085376 ns | 11042616 ns | 0.91 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 348033 ns | 254646 ns | 1.37 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 660041.5 ns | 626500 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 640166 ns | 657208.5 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 639708.5 ns | 649750.5 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 651937.5 ns | 642417 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 187406 ns | 185720.5 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8211515 ns | 8332264.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 232427.5 ns | 264649 ns | 0.88 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2450084 ns | 2452625 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2484375 ns | 2465208.5 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2446834 ns | 2459375 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2445125.5 ns | 2376375 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 943950 ns | 949649 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 51038800.5 ns | 53455476.5 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1448370.5 ns | 1323598 ns | 1.09 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 32750 ns | 32458 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 35459 ns | 36521 ns | 0.97 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34937 ns | 34833 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 834 ns | 959 ns | 0.87 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15874 ns | 15902 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/GPU/oneAPI | 73383109 ns | 73782106 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 77231 ns | 74499.5 ns | 1.04 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3104.5 ns | 3125 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 4125 ns | 3250 ns | 1.27 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3437.5 ns | 3375 ns | 1.02 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3292 ns | 3062.5 ns | 1.07 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 136649 ns | 137187.5 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/GPU/oneAPI | 98346436.5 ns | 98822060.5 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 353034 ns | 314258 ns | 1.12 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458833 ns | 436500 ns | 3.34 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1500667 ns | 438625 ns | 3.42 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1501812.5 ns | 438791 ns | 3.42 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1437625 ns | 445917 ns | 3.22 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 43141 ns | 42826 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1467999 ns | 1503651 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 241463 ns | 374379.5 ns | 0.64 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5133250 ns | 4140000 ns | 1.24 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5305083 ns | 4271375 ns | 1.24 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5266666.5 ns | 4270687.5 ns | 1.23 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5012583 ns | 5468750 ns | 0.92 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 234347 ns | 236201.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 36702447 ns | 36248116 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1231083 ns | 1135862 ns | 1.08 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3792 ns | 3750 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3791 ns | 3791 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3750 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3750 ns | 3709 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 35135 ns | 34158 ns | 1.03 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1216141 ns | 1274307 ns | 0.95 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 39590 ns | 41117 ns | 0.96 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15333 ns | 15375 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 16125 ns | 15334 ns | 1.05 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15584 ns | 15500 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15334 ns | 15250 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 257091.5 ns | 255579 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 8614070 ns | 8309435 ns | 1.04 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 171612 ns | 158606 ns | 1.08 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404458 ns | 404792 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 295917 ns | 295917 ns | 1 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 296334 ns | 295958 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760750 ns | 759750 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 114125 ns | 113245 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI | 1028506.5 ns | 1043498 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89181 ns | 91962 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1475687.5 ns | 1482854 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1156625 ns | 1158625 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1152792 ns | 1150334 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2466666.5 ns | 2466708 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 233821 ns | 236768.5 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI | 12310672 ns | 9725420.5 ns | 1.27 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 353244 ns | 298578 ns | 1.18 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1042 ns | 584 ns | 1.78 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 625 ns | 1.73 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1042 ns | 584 ns | 1.78 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 542 ns | 2.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 25063 ns | 25569 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1050546 ns | 1198679 ns | 0.88 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 208912 ns | 202679 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8334 ns | 8083 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10250 ns | 7792 ns | 1.32 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8625 ns | 8375 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8458 ns | 8437.5 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 206386 ns | 207068.5 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 26827043 ns | 25228707 ns | 1.06 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 684852.5 ns | 593474 ns | 1.15 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 833625 ns | 829375 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 616583.5 ns | 617667 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 618792 ns | 618667 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1443834 ns | 1544417 ns | 0.93 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 130567 ns | 130866 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/GPU/oneAPI | 74402337 ns | 74874331.5 ns | 0.99 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 165932 ns | 211214 ns | 0.79 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2682542 ns | 2686104.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1999375 ns | 1994542 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1997459 ns | 1998375 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4921208 ns | 4960479 ns | 0.99 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 234980 ns | 234509 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/oneAPI | 102325987.5 ns | 102181218 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 857239 ns | 831293.5 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 334 ns | 250 ns | 1.34 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32616 ns | 32562 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1250784 ns | 1276503 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 49801 ns | 48691 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6292 ns | 6333 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8208 ns | 6375 ns | 1.29 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6583 ns | 6667 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6625 ns | 6104.5 ns | 1.09 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 221633 ns | 227701 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19957066 ns | 21756022 ns | 0.92 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 361473 ns | 346728 ns | 1.04 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1788479 ns | 1760625 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1743999.5 ns | 1749875 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1729500 ns | 1744292 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1754416.5 ns | 1755166 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 188571.5 ns | 189332 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7643974 ns | 7765672 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 354044 ns | 413433 ns | 0.86 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4362854 ns | 4360416 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4448958 ns | 4366917 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4373500 ns | 4349104 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4374708 ns | 5705104 ns | 0.77 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 845544 ns | 849205 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 48180808 ns | 48802559 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1251573.5 ns | 1205562.5 ns | 1.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 7084 ns | 9604 ns | 0.74 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6959 ns | 6916 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7250 ns | 8208 ns | 0.88 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6875 ns | 6854 ns | 1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 22661 ns | 22924.5 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI | 1150024 ns | 1184238.5 ns | 0.97 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 36771 ns | 46437 ns | 0.79 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 48958 ns | 50604.5 ns | 0.97 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 71583 ns | 52166 ns | 1.37 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 32959 ns | 45458.5 ns | 0.73 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 45416 ns | 33312.5 ns | 1.36 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 209801 ns | 211538 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10775850 ns | 10576796.5 ns | 1.02 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 220142 ns | 226508 ns | 0.97 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 22145.5 ns | 21646 ns | 1.02 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 26041 ns | 26083.5 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 24917 ns | 24958.5 ns | 1.00 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5417 ns | 5291.5 ns | 1.02 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18543 ns | 18121 ns | 1.02 |
batchedmm(2, Bsize=512)/forward/GPU/oneAPI | 87577603.5 ns | 88732630 ns | 0.99 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 89291 ns | 73668 ns | 1.21 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 11896 ns | 12125 ns | 0.98 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 10875 ns | 10667 ns | 1.02 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 10895.5 ns | 10833 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 18166 ns | 18042 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 221059 ns | 221707 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/oneAPI | 150298637 ns | 148404121 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 380694 ns | 322703 ns | 1.18 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 406375 ns | 405917 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 297166 ns | 296791.5 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 296458 ns | 297167 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 757291.5 ns | 756709 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47570 ns | 46696 ns | 1.02 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI | 1357729 ns | 1393570.5 ns | 0.97 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89771 ns | 90770 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1484417 ns | 1487375 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1165583 ns | 1163500 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1160604 ns | 1157209 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2473104 ns | 2472417 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 284069.5 ns | 283340.5 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI | 13947031.5 ns | 11947586 ns | 1.17 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 374114 ns | 269032 ns | 1.39 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1485042 ns | 436458 ns | 3.40 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1528500 ns | 443270.5 ns | 3.45 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1528542 ns | 440750 ns | 3.47 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1464167 ns | 449000 ns | 3.26 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 54497 ns | 53940 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1041173 ns | 1027722 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 236502 ns | 323133 ns | 0.73 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5136792 ns | 4138541 ns | 1.24 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5286271 ns | 4268354.5 ns | 1.24 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5287958 ns | 4258750 ns | 1.24 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4972333.5 ns | 5475229.5 ns | 0.91 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 257449 ns | 255597 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31450700.5 ns | 31502698.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1207368 ns | 1132896.5 ns | 1.07 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28292 ns | 9333 ns | 3.03 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28875 ns | 8000 ns | 3.61 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28375 ns | 8000 ns | 3.55 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28375 ns | 13250 ns | 2.14 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 25060 ns | 23885 ns | 1.05 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2164878.5 ns | 1973050 ns | 1.10 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 212672 ns | 202528 ns | 1.05 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66709 ns | 49625 ns | 1.34 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66542 ns | 49667 ns | 1.34 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67875 ns | 49583 ns | 1.37 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66625 ns | 71667 ns | 0.93 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 339582 ns | 336641 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 12629760 ns | 13058534 ns | 0.97 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 650592 ns | 508895.5 ns | 1.28 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 111167 ns | 108270.5 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 90500 ns | 86167 ns | 1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86500 ns | 86500 ns | 1 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 122542 ns | 146083 ns | 0.84 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192161 ns | 192063 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5883194 ns | 5750624 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 185492 ns | 267851 ns | 0.69 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2014792 ns | 2018917 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2027520.5 ns | 2016937.5 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2013916 ns | 2011375 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1895917 ns | 2024000.5 ns | 0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 515387 ns | 511598 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 26786371 ns | 30563079 ns | 0.88 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 948350 ns | 860237 ns | 1.10 |
This comment was automatically generated by workflow using github-action-benchmark.
avik-pal changed the title from "fix: remove certain LV usage" to "refactor: move JuliaSIMD deps to extensions" on Oct 17, 2024
avik-pal force-pushed the ap/segfault branch 2 times, most recently from 68fc1b3 to 1c7ac61 (October 17, 2024 20:22)
avik-pal force-pushed the ap/segfault branch 4 times, most recently from 6634a80 to 27fa286 (October 17, 2024 23:08)
avik-pal changed the title from "refactor: move JuliaSIMD deps to extensions" to "refactor: move `JuliaSIMD` deps to extensions" on Oct 18, 2024