Commit

ci: run all pre-release
avik-pal authored Oct 18, 2024
1 parent 1e783df commit 33e5432
Showing 1 changed file with 10 additions and 10 deletions.
20 changes: 10 additions & 10 deletions .github/workflows/CIPreRelease.yml
@@ -32,16 +32,16 @@ jobs:
os:
- ubuntu-latest
test_group:
- # - "core_layers"
- # - "contrib"
- # - "helpers"
- # - "distributed"
- # - "normalize_layers"
- # - "others"
- # - "autodiff"
- # - "recurrent_layers"
- # - "eltype_match"
- # - "fluxcompat"
+ - "core_layers"
+ - "contrib"
+ - "helpers"
+ - "distributed"
+ - "normalize_layers"
+ - "others"
+ - "autodiff"
+ - "recurrent_layers"
+ - "eltype_match"
+ - "fluxcompat"
- "reactant"
steps:
- uses: actions/checkout@v4

1 comment on commit 33e5432

@github-actions
Contributor

Lux Benchmarks
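
The Ratio column in the table appears to be the Current time divided by the Previous time. Since both are in nanoseconds, a ratio above 1 would mean the current commit is slower and a ratio below 1 would mean it is faster. The sketch below checks that reading against two rows copied from the table. The `parse_row` and `ratio` helpers are hypothetical names introduced for illustration; they are not part of Lux or of the benchmark tooling.

```python
import re

# Hypothetical helper pattern: splits a flattened row into the benchmark name,
# the two timings in nanoseconds, and the ratio shown on the page.
ROW = re.compile(
    r"^(?P<name>.+?) (?P<cur>[\d.]+) ns (?P<prev>[\d.]+) ns (?P<ratio>[\d.]+)$"
)


def parse_row(line):
    """Return (name, current_ns, previous_ns, shown_ratio) for one table row."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"not a benchmark row: {line!r}")
    return m["name"], float(m["cur"]), float(m["prev"]), float(m["ratio"])


def ratio(current_ns, previous_ns):
    """Assumed definition of the Ratio column: current time / previous time."""
    return current_ns / previous_ns


# Two rows copied verbatim from the table.
rows = [
    "Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 322687.5 ns 243583 ns 1.32",
    "Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 472146 ns 16373020.5 ns 0.028836829465888718",
]

for row in rows:
    name, cur, prev, shown = parse_row(row)
    print(f"{name}: computed {ratio(cur, prev):.4f}, shown {shown}")
```

Under this reading, the first row is roughly a third slower on the current commit, while the second is more than an order of magnitude faster.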

Benchmark suite Current: 33e5432 Previous: 1e783df Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 411833 ns 410479.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 322270.5 ns 322979 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 322687.5 ns 243583 ns 1.32
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 739792 ns 740125 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43717 ns 43310 ns 1.01
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 592458 ns 1312625 ns 0.45
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 485750 ns 2418334 ns 0.20
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 472146 ns 16373020.5 ns 0.028836829465888718
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 916416 ns 958000 ns 0.96
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 193389 ns 190740 ns 1.01
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 732083 ns 1378500 ns 0.53
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 630020.5 ns 2610979.5 ns 0.24
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 590250 ns 16066041 ns 0.036738982553324744
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 1008000 ns 967958 ns 1.04
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1531625.5 ns 1773750 ns 0.86
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1199500 ns 1093875 ns 1.10
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1370166 ns 1520104 ns 0.90
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2432729.5 ns 2458417 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 211497 ns 209499 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12247917 ns 12121583 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9551854.5 ns 8834833 ns 1.08
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9290625 ns 9223542 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 17955583 ns 17972771 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1916393.5 ns 1903079 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17351270.5 ns 17300562 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14353042 ns 13987625 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14309667 ns 14513146 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21080250 ns 21072834 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121821646 ns 250439208 ns 0.49
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174069521 ns 148115625 ns 1.18
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148056167 ns 117228750 ns 1.26
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106139667 ns 104041542 ns 1.02
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5478633 ns 5463821 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 596837750 ns 1224682250 ns 0.49
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 543667792 ns 933837625 ns 0.58
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 445085375 ns 835803479 ns 0.53
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 626736625 ns 628560812 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 38176542 ns 35032007 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 652965479.5 ns 1141719792 ns 0.57
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 674093584 ns 983678666.5 ns 0.69
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 632863021 ns 1377974646 ns 0.46
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 743445292 ns 746244021 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 849625 ns 1114917 ns 0.76
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 832854.5 ns 1628542 ns 0.51
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1217000 ns 4086771 ns 0.30
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 966042 ns 959792 ns 1.01
lenet(28, 28, 1, 32)/forward/GPU/CUDA 266296.5 ns 272035 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2721500 ns 2981354.5 ns 0.91
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2466917 ns 4115937.5 ns 0.60
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3314395.5 ns 9608958 ns 0.34
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3364958.5 ns 3297500.5 ns 1.02
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1061958 ns 1076584 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2259875 ns 2355125 ns 0.96
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1580250 ns 1453000 ns 1.09
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1752416.5 ns 1602646 ns 1.09
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3779541 ns 3770125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212874 ns 215196 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 20464770.5 ns 20246500 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 17681833 ns 16965833.5 ns 1.04
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 17968916 ns 18330417 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 26220958.5 ns 26150209 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1983562 ns 1980657 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 44361875 ns 44324250 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 42037625 ns 41015042 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 41240937.5 ns 41295750 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 47003375 ns 47634416 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4301083.5 ns 4656667 ns 0.92
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2876167 ns 2867250 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2986437.5 ns 2754917 ns 1.08
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 7412625 ns 7179750 ns 1.03
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 515223 ns 515735.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 40138542 ns 40447166.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 34883937.5 ns 33885499.5 ns 1.03
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 33862542 ns 34257187.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 51421084 ns 51082812.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2979770 ns 3174195 ns 0.94
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 88409354.5 ns 109744583 ns 0.81
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 84462416 ns 135227938 ns 0.62
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 83166916.5 ns 270381750 ns 0.31
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 93812228.5 ns 95391167 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 143119041 ns 270563333 ns 0.53
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 186909958.5 ns 161054417 ns 1.16
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 160607000 ns 125340042 ns 1.28
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 149056313 ns 146582812.5 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7091795 ns 7052057 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 876576041.5 ns 1502349770.5 ns 0.58
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 819011417 ns 1201703584 ns 0.68
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 713621416.5 ns 1090436625 ns 0.65
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1026954750.5 ns 1030635583 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33962668 ns 33863530 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1654338292 ns 2004525437 ns 0.83
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1556399750 ns 1793970792 ns 0.87
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1456365229 ns 2094682166.5 ns 0.70
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1581565875 ns 1594796917 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1500042 ns 1816417 ns 0.83
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1281708 ns 2535417 ns 0.51
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1629875 ns 9580729.5 ns 0.17
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2163395.5 ns 2124083 ns 1.02
lenet(28, 28, 1, 128)/forward/GPU/CUDA 262650.5 ns 265598 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7601959 ns 9396125 ns 0.81
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6596916 ns 11490250 ns 0.57
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7128375 ns 25636708 ns 0.28
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10476396 ns 10456812.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1087771 ns 1095109 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 185964437.5 ns 381007729.5 ns 0.49
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 146352312.5 ns 283558854 ns 0.52
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 130050146 ns 264714708 ns 0.49
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 179543416.5 ns 179954521 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4845696 ns 4874412 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 643688917 ns 1154043958 ns 0.56
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 604191917 ns 991918083 ns 0.61
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 537019041 ns 1078324541 ns 0.50
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 663244750 ns 668069084 ns 0.99
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16664478 ns 16315510 ns 1.02
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1073937.5 ns 1054520.5 ns 1.02
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 979688 ns 1957562.5 ns 0.50
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1338583 ns 6624334 ns 0.20
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1380812 ns 1352146 ns 1.02
lenet(28, 28, 1, 64)/forward/GPU/CUDA 265966 ns 267010 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6009021 ns 6499937.5 ns 0.92
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4658625 ns 13781958 ns 0.34
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4922187.5 ns 20923250 ns 0.24
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5723978.5 ns 5707062.5 ns 1.00
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1137942.5 ns 1115597.5 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23733624.5 ns 70442792 ns 0.34
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 35284771.5 ns 43467103.5 ns 0.81
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37100750.5 ns 39734999.5 ns 0.93
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35260167 ns 35200125 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1834016 ns 1845136 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184898625 ns 356138708 ns 0.52
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 160642834 ns 270050583 ns 0.59
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 144248000 ns 254207104 ns 0.57
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 271530583 ns 271696541.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16393096 ns 16499812 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 296257000 ns 395249958 ns 0.75
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 245304833 ns 396501625 ns 0.62
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 301408687 ns 738492916.5 ns 0.41
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 446273791 ns 447067000 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 656873875 ns 1189294541 ns 0.55
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 433591937.5 ns 689030520.5 ns 0.63
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 402349417 ns 650962625 ns 0.62
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 677798728.5 ns 681961562 ns 0.99
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12482697 ns 12470086 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1891955437.5 ns 3681028375 ns 0.51
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1637549708 ns 2822971000 ns 0.58
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1514000729 ns 2698825750 ns 0.56
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2113439354.5 ns 2121646854.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49760182 ns 49909051 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3046500 ns 3408458 ns 0.89
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2098166 ns 2063208 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2287292 ns 2518458 ns 0.91
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4866125 ns 4888750 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 582507.5 ns 580004.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25579833 ns 25958666 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 20277104 ns 18964292 ns 1.07
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19545458 ns 19447166.5 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36687292 ns 36745416.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2979368 ns 3191777 ns 0.93
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 35578625 ns 55195125 ns 0.64
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28390167 ns 81683979.5 ns 0.35
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 30144895.5 ns 174851250 ns 0.17
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42776229 ns 42883916.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1650667 ns 1788312.5 ns 0.92
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1204458 ns 1100250 ns 1.09
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1396750 ns 1558396 ns 0.90
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2509645.5 ns 2464688 ns 1.02
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 218107 ns 215197 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12697333 ns 12518625 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9973959 ns 9205333 ns 1.08
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9758687 ns 9628104 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18284458 ns 18331625 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1944527.5 ns 1949026.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17688854 ns 17616875 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14754291 ns 14310166 ns 1.03
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14674374.5 ns 14557291.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21468083.5 ns 21449812.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23681167 ns 70367541.5 ns 0.34
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34404604 ns 43412916.5 ns 0.79
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37545958 ns 39742938 ns 0.94
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35268000 ns 35448542 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1848561 ns 1795063 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 190505958.5 ns 360004208 ns 0.53
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 237366917 ns 346542937 ns 0.68
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 194090667 ns 307664333.5 ns 0.63
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 460122917 ns 463480458 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13928578 ns 13962488.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 301146020.5 ns 418770999.5 ns 0.72
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 250240417 ns 421592709 ns 0.59
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 308748000 ns 780166249.5 ns 0.40
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 395462625 ns 393782854 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1916083.5 ns 1880375 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1556917 ns 1570562.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1579625 ns 1246416.5 ns 1.27
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2659291.5 ns 2596208.5 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 570148 ns 564741 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 6146812.5 ns 9321042 ns 0.66
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 5943834 ns 13025292 ns 0.46
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 5926041 ns 33090166 ns 0.18
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 6788041.5 ns 6518396.5 ns 1.04
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1353691.5 ns 1351683.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 18785021 ns 22256291 ns 0.84
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 19131625 ns 27788229 ns 0.69
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 19125833 ns 54815104 ns 0.35
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 15678041 ns 15723000 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 68937 ns 660437.5 ns 0.10
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 68625 ns 564125.5 ns 0.12
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 70792 ns 1067959 ns 0.06628718892766483
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 69854 ns 68833 ns 1.01
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 47405.5 ns 48015 ns 0.99
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 287792 ns 1518999.5 ns 0.19
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 312812.5 ns 1050917 ns 0.30
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 280416 ns 1571000 ns 0.18
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 281521 ns 325084 ns 0.87
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 211915 ns 216110 ns 0.98
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 444500 ns 1555895.5 ns 0.29
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 448250 ns 1060292 ns 0.42
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 391667 ns 1624541 ns 0.24
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 357041.5 ns 374750 ns 0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3044791 ns 3421708 ns 0.89
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2094645.5 ns 2057375 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2278916.5 ns 2472729 ns 0.92
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4567208 ns 4540646 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 585440 ns 585099 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23578062.5 ns 24053333 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18085666 ns 17186833 ns 1.05
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16978625 ns 17114833.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 34976833 ns 35115834 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2912837 ns 3096781.5 ns 0.94
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33419374.5 ns 53599104 ns 0.62
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27788708 ns 80093333 ns 0.35
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27373667 ns 172009854 ns 0.16
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42059688 ns 42254666 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 118607334 ns 249876333.5 ns 0.47
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173693458.5 ns 148299229 ns 1.17
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147902833 ns 116785208 ns 1.27
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 108303292 ns 106758125 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5451158 ns 5452339 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 470478958 ns 1100542291 ns 0.43
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 467481645.5 ns 855735416.5 ns 0.55
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 434223083.5 ns 831274375 ns 0.52
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 737222479.5 ns 738168166.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 35181339 ns 32317772.5 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 635200500 ns 1001895729 ns 0.63
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 665043396 ns 966598875 ns 0.69
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 582947041.5 ns 1307543687 ns 0.45
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 731724375 ns 738405458 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1304833 ns 1230583 ns 1.06
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 937167 ns 962250 ns 0.97
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 903709 ns 796604 ns 1.13
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2036958 ns 2036541 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 564089 ns 567146.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2960625 ns 5691500 ns 0.52
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2635667 ns 6401396 ns 0.41
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2619417 ns 25408000 ns 0.10
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3698292 ns 3697229 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1319613 ns 1332396 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 6561416 ns 9370333 ns 0.70
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 6499959 ns 13058291 ns 0.50
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 6497875 ns 32481708 ns 0.20
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 4438375 ns 4424396 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 39271 ns 390896 ns 0.10
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 32458.5 ns 458604 ns 0.07077674856739148
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 32062.5 ns 2946292 ns 0.01088232259395878
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 54437.5 ns 54375 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 27919 ns 28214 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 179042 ns 360312.5 ns 0.50
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 175541 ns 439417 ns 0.40
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 175167 ns 5063292 ns 0.034595476618768974
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 190708.5 ns 190708 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 219938 ns 219423.5 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 442334 ns 632709 ns 0.70
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 463458.5 ns 711770.5 ns 0.65
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 442417 ns 5249812.5 ns 0.0842729145088515
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 429500 ns 429750 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 13562.5 ns 335333.5 ns 0.04044481091212181
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 13437.5 ns 393604 ns 0.03413964288980803
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 14416 ns 765792 ns 0.018824955079185992
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 14375 ns 13458 ns 1.07
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 28121 ns 28223 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 25917 ns 286125 ns 0.09057929226736566
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 25667 ns 310708 ns 0.08260810793413752
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 25625 ns 733437.5 ns 0.03493821900298253
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 26250 ns 25916 ns 1.01
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 209865 ns 209427 ns 1.00
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 45437.5 ns 302000 ns 0.15
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 46479.5 ns 328375 ns 0.14
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 46041 ns 842791.5 ns 0.054629169848058504
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 28209 ns 28333 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 318266167 ns 602432125 ns 0.53
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 238108104 ns 430731937.5 ns 0.55
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 203733333 ns 392016750 ns 0.52
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 322939875 ns 322757833 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7668589 ns 7676293 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1098692854.5 ns 2003927916.5 ns 0.55
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 952627249.5 ns 1623931938 ns 0.59
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 856876291 ns 1626427584 ns 0.53
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1173710250 ns 1179210042 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 27280510.5 ns 27131071 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 193124.5 ns 523645.5 ns 0.37
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 168542 ns 450709 ns 0.37
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 168187.5 ns 2446250 ns 0.06875319366377108
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 218458.5 ns 219187.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 47292 ns 47774.5 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1214729 ns 1875042 ns 0.65
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 1095750 ns 2602792 ns 0.42
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 1014896 ns 16587416.5 ns 0.06118469383101341
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 1504666 ns 1501583 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 222578.5 ns 226318.5 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 2298292 ns 2982667 ns 0.77
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 2283250 ns 5736062.5 ns 0.40
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 2158334 ns 17019146 ns 0.13
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 2476833 ns 2470812.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1582437.5 ns 1498583 ns 1.06
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1264833 ns 1193771 ns 1.06
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1174562.5 ns 1029042 ns 1.14
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2357375 ns 2235875 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 571094.5 ns 572216 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3197541 ns 5950125 ns 0.54
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2843042 ns 4653916 ns 0.61
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2853458 ns 27167500 ns 0.11
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3931104 ns 3927896 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1330355 ns 1342658.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 8842250 ns 11627667 ns 0.76
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 8776708 ns 14277520.5 ns 0.61
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 8804292 ns 36899542 ns 0.24
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 6342000 ns 6331458.5 ns 1.00
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 4625 ns 2333 ns 1.98
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2458 ns 2166 ns 1.13
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 2542 ns 3333 ns 0.76
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 2416 ns 2646 ns 0.91
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 24562 ns 25097 ns 0.98
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7125 ns 7333 ns 0.97
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7125 ns 7125 ns 1.00
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7417 ns 7375 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7292 ns 7250 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 186417 ns 189428.5 ns 0.98
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 8541 ns 8167 ns 1.05
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8500 ns 8250 ns 1.03
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 8709 ns 8542 ns 1.02
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 6125 ns 6083 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 10625 ns 10667 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 14792 ns 14041.5 ns 1.05
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 12000 ns 11125 ns 1.08
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 7500 ns 7333 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 24702.5 ns 25251 ns 0.98
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 21458 ns 21917 ns 0.98
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 21583 ns 21708.5 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 22042 ns 21750 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 21792 ns 21916 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 196629 ns 198645 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 56833 ns 53625 ns 1.06
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 59166 ns 53500 ns 1.11
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 57208 ns 53625 ns 1.07
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 54542 ns 54583 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 28687.5 ns 28395.5 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 28709 ns 28667 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 28792 ns 28417 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 46041 ns 46084 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 25795 ns 26326 ns 0.98
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 44250 ns 224125 ns 0.20
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 47667 ns 272959 ns 0.17
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 44000 ns 4409500 ns 0.010
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 63916 ns 65708 ns 0.97
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 167633.5 ns 170084 ns 0.99
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 68417 ns 240562 ns 0.28
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 68292 ns 290792 ns 0.23
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 68083 ns 4409209 ns 0.015
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 68125 ns 71541 ns 0.95
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 2500 ns 1708.5 ns 1.46
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 1750 ns 1792 ns 0.98
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 1792 ns 2541.5 ns 0.71
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 1708 ns 1917 ns 0.89
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 23041 ns 23384 ns 0.99
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5375 ns 5292 ns 1.02
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5083 ns 5291 ns 0.96
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5416 ns 5459 ns 0.99
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5125 ns 5208.5 ns 0.98
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 171497 ns 173533 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 8375 ns 7417 ns 1.13
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 8167 ns 7500 ns 1.09
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 8208 ns 7708 ns 1.06
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 5708 ns 5625 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 34068625 ns 81107833 ns 0.42
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 40361624.5 ns 49783792 ns 0.81
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 43432603.5 ns 43745208 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 56216958.5 ns 56305270.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2631639 ns 2634961 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 453239687.5 ns 620785875 ns 0.73
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 319327021 ns 429264250 ns 0.74
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 307674396 ns 416731125 ns 0.74
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 506119959 ns 507694646.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15174112 ns 15139001 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 735455458 ns 871599625 ns 0.84
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 706582229 ns 839558208.5 ns 0.84
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 743368604 ns 1206593209 ns 0.62
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 910398833 ns 921408813 ns 0.99
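
A note on reading the Ratio column: the values in the table are consistent with Current divided by Previous, so a ratio below 1 means the current commit ran faster and a ratio above 1 means it ran slower. The sketch below recomputes a few rows from the table under that assumption. The `ratio` helper and the `rows` dictionary are illustrative names, not part of github-action-benchmark.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current / previous; below 1.0 means the current run was faster."""
    if previous_ns <= 0:
        raise ValueError("previous timing must be positive")
    return current_ns / previous_ns


# (current ns, previous ns) pairs copied from rows of the table above.
rows = {
    "mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s)": (2853458, 27167500),
    "Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s)": (4625, 2333),
    "Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s)": (44000, 4409500),
}

for name, (current_ns, previous_ns) in rows.items():
    # Two decimals matches how most rows in the table are displayed.
    print(f"{name}: {ratio(current_ns, previous_ns):.2f}")
```

Each row here is one pair of timings, so a large swing on its own is only a prompt to look closer.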

This comment was automatically generated by workflow using github-action-benchmark.
