Commit

chore: bump version for release
avik-pal authored Oct 22, 2024
1 parent 0f4e2dd commit 1a0c48a
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "Lux"
uuid = "b2108857-7c20-44ae-9111-449ecde12c47"
authors = ["Avik Pal <[email protected]> and contributors"]
-version = "1.2.0-DEV"
+version = "1.2.0"

[deps]
ADTypes = "47edcb42-4c32-4615-8424-f2b9edc5f35b"

3 comments on commit 1a0c48a

@avik-pal
Member Author

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/117857

Tip: Release Notes

Did you know you can add release notes too? Add Markdown-formatted text underneath the comment, after the text
"Release notes:", and it will be added to the registry PR. If TagBot is installed, it will also be added to the
release that TagBot creates. For example:

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here, just re-invoke the registration command and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed. It can also be done manually through the GitHub interface, or via:

git tag -a v1.2.0 -m "<description of version>" 1a0c48aa3550accbfba749c127119c6db835f1d0
git push origin v1.2.0

@github-actions
Contributor

Lux Benchmarks
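
The Ratio column in the table that follows is consistent with Current divided by Previous (for example, 72166 / 70812.5 ≈ 1.02), so values above 1 mean the current commit ran slower than the previous one. A minimal Python sketch of that arithmetic, using three rows copied from the table; the `ratio` and `flag_regressions` helpers and the 1.5 threshold are illustrative assumptions, not part of the benchmark tooling:

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current / Previous; a value above 1 means the current commit is slower."""
    return current_ns / previous_ns


# (benchmark name, current ns, previous ns), copied from the table
rows = [
    ("Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s)", 72166, 70812.5),
    ("Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s)", 72208, 246375),
    ("Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s)", 716209, 72062.5),
]

# Hypothetical cut-off chosen for illustration only
THRESHOLD = 1.5


def flag_regressions(rows, threshold=THRESHOLD):
    """Return (name, rounded ratio) for rows whose ratio exceeds the threshold."""
    return [
        (name, round(ratio(cur, prev), 2))
        for name, cur, prev in rows
        if ratio(cur, prev) > threshold
    ]


flagged = flag_regressions(rows)
for name, r in flagged:
    print(f"{name}: {r}x slower than previous")
```

Single-sample timings like these can be noisy, so a large ratio is a prompt to re-run the benchmark rather than proof of a regression.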

Benchmark suite Current: 1a0c48a Previous: 1a701d2 Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 72166 ns 70812.5 ns 1.02
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 72208 ns 246375 ns 0.29
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 73792 ns 73979.5 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 716209 ns 72062.5 ns 9.94
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43839 ns 43880 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 235500 ns 308958.5 ns 0.76
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 292209 ns 793604 ns 0.37
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 233771 ns 233375 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2230250 ns 282166 ns 7.90
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 190422 ns 190452 ns 1.00
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 417042 ns 415000 ns 1.00
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 391834 ns 849833 ns 0.46
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 406666 ns 415875 ns 0.98
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2236292 ns 300084 ns 7.45
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1617833 ns 1656000 ns 0.98
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1192083 ns 1060416 ns 1.12
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1372417 ns 1381084 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3021645.5 ns 2397416 ns 1.26
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212820 ns 209181.5 ns 1.02
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12269375 ns 12352250 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9554312.5 ns 8849958.5 ns 1.08
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9312229 ns 9323292 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18597917 ns 18049625 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1906055 ns 1913367.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17342416.5 ns 17385417 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14370124.5 ns 13997625 ns 1.03
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14320000 ns 14382770.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21879042 ns 21102542 ns 1.04
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 119514416 ns 122037708.5 ns 0.98
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174474958 ns 148978417 ns 1.17
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148469291 ns 147416375 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 448126458.5 ns 103471542 ns 4.33
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5495054 ns 5473319 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 581881396 ns 584214291.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 535710708 ns 924299000 ns 0.58
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 443551604 ns 438064084 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1628778458 ns 630297125 ns 2.58
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34985718 ns 38161414 ns 0.92
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 743015729 ns 699393958.5 ns 1.06
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 671791166 ns 1004148063 ns 0.67
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 629284000 ns 601864833.5 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1731283646 ns 740760041 ns 2.34
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 860083.5 ns 878333 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 826167 ns 1659792 ns 0.50
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1216291.5 ns 1217875 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 790000 ns 959125 ns 0.82
lenet(28, 28, 1, 32)/forward/GPU/CUDA 263585.5 ns 265941.5 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2689458 ns 2700312.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2459645.5 ns 4392063 ns 0.56
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3304937 ns 3302937.5 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3169583 ns 3374167 ns 0.94
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1050300 ns 1052375 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6771791 ns 6807874.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6379542 ns 6268583 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6492417 ns 6489125 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 8179208 ns 7605063 ns 1.08
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 211697 ns 209645 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24030062 ns 24577000 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21863145.5 ns 21078000 ns 1.04
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21209895.5 ns 21705458 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 30339645.5 ns 29719834 ns 1.02
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1970788 ns 1972195 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37405667 ns 48847875 ns 0.77
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 34394458.5 ns 45336083.5 ns 0.76
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45812625 ns 34513729.5 ns 1.33
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49950334 ns 49269917 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13380646 ns 13343291 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12414229 ns 12534687.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12490042 ns 12487312 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 16341833 ns 14837625 ns 1.10
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 512690 ns 516082 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47504708 ns 47462208 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41894125 ns 41011646 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41039083 ns 41180667 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 60879833 ns 58172313 ns 1.05
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3230861.5 ns 3228529 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 74751500 ns 97347333.5 ns 0.77
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 68922833 ns 146675958 ns 0.47
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 91225334 ns 69089125 ns 1.32
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 101557250 ns 98763625 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 283992291.5 ns 286756312.5 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339835667 ns 316971625 ns 1.07
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 313710583.5 ns 313660292 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 612185542 ns 269927875 ns 2.27
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7106627 ns 7102725 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 971662042 ns 977913167 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 878835625 ns 1267309084 ns 0.69
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 824019771 ns 832686083.5 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2224878916.5 ns 1109108334 ns 2.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33878542 ns 33843856 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1404909708 ns 1781568125 ns 0.79
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1358005896 ns 2031944646 ns 0.67
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1650625792 ns 1227897000 ns 1.34
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2661202542 ns 1668096458 ns 1.60
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1539792 ns 1549791.5 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1272125 ns 3030500 ns 0.42
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1622375 ns 1627020.5 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2495749.5 ns 2159334 ns 1.16
lenet(28, 28, 1, 128)/forward/GPU/CUDA 269633 ns 273208.5 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7849125 ns 7876334 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6602687.5 ns 12090041.5 ns 0.55
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7142875 ns 7093625 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11764167 ns 10467416.5 ns 1.12
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1113950 ns 1133081.5 ns 0.98
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 192026500 ns 192676916.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 144990959 ns 311575249.5 ns 0.47
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 129332541.5 ns 127609750 ns 1.01
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 447329791.5 ns 177249416 ns 2.52
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4844326 ns 4872584 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 623445292 ns 784043583 ns 0.80
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 512510000 ns 1024411500 ns 0.50
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 592532958 ns 559332750 ns 1.06
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1242092000 ns 794744875 ns 1.56
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16447225 ns 16268331 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1063792 ns 1083500 ns 0.98
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 975750 ns 1648687.5 ns 0.59
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1351416.5 ns 1362000 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1388875 ns 1380541 ns 1.01
lenet(28, 28, 1, 64)/forward/GPU/CUDA 271754 ns 273825 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 4470209 ns 4431666.5 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 3848624.5 ns 6601874.5 ns 0.58
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4529125 ns 4608020.5 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 6089375 ns 5706917 ns 1.07
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1134621 ns 1148559 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23850583 ns 23836750 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34071208 ns 43746958 ns 0.78
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 36834208 ns 37494917 ns 0.98
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132533000 ns 35098834 ns 3.78
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1820300 ns 1834843 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184146437.5 ns 184322479.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 159330625 ns 270582042 ns 0.59
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 184542562.5 ns 185507041 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 647733209 ns 382851417 ns 1.69
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16521951 ns 16531854 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 292798042 ns 294425541 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 244693792 ns 374504875 ns 0.65
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 296969875.5 ns 293550625.5 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 646009938 ns 433623500 ns 1.49
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 762535917 ns 764247416 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 492838000 ns 816447791 ns 0.60
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 438050083.5 ns 432863312 ns 1.01
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1947787375 ns 859004750 ns 2.27
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12469626 ns 12470955 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 2013595563 ns 1863360604 ns 1.08
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1549070125 ns 2802913125 ns 0.55
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1579947145.5 ns 1500209542 ns 1.05
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 4927858875 ns 2077162895.5 ns 2.37
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49732109 ns 49776341 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3065875 ns 3069416.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2056250 ns 2083292 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2274916 ns 2310104 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6030541.5 ns 4825875 ns 1.25
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 578458 ns 579507 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25488666.5 ns 25429750 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 19719458 ns 18860542 ns 1.05
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18691208.5 ns 18995771 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 39310417 ns 36731291 ns 1.07
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3197549 ns 3198905 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34796812.5 ns 34791875 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 29570667 ns 81624291 ns 0.36
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 29341958 ns 29509583.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45325625 ns 42723125 ns 1.06
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1645833 ns 1656125 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1199459 ns 1096584 ns 1.09
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1394895.5 ns 1407125 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3030917 ns 2444459 ns 1.24
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214156 ns 215245.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12708499.5 ns 12711292 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9952917 ns 9213500 ns 1.08
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9680666 ns 9770917 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 19008687.5 ns 18346563 ns 1.04
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1955141 ns 1941213 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17685396 ns 17682833 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14735750 ns 14335354 ns 1.03
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14612083 ns 14654500 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22192833.5 ns 21449958 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23611041 ns 23917958 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34102333 ns 43787792 ns 0.78
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37338792 ns 37445542 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132620542 ns 35102208 ns 3.78
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1824553 ns 1837807 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 301585020.5 ns 302651708 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 228577270.5 ns 340997792 ns 0.67
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 191160459 ns 191785000 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 600527188 ns 389240750 ns 1.54
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13961516.5 ns 17931121 ns 0.78
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 297018625 ns 298716042 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 251347292 ns 431554958 ns 0.58
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 301596896 ns 297489208.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 651795542 ns 437979250 ns 1.49
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 2419583 ns 2420395.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 2415875 ns 3814687.5 ns 0.63
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 2283687.5 ns 2371562.5 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4100250 ns 2407083 ns 1.70
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 574636 ns 570784 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 6528500 ns 6550584 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 6517354.5 ns 10128729 ns 0.64
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 6521916.5 ns 6537000 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 11653375 ns 6522791 ns 1.79
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1367239 ns 1368166.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17536000.5 ns 17591937.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17510833 ns 21731104 ns 0.81
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17511291 ns 17540854 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 18970666 ns 14118125 ns 1.34
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 69583 ns 65625 ns 1.06
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 68459 ns 578000 ns 0.12
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 70042 ns 70583 ns 0.99
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 725125 ns 68770.5 ns 10.54
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 47418 ns 47926 ns 0.99
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 329958.5 ns 331792 ns 0.99
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 326187.5 ns 1017417 ns 0.32
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 285625 ns 316208 ns 0.90
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2298750 ns 331208 ns 6.94
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 210185 ns 214904 ns 0.98
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 440375 ns 446708 ns 0.99
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 445167 ns 1090500 ns 0.41
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 429896 ns 430604.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2260958 ns 374959 ns 6.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3056625 ns 3044333 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2083417 ns 2071416 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2246583 ns 2282063 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6008729.5 ns 4836833 ns 1.24
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 576345.5 ns 580025 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23667021 ns 23703041.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18070167 ns 17227208.5 ns 1.05
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18267041 ns 18457937.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 38420083 ns 36049312.5 ns 1.07
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3100732.5 ns 3101815 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34097125 ns 34744792 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27584000 ns 80182375 ns 0.34
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 29094416 ns 29270125 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44555479.5 ns 41879958 ns 1.06
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 119734792 ns 120065104.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173704250 ns 148623146 ns 1.17
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 147841375 ns 147898792 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447866083 ns 107010541 ns 4.19
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5452572 ns 5474909 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 472213958 ns 469653562.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 534077167 ns 926515542 ns 0.58
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 434948646 ns 432504750 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1735672750 ns 724938750 ns 2.39
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32301623 ns 35169265.5 ns 0.92
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 638421125 ns 638211020.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 660879000 ns 983543708 ns 0.67
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 598776187.5 ns 613930458.5 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1717902229 ns 724015375 ns 2.37
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 330166 ns 420167 ns 0.79
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 452791.5 ns 1717812.5 ns 0.26
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 329750 ns 330104 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2049250 ns 420291 ns 4.88
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 568084 ns 565510.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2007083.5 ns 2023542 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 1957667 ns 5126750 ns 0.38
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2009791.5 ns 2014333 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7110375 ns 2011167 ns 3.54
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1325105 ns 1320420 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5779208 ns 5781000 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5777292 ns 8875375 ns 0.65
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5772521 ns 5781291 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 7808917 ns 2873875 ns 2.72
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 104666.5 ns 102959 ns 1.02
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 104375 ns 536583 ns 0.19
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 105292 ns 104500 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 140750 ns 103770.5 ns 1.36
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 27866 ns 27852 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 208750 ns 210334 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 209458 ns 515750 ns 0.41
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 209708 ns 209292 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 333875 ns 209896 ns 1.59
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 219920 ns 219036 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 706667 ns 707208.5 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 706750 ns 1040041 ns 0.68
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 707166.5 ns 706750 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 807417 ns 687937.5 ns 1.17
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 12437.5 ns 13084 ns 0.95
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 13395.5 ns 441042 ns 0.03
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 14542 ns 13958 ns 1.04
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 54479.5 ns 13354 ns 4.08
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 27908 ns 27961 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 25854.5 ns 25792 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 25833 ns 321521 ns 0.08
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 25875 ns 25792 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 151125 ns 26083 ns 5.79
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 208325 ns 210134.5 ns 0.99
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 46458 ns 46084 ns 1.01
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 45042 ns 335833.5 ns 0.13
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 45959 ns 46083 ns 1.00
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 150875 ns 27042 ns 5.58
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 319195333 ns 320271709 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 234131917 ns 430953624.5 ns 0.54
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 279995959 ns 272585584 ns 1.03
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 864882791.5 ns 319252125 ns 2.71
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7664558.5 ns 7617926.5 ns 1.01
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1232626937.5 ns 1240631125 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1005847687.5 ns 1686654750.5 ns 0.60
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 784149062.5 ns 879062875 ns 0.89
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 3031224208 ns 1565071083 ns 1.94
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 27144525 ns 27143469 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 417750 ns 418250 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 416459 ns 622375 ns 0.67
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 422103.5 ns 430500 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 1068209 ns 415333 ns 2.57
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 47170.5 ns 47795.5 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1085083 ns 1080750 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 1064667 ns 1617583 ns 0.66
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 1085854.5 ns 1080791.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 3028125 ns 1082812.5 ns 2.80
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 223831 ns 225487.5 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 3098396 ns 3105916.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 3111521 ns 3686708 ns 0.84
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 3113958 ns 3109708 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 4879084 ns 3032000 ns 1.61
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 447667 ns 583375 ns 0.77
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 527917 ns 1019792 ns 0.52
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 584791 ns 581583 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2202771 ns 530292 ns 4.15
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 569915 ns 575876.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 2138042 ns 2132729 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2129688 ns 4908875 ns 0.43
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2137667 ns 2119979.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 7240666.5 ns 2135125 ns 3.39
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1331738 ns 1355203 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7933875 ns 7940395.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7937583 ns 10769750 ns 0.74
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7946916 ns 7927833.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 9720958.5 ns 4851791.5 ns 2.00
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 6958 ns 6541 ns 1.06
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 7000 ns 4375 ns 1.60
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 7666.5 ns 8229.5 ns 0.93
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 4583 ns 6875 ns 0.67
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 24682 ns 25194 ns 0.98
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7541 ns 7875 ns 0.96
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7084 ns 9250 ns 0.77
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7875 ns 7417 ns 1.06
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 9292 ns 7625 ns 1.22
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 186807 ns 191956.5 ns 0.97
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 9084 ns 9000 ns 1.01
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 9083 ns 9459 ns 0.96
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 9208.5 ns 9062.5 ns 1.02
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 7167 ns 6041 ns 1.19
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 20083 ns 19958 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 19458 ns 15709 ns 1.24
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 20750 ns 21312 ns 0.97
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 15292 ns 19583 ns 0.78
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 25020 ns 25221 ns 0.99
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 33959 ns 34042 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 33959 ns 30792 ns 1.10
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 33542 ns 33479.5 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 30625 ns 33792 ns 0.91
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 196006 ns 200953.5 ns 0.98
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 95104.5 ns 94500 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 94875 ns 91250 ns 1.04
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 94583.5 ns 94666 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 88541 ns 92125 ns 0.96
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 13500 ns 13208 ns 1.02
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 11792 ns 357500 ns 0.03
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 19042 ns 14646 ns 1.30
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 50750 ns 13458 ns 3.77
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 26010 ns 26262 ns 0.99
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 24250 ns 23667 ns 1.02
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 23625 ns 261229 ns 0.09
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 24125 ns 23333 ns 1.03
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 146750 ns 24125 ns 6.08
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 167198.5 ns 171941 ns 0.97
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 57250 ns 57250 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 57250 ns 283208 ns 0.20
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 57500 ns 57250 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 152209 ns 34083 ns 4.47
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 6584 ns 6250 ns 1.05
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 6542 ns 3417 ns 1.91
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 7479 ns 7834 ns 0.95
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 3292 ns 6167 ns 0.53
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 22935 ns 23362 ns 0.98
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5250 ns 5417 ns 0.97
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5334 ns 6875 ns 0.78
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5542 ns 5667 ns 0.98
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 7000 ns 5292 ns 1.32
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 170791 ns 176236 ns 0.97
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 9667 ns 9125 ns 1.06
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 9000 ns 8333 ns 1.08
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 9250 ns 9333 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 6083 ns 6083 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106049667 ns 107034791.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117278625 ns 128490375 ns 0.91
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120820500.5 ns 120207791 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 214863521 ns 118338292 ns 1.82
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2646634.5 ns 2634352 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 397432312.5 ns 397535521 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 375661416 ns 482736437.5 ns 0.78
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 403543791.5 ns 395708541 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 890588792 ns 634072166 ns 1.40
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15247832 ns 15152647.5 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 622643125 ns 805393791.5 ns 0.77
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 576393229 ns 959710667 ns 0.60
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 815424292 ns 630441458 ns 1.29
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 1176606500 ns 909742750 ns 1.29

This comment was automatically generated by workflow using github-action-benchmark.
