Benchmarks
We evaluate performance with six widely used standard models: VGG16, GoogleNet (Inception-V1), ResNet50, MobileNet, SqueezeNet, and DenseNet-121. Our test bed consists of six ARM devices: two cellphones, two server-class processors, and two embedded dev boards, whose specifications are listed in the following table. All results are averaged over 20 consecutive runs and reported in milliseconds.
Table 1: Specifications
Device | Processor | #CPUs @ Clock Speed | CPU Arch. | Memory | OS | SoC Power |
---|---|---|---|---|---|---|
Samsung S8 | Snapdragon 835 | 4 @ 2.45GHz + 4 @ 1.90GHz | Kryo | 4GB | Android 7.0 | ~5W |
iPhone 7 | A10 Fusion | 2 @ 2.34GHz + 2 @ 1.05GHz | Hurricane | 2GB | iOS 11.1 | ~5W |
Huawei D05 | Hi1616 | 2 * 32 @ 2.40GHz | Cortex-A72 | 256GB | Ubuntu 16.04 | >100W |
Phytium FT1500A/16 | FTC660 | 16 @ 1.50GHz | Earth | 64GB | Kylin 5.0 | 35W |
RK3399 | RK3399 | 2 @ 1.8GHz + 4 @ 1.40GHz | Cortex-A72 | 2GB | Debian | 6.05W |
Raspberry Pi3 | Broadcom BCM2837 | 4 @ 1.2GHz | Cortex-A53 | 1GB | Ubuntu 16.04 | ~5W |
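The measurement protocol described above (mean of 20 consecutive runs, in milliseconds) can be sketched as follows. This is a minimal illustration, not the project's benchmark harness: `average_latency_ms`, `dummy_inference`, and the optional `warmup` parameter are hypothetical names introduced here, and the warm-up step is an assumption that the text does not mention (it defaults to zero).

```python
import time
from statistics import mean


def average_latency_ms(run_inference, runs=20, warmup=0):
    """Return the mean wall-clock time, in milliseconds, of `runs` consecutive calls."""
    # Optional untimed warm-up calls (an assumption, not stated in the text).
    for _ in range(warmup):
        run_inference()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def dummy_inference():
    # Placeholder workload standing in for one forward pass of a network.
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(f"{average_latency_ms(dummy_inference):.3f} ms")
```

In a real measurement, `dummy_inference` would be replaced by a call that runs one forward pass of the model under test.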
The VGG series models consist largely of unit-stride 3x3 convolution layers (operators), which the Winograd algorithm accelerates well. We therefore use VGG-16 as the reference model for evaluating the performance of our Winograd implementation.
Table 2: Average VGG-16 inference time (ms) on Arm-based CPUs.
Devices\Cores | 1 | 2 | 4 | 8 | GPU |
---|---|---|---|---|---|
Galaxy S8 | 925 | 630 | 489 | ||
iPhone 7 | 374 | 284 | |||
Huawei D05 | 755 | 399 | 226 | 149 | |
Phytium FT1500A | 1769 | 1020 | 639 | 444 | |
RK3399 | 1673 | 1420 | |||
TensorFlow Lite on iPhone 7 | 905 | ||||
ACL on RK3399 | 4103 | 1645 | |||
TVM on RK3399 | - | 1293 | |||
Intel Movidius* | 812 |
\* Intel Movidius operates at FP16 precision.
We conduct layer-by-layer benchmarks on the Samsung S8 and iPhone 7, with NNPACK as the comparison. We benchmark three FeatherCNN methods (im2col GEMM, Winograd F(2x2,3x3), and Winograd F(6x6,3x3)) alongside NNPACK's F(6x6,3x3).
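For readers unfamiliar with the Winograd variants named above, the sketch below shows the one-dimensional building block F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. F(2x2,3x3) nests this transform along both image dimensions. This is a generic textbook illustration in plain Python, not FeatherCNN's implementation; the function names `winograd_f23` and `direct_f23` are hypothetical.

```python
def winograd_f23(d, g):
    """Compute 2 outputs of a 3-tap filter over a 4-element tile with 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g

    # Filter transform: depends only on the weights, so it can be precomputed.
    u = (g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2)

    # Input transform: additions and subtractions only.
    v = (d0 - d2, d1 + d2, d2 - d1, d1 - d3)

    # The four element-wise multiplications.
    m = [ui * vi for ui, vi in zip(u, v)]

    # Output transform.
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


def direct_f23(d, g):
    """Reference result: sliding dot product (correlation, as used in CNN layers)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


if __name__ == "__main__":
    tile, taps = [1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0]
    print(winograd_f23(tile, taps), direct_f23(tile, taps))
```

The saving grows with tile size, which is why F(6x6,3x3) appears as a separate method, at the cost of more transform work and reduced numerical headroom.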
To evaluate the scalability of state-of-the-art CNN inference tools, we use the Huawei D05 server, a domestically made many-core Arm server with 64 Cortex-A72 cores. All 64 cores are interconnected by a token-ring network.
Network | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
---|---|---|---|---|---|---|---|
VGG-16 | 1333 | 697 | 385 | 218 | 157 | 117 | 102 |
GoogleNet | 333 | 210 | 154 | 125 | 126 | 151 | 230 |
ResNet-50 | 573 | 356 | 187 | 117 | 104 | 65 | 194 |
SqueezeNet | 149 | 79 | 44 | 28 | 29 | 35 | 67 |
MobileNet | 124 | 70 | 42 | 36 | 34 | 52 | 76 |
DenseNet-121 | 517 | 273 | 156 | 98 | 113 | 160 | 331 |
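To read the table above in terms of scalability, it helps to convert times into speedup and parallel efficiency. The sketch below assumes the column headers are thread counts and the cells are inference times in milliseconds; the text does not state either explicitly. `scaling_summary` is a hypothetical helper introduced for illustration.

```python
THREADS = [1, 2, 4, 8, 16, 32, 64]

# Times copied from the table above (assumed to be milliseconds per inference).
TIMES_MS = {
    "VGG-16": [1333, 697, 385, 218, 157, 117, 102],
    "GoogleNet": [333, 210, 154, 125, 126, 151, 230],
    "ResNet-50": [573, 356, 187, 117, 104, 65, 194],
    "SqueezeNet": [149, 79, 44, 28, 29, 35, 67],
    "MobileNet": [124, 70, 42, 36, 34, 52, 76],
    "DenseNet-121": [517, 273, 156, 98, 113, 160, 331],
}


def scaling_summary(times_ms, threads=THREADS):
    """Summarize the best-performing thread count relative to the single-thread time."""
    best_idx = min(range(len(times_ms)), key=lambda i: times_ms[i])
    speedup = times_ms[0] / times_ms[best_idx]
    return {
        "best_threads": threads[best_idx],
        "best_speedup": speedup,
        # Efficiency of 1.0 would mean perfect linear scaling.
        "efficiency": speedup / threads[best_idx],
    }


if __name__ == "__main__":
    for name, times in TIMES_MS.items():
        s = scaling_summary(times)
        print(f"{name}: best at {s['best_threads']} threads, "
              f"{s['best_speedup']:.2f}x, efficiency {s['efficiency']:.2f}")
```

Under these assumptions, only VGG-16 keeps improving up to 64 threads; the other networks reach their best time at 8 to 32 threads and slow down beyond that.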
Network | 1 | 2 | 4 | 8 | 16 | 32 | 64 | FeatherCNN Speedup |
---|---|---|---|---|---|---|---|---|
VGG-16 | 3329 | 2227 | 1443 | 1108 | 1137 | 2109 | 3721 | 10.86 |
GoogleNet | 1028 | 929 | 861 | 831 | 822 | 848 | 857 | 13.7 |
Resnet-50 | 728 | 490 | 347 | 278 | 252 | 346 | 365 | 3.88 |
SqueezeNet | 190 | 127 | 92 | 76 | 74 | 84 | 92 | 1.68 |
MobileNet | 211 | 166 | 146 | 139 | 137 | 153 | 184 | 4.03 |
DenseNet-121 | 865 | 593 | 438 | 373 | 354 | 655 | 856 | 3.08 |
Network | 1 | 2 | 4 | 8 | 16 | 32 | 64 | speedup |
---|---|---|---|---|---|---|---|---|
VGG-16 | 3267 | 2173 | 1550 | 1310 | 1385 | 1323 | 1401 | 12.84 |
GoogleNet | 351 | 347 | 267 | 306 | 894 | 2422 | 3938 | 4.45 |
Resnet-50 | 869 | 549 | 374 | 262 | 149 | 355 | 724 | 2.29 |
SqueezeNet | 91 | 65 | 55 | 87 | 221 | 628 | 723 | 1.25 |
MobileNet | 174 | 139 | 110 | 90 | 110 | 171 | 592 | 2.65 |
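The speedup columns in the two comparison tables are not defined in the text. For VGG-16, ResNet-50, and MobileNet, the reported values are reproduced by dividing the comparison table's best time over all thread counts by the best time in the first table. The sketch below assumes that definition; it is an inference from the numbers, not something the source states, and `best_time_speedup` is a hypothetical helper.

```python
# Times from the first table (the reference results).
REFERENCE_MS = {
    "VGG-16": [1333, 697, 385, 218, 157, 117, 102],
    "ResNet-50": [573, 356, 187, 117, 104, 65, 194],
    "MobileNet": [124, 70, 42, 36, 34, 52, 76],
}

# Times from the second and third tables (the two comparison results).
COMPARISON_A_MS = {
    "VGG-16": [3329, 2227, 1443, 1108, 1137, 2109, 3721],
    "ResNet-50": [728, 490, 347, 278, 252, 346, 365],
    "MobileNet": [211, 166, 146, 139, 137, 153, 184],
}
COMPARISON_B_MS = {
    "VGG-16": [3267, 2173, 1550, 1310, 1385, 1323, 1401],
    "ResNet-50": [869, 549, 374, 262, 149, 355, 724],
    "MobileNet": [174, 139, 110, 90, 110, 171, 592],
}


def best_time_speedup(comparison_ms, reference_ms):
    """Assumed definition: best comparison time divided by best reference time."""
    return round(min(comparison_ms) / min(reference_ms), 2)


if __name__ == "__main__":
    for name in REFERENCE_MS:
        a = best_time_speedup(COMPARISON_A_MS[name], REFERENCE_MS[name])
        b = best_time_speedup(COMPARISON_B_MS[name], REFERENCE_MS[name])
        print(f"{name}: {a} (second table), {b} (third table)")
```

Comparing best-against-best lets each tool use its own optimal thread count, rather than forcing both to run at 64 threads where most of them slow down.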
ARM's big.LITTLE architecture pairs high-performance cores with energy-saving ones. To evaluate how well the scheduling algorithm and blocking strategies adapt to this architecture, we test on the RK3399, a widely used embedded development board with 2 big cores at 1.8GHz and 4 little cores at 1.4GHz.
Network | 1 big | 2 big | 1 little | 2 little | 4 little | all | Memory (MB) |
---|---|---|---|---|---|---|---|
[VGG16] | 2268 | 1620 | 6122 | 3422 | 2269 | 1932 | 904 |
[GoogleNet] | 416 | 250 | 927 | 524 | 333 | 294 | 168 |
[Resnet-50] | 857 | 517 | 1834 | 1009 | 671 | 555 | 466 |
[squeezenet] | 236 | 144 | 539 | 315 | 210 | 172 | 404 |
[mobilenet] | 242 | 137 | 487 | 271 | 165 | 153 | 176 |
[densenet-121] | 842 | 543 | 1854 | 1050 | 686 | 543 | 111 |
Network | 1 | 2 | 4 |
---|---|---|---|
[VGG16] | - | - | - |
[GoogleNet] | 1058 | 642 | 809 |
[Resnet-50] | 2107 | 1255 | 1540 |
[squeezenet] | 638 | 399 | 501 |
[mobilenet] | 451 | 275 | 206 |
[densenet-121] | 630 | 396 | 459 |
Network | 1 big | 2 big | 1 little | 2 little | 4 little | all |
---|---|---|---|---|---|---|
[VGG16] | 1325 | 706 | 2540 | 1507 | 1226 | 844 |
[GoogleNet] | 274 | 146 | 366 | 206 | 127 | 105 |
[Resnet-50] | 480 | 266 | 759 | 417 | 261 | 215 |
[squeezenet] | 88 | 115 | 73 | 61 | 204 | 153 |
[mobilenet] | 156 | 87 | 211 | 116 | 68 | 56 |
[densenet-121] | - | - | - | - | - | - |
Network | 1 big | 2 big | 1 little | 2 little | 4 little | all |
---|---|---|---|---|---|---|
[VGG16] | x | x | - | - | - | - |
[GoogleNet] | x | x | - | - | - | - |
[Resnet-50] | x | x | - | - | - | - |
[squeezenet] | x | x | x | - | - | - |
[mobilenet] | x | x | x | - | - | - |
[densenet-121] | - | - | - | - | - | - |