
Benchmarks

Haidong Lan edited this page Feb 6, 2019 · 39 revisions


Arm Architectures -- Cellphones, Servers and Embedded boards

Test bed settings

We evaluate performance with 6 widely used standard models: VGG-16, GoogleNet (Inception-V1), ResNet-50, MobileNet, SqueezeNet and DenseNet-121. Our test bed comprises 6 different ARM devices, including 2 cellphones, 2 server-class processors and 2 embedded development boards, whose specifications are listed in the following table. All results are averaged over 20 consecutive runs and reported in milliseconds.
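That measurement procedure (warm up, then average consecutive runs) can be sketched as a small timing harness. `benchmark_ms` and the warm-up count are illustrative assumptions, not FeatherCNN's actual benchmark driver:

```python
import time

def benchmark_ms(fn, warmup=3, runs=20):
    """Time fn: discard warm-up runs, then average `runs` consecutive runs in ms."""
    for _ in range(warmup):
        fn()  # warm caches and weight buffers before measuring
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1000.0

# Example: a dummy workload standing in for one inference pass
avg_ms = benchmark_ms(lambda: sum(i * i for i in range(10000)))
```

Averaging over consecutive runs, as done here, reports steady-state latency; the first (cold) runs are deliberately excluded.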

Table 1: Specifications

| Device | Processor | #CPUs @ Clock Speed | CPU Arch. | Memory | OS | SoC Power |
| --- | --- | --- | --- | --- | --- | --- |
| Samsung S8 | Snapdragon 835 | 4 @ 2.45GHz + 4 @ 1.90GHz | Kryo | 4GB | Android 7.0 | ~5W |
| iPhone 7 | A10 Fusion | 2 @ 2.34GHz + 2 @ 1.05GHz | Hurricane | 2GB | iOS 11.1 | ~5W |
| Huawei D05 | Hi1616 | 2 × 32 @ 2.40GHz | Cortex-A72 | 256GB | Ubuntu 16.04 | >100W |
| Phytium FT1500A/16 | FTC660 | 16 @ 1.50GHz | Earth | 64GB | Kylin 5.0 | 35W |
| RK3399 | RK3399 | 2 @ 1.80GHz + 4 @ 1.40GHz | Cortex-A72 | 2GB | Debian | 6.05W |
| Raspberry Pi 3 | Broadcom BCM2837 | 4 @ 1.20GHz | Cortex-A53 | 1GB | Ubuntu 16.04 | ~5W |

VGG-16 end-to-end benchmark

The VGG series models run through many unit-stride 3x3 convolution layers (operators) and can therefore be effectively accelerated by the Winograd algorithm. We use VGG-16 as the reference model to evaluate the performance of our Winograd implementation.
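The reason Winograd helps can be seen in one dimension: F(2,3) computes two outputs of a 3-tap filter with 4 multiplications instead of the 6 a direct computation needs. The 2D F(2x2,3x3) and F(6x6,3x3) kernels used here nest the same idea over tiles. A minimal sketch of the 1D case (illustrative only, not FeatherCNN's NEON implementation):

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter g over inputs d[0..3]
    using 4 multiplications (m1..m4) instead of 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Reference: two sliding 3-tap dot products (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
```

In a real convolution the filter transform `(g0+g1+g2)/2` terms are precomputed once per layer, so the per-tile cost is only the 4 multiplications.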

Table 2: Avg. inference time (ms) on Arm-based CPUs.

| Devices\Cores | 1 | 2 | 4 | 8 | GPU |
| --- | --- | --- | --- | --- | --- |
| Galaxy S8 | 925 | 630 | 489 | | |
| iPhone 7 | 374 | 284 | | | |
| Huawei D05 | 755 | 399 | 226 | 149 | |
| Phytium FT1500A | 1769 | 1020 | 639 | 444 | |
| RK3399 | 1673 | 1420 | | | |
| TensorFlow Lite on iPhone 7 | 905 | | | | |
| ACL on RK3399 | 4103 | | | | 1645 |
| TVM on RK3399 | - | | | | 1293 |
| Intel Movidius* | | | | | 812 |

\* Intel Movidius operates at FP16 precision.

VGG-16 layer-by-layer benchmark

We conduct layer-by-layer benchmarks on the Samsung S8 and iPhone 7, comparing against NNPACK. We benchmark three FeatherCNN methods, im2col GEMM, Winograd F(2x2,3x3) and Winograd F(6x6,3x3), as well as NNPACK's F(6x6,3x3).
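The im2col GEMM method lowers convolution to matrix multiplication: each kxk input patch is flattened into a row, so the whole layer becomes one GEMM against the flattened filters. A minimal single-channel sketch (helper names are illustrative, not FeatherCNN's API):

```python
def im2col(x, k):
    """Flatten each kxk patch of a 2D input into one row (stride 1, no padding)."""
    h, w = len(x), len(x[0])
    oh, ow = h - k + 1, w - k + 1
    rows = []
    for i in range(oh):
        for j in range(ow):
            rows.append([x[i + a][j + b] for a in range(k) for b in range(k)])
    return rows  # shape: (oh*ow) x (k*k)

def conv_via_gemm(x, kernel):
    """Convolution as a matrix-vector product over im2col patches."""
    k = len(kernel)
    flat = [kernel[a][b] for a in range(k) for b in range(k)]
    return [sum(p * w for p, w in zip(row, flat)) for row in im2col(x, k)]

# 4x4 input, 3x3 kernel that picks out the patch center
x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
center = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
out = conv_via_gemm(x, center)  # the four patch centers: [6, 7, 10, 11]
```

The trade-off measured in these benchmarks: im2col pays extra memory traffic for the patch matrix but reuses a highly tuned GEMM, while Winograd reduces the multiplication count directly.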

1. Huawei D05 Server (64 cores, dual socket)

To evaluate the scalability of state-of-the-art CNN inference tools, we use the Huawei D05 server, a domestically made many-core Arm server with 64 Cortex-A72 cores. All 64 cores are interconnected via a token-ring network.

1.1 FeatherCNN-F(2x2,3x3)

| Network | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VGG-16 | 1333 | 697 | 385 | 218 | 157 | 117 | 102 |
| GoogleNet | 333 | 210 | 154 | 125 | 126 | 151 | 230 |
| ResNet-50 | 573 | 356 | 187 | 117 | 104 | 65 | 194 |
| SqueezeNet | 149 | 79 | 44 | 28 | 29 | 35 | 67 |
| MobileNet | 124 | 70 | 42 | 36 | 34 | 52 | 76 |
| DenseNet-121 | 517 | 273 | 156 | 98 | 113 | 160 | 331 |
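Parallel efficiency, T1/(p·Tp), can be read off such a table and makes the scaling limits explicit; for example, the VGG-16 row above scales well to 8 cores but drops to roughly 20% efficiency at 64 cores. A quick illustrative computation (not part of the original page):

```python
def parallel_efficiency(t1_ms, p, tp_ms):
    """Scaling efficiency: speedup over p cores divided by p (1.0 = linear)."""
    return t1_ms / (p * tp_ms)

# FeatherCNN VGG-16 on Huawei D05: 1333 ms on 1 core, 218 ms on 8, 102 ms on 64
eff8 = parallel_efficiency(1333, 8, 218)    # ~0.76 at 8 cores
eff64 = parallel_efficiency(1333, 64, 102)  # ~0.20 at 64 cores
```

The falloff past 8-16 cores is consistent with cross-socket traffic on the token-ring interconnect dominating once per-core work gets small.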

1.2 Caffe + OpenBLAS

| Network | 1 | 2 | 4 | 8 | 16 | 32 | 64 | FeatherCNN Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VGG-16 | 3329 | 2227 | 1443 | 1108 | 1137 | 2109 | 3721 | 10.86 |
| GoogleNet | 1028 | 929 | 861 | 831 | 822 | 848 | 857 | 13.7 |
| ResNet-50 | 728 | 490 | 347 | 278 | 252 | 346 | 365 | 3.88 |
| SqueezeNet | 190 | 127 | 92 | 76 | 74 | 84 | 92 | 1.68 |
| MobileNet | 211 | 166 | 146 | 139 | 137 | 153 | 184 | 4.03 |
| DenseNet-121 | 865 | 593 | 438 | 373 | 354 | 655 | 856 | 3.08 |
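For several rows the speedup column appears to compare each tool's best time across core counts: for VGG-16, Caffe+OpenBLAS bottoms out at 1108 ms while FeatherCNN reaches 102 ms, reproducing the stated 10.86x (ResNet-50's 3.88x matches the same rule). A sketch of that calculation, under that assumption:

```python
def best_case_speedup(baseline_ms, feather_ms):
    """Ratio of each tool's best (minimum) latency across all core counts."""
    return min(baseline_ms) / min(feather_ms)

# VGG-16 rows from the two tables above (Caffe+OpenBLAS vs. FeatherCNN)
caffe_vgg = [3329, 2227, 1443, 1108, 1137, 2109, 3721]
feather_vgg = [1333, 697, 385, 218, 157, 117, 102]
speedup = best_case_speedup(caffe_vgg, feather_vgg)  # ~10.86
```

Note that the two minima occur at different core counts (8 vs. 64), so this is a best-configuration comparison rather than a same-core-count one.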

1.3 Caffe2 + Eigen

| Network | 1 | 2 | 4 | 8 | 16 | 32 | 64 | FeatherCNN Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VGG-16 | 3267 | 2173 | 1550 | 1310 | 1385 | 1323 | 1401 | 12.84 |
| GoogleNet | 351 | 347 | 267 | 306 | 894 | 2422 | 3938 | 4.45 |
| ResNet-50 | 869 | 549 | 374 | 262 | 149 | 355 | 724 | 2.29 |
| SqueezeNet | 91 | 65 | 55 | 87 | 221 | 628 | 723 | 1.25 |
| MobileNet | 174 | 139 | 110 | 90 | 110 | 171 | 592 | 2.65 |

2. RK3399 (2 big and 4 little cores, big.LITTLE architecture)

ARM's big.LITTLE architecture pairs high-performance cores with energy-efficient ones. To evaluate how well scheduling algorithms and blocking strategies adapt to this heterogeneity, we select the RK3399, a widely used embedded development board. The RK3399 has 2 big cores at 1.8 GHz and 4 little cores at 1.4 GHz.
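To measure the big and little clusters separately, benchmarks like these typically pin threads to one cluster via CPU affinity. A minimal sketch using Linux affinity; the RK3399 core numbering (cpu0-3 = little, cpu4-5 = big) and the helper names are assumptions, and `os.sched_setaffinity` is Linux-only:

```python
import os

# Assumed RK3399 layout: cpu0-3 = little (Cortex-A53), cpu4-5 = big (Cortex-A72)
LITTLE_CORES = {0, 1, 2, 3}
BIG_CORES = {4, 5}

def pick_affinity(cluster, n_threads):
    """Choose up to n_threads core ids from the requested cluster."""
    cores = BIG_CORES if cluster == "big" else LITTLE_CORES
    if n_threads > len(cores):
        raise ValueError("cluster has only %d cores" % len(cores))
    return set(sorted(cores)[:n_threads])

def pin_current_process(cluster, n_threads):
    """Restrict the current process to the chosen cores (Linux only)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, pick_affinity(cluster, n_threads))
```

With this, "2 big" means pinning to cpu4-5 while "4 little" means cpu0-3; the "all" column would use the union of both sets.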

2.1 FeatherCNN

| Network | 1 big | 2 big | 1 little | 2 little | 4 little | all | Memory (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VGG-16 | 2268 | 1620 | 6122 | 3422 | 2269 | 1932 | 904 |
| GoogleNet | 416 | 250 | 927 | 524 | 333 | 294 | 168 |
| ResNet-50 | 857 | 517 | 1834 | 1009 | 671 | 555 | 466 |
| SqueezeNet | 236 | 144 | 539 | 315 | 210 | 172 | 404 |
| MobileNet | 242 | 137 | 487 | 271 | 165 | 153 | 176 |
| DenseNet-121 | 842 | 543 | 1854 | 1050 | 686 | 543 | 111 |

2.2 Caffe + OpenBLAS

2.3 Caffe2 + Eigen

3. Raspberry Pi 3 (4 A53 cores)

3.1 FeatherCNN

| Network | 1 | 2 | 4 |
| --- | --- | --- | --- |
| VGG-16 | - | - | - |
| GoogleNet | 1058 | 642 | 809 |
| ResNet-50 | 2107 | 1255 | 1540 |
| SqueezeNet | 638 | 399 | 501 |
| MobileNet | 451 | 275 | 206 |
| DenseNet-121 | 630 | 396 | 459 |

3.2 Caffe + OpenBLAS

3.3 Caffe2 + Eigen

4. TX2 (2 big and 4 little cores, big.LITTLE architecture)

4.1 FeatherCNN

| Network | 1 big | 2 big | 1 little | 2 little | 4 little | all |
| --- | --- | --- | --- | --- | --- |
| VGG-16 | 1325 | 706 | 2540 | 1507 | 1226 | 844 |
| GoogleNet | 274 | 146 | 366 | 206 | 127 | 105 |
| ResNet-50 | 480 | 266 | 759 | 417 | 261 | 215 |
| SqueezeNet | 88 | 115 | 73 | 61 | 204 | 153 |
| MobileNet | 156 | 87 | 211 | 116 | 68 | 56 |
| DenseNet-121 | - | - | - | - | - | - |

4.2 Caffe2 + NNPACK

| Network | 1 big | 2 big | 1 little | 2 little | 4 little | all |
| --- | --- | --- | --- | --- | --- |
| VGG-16 | x | x | - | - | - | - |
| GoogleNet | x | x | - | - | - | - |
| ResNet-50 | x | x | - | - | - | - |
| SqueezeNet | x | x | x | - | - | - |
| MobileNet | x | x | x | - | - | - |
| DenseNet-121 | - | - | - | - | - | - |