Benchmarks

Arm Architectures -- Cellphones, Servers and Embedded boards

Test bed settings

We evaluate performance with six widely used standard models: VGG16, GoogleNet (Inception-V1), ResNet50, MobileNet, SqueezeNet and DenseNet-121. Our test bed consists of 6 different Arm devices: 2 cellphones, 2 server-class processors and 2 embedded dev boards, whose specifications are listed in the following table. All results are averaged over 20 consecutive runs and reported in milliseconds.

Table 1: Device specifications

Device	Processor	#CPUs @ Clock Speed	CPU Arch.	Memory	OS	SoC Power
Samsung S8	Snapdragon 835	4 @ 2.45GHz + 4 @ 1.90GHz	Kryo	4GB	Android 7.0	~5W
iPhone 7	A10 Fusion	2 @ 2.34GHz + 2 @ 1.05GHz	Hurricane	2GB	iOS 11.1	~5W
Huawei D05	Hi1616	2 * 32 @ 2.40GHz	Cortex-A72	256GB	Ubuntu 16.04	>100W
Phytium FT1500A/16	FTC660	16 @ 1.50GHz	Earth	64GB	Kylin 5.0	35W
RK3399	RK3399	2 @ 1.8GHz + 4 @ 1.40GHz	Cortex-A72	2GB	Debian	6.05W
Raspberry Pi3	Broadcom BCM2837	4 @ 1.2GHz	Cortex-A53	1GB	Ubuntu 16.04	~5W
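
The measurement protocol described in the test bed settings (20 consecutive runs, averaged, reported in milliseconds) can be sketched as follows. This is a minimal illustration and not the harness used to produce the numbers on this page; `average_latency_ms`, `run_once` and the injectable `clock` parameter are hypothetical names chosen for the example.

```python
import time
from typing import Callable


def average_latency_ms(
    run_once: Callable[[], None],
    runs: int = 20,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Average wall-clock time of `runs` consecutive calls, in milliseconds.

    `run_once` stands in for one forward pass of the network (hypothetical).
    `clock` must return seconds; it is injectable so the helper can be tested.
    """
    total_seconds = 0.0
    for _ in range(runs):
        start = clock()
        run_once()
        total_seconds += clock() - start
    return total_seconds / runs * 1000.0


if __name__ == "__main__":
    # Dummy workload in place of a real inference call.
    print(average_latency_ms(lambda: sum(range(100_000))))
```

Timing each run separately and summing keeps any per-run setup outside the measured interval; a warm-up run before the loop is a common addition when the first call pays one-time initialization costs.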

VGG-16 end-to-end benchmark

The VGG series models consist largely of unit-stride 3x3 convolution layers (operators), which the Winograd algorithm accelerates well. We use VGG-16 as the reference model to evaluate the performance of our Winograd implementation.

Table 2: Avg. inference time (ms) on Arm-based CPUs.

Devices\Cores	1	2	4	8	GPU
Galaxy S8	925	630	489
iPhone 7	374	284
Huawei D05	755	399	226	149
Phytium FT1500A	1769	1020	639	444
RK3399	1673	1420
TensorFlow lite on iPhone7		905
ACL on RK3399			4103		1645
TVM on RK3399			-		1293
Intel Movidius*			812

* Intel Movidius operates at FP16 precision.

VGG-16 layer-by-layer benchmark

We run layer-by-layer benchmarks on the Samsung S8 and iPhone 7, comparing FeatherCNN with NNPACK. We benchmark three FeatherCNN methods, img2col GEMM, Winograd F(2x2,3x3) and Winograd F(6x6,3x3), against NNPACK F(6x6,3x3).
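
For readers unfamiliar with the notation, F(2x2,3x3) computes a 2x2 output tile of a unit-stride 3x3 convolution from a 4x4 input tile using 16 element-wise multiplications instead of the 36 needed by the direct method. The sketch below uses the standard transform matrices of the Winograd minimal-filtering formulation and checks the result against direct convolution. It is plain Python for readability and is not FeatherCNN's implementation; all function names are hypothetical.

```python
# Standard transform matrices for Winograd F(2x2,3x3).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]  # input transform
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]     # kernel transform
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]                               # output transform


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def winograd_2x2_3x3(tile, kernel):
    """4x4 input tile and 3x3 kernel -> 2x2 output (unit stride, no padding)."""
    u = matmul(matmul(G, kernel), transpose(G))    # transformed kernel, 4x4
    v = matmul(matmul(BT, tile), transpose(BT))    # transformed input, 4x4
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]  # 16 multiplies
    return matmul(matmul(AT, m), transpose(AT))    # back to the 2x2 output


def direct_3x3(tile, kernel):
    """Reference: direct sliding-window computation of the same 2x2 output."""
    return [[sum(tile[i + p][j + q] * kernel[p][q]
                 for p in range(3) for q in range(3))
             for j in range(2)] for i in range(2)]
```

In a real layer the kernel transform is computed once per filter and reused for every tile, which is why the multiplication count is dominated by the element-wise product.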

1. Huawei D05 Server (64-core, dual sockets)

To evaluate the scalability of state-of-the-art CNN inference tools, we use the Huawei D05 server, a domestically made many-core Arm server with 64 Cortex-A72 cores. All 64 cores are interconnected by a token-ring network.

1.1 FeatherCNN-F(2x2,3x3)

Network	1	2	4	8	16	32	64
VGG-16	1333	697	385	218	157	117	102
GoogleNet	333	210	154	125	126	151	230
ResNet-50	573	356	187	117	104	65	194
SqueezeNet	149	79	44	28	29	35	67
MobileNet	124	70	42	36	34	52	76
DenseNet-121	517	273	156	98	113	160	331
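
One way to read the table above is in terms of speedup over the single-thread run and parallel efficiency (speedup divided by thread count). The sketch below derives both from the listed latencies; the data is copied from the table, while the helper names `scaling` and `best_thread_count` are hypothetical.

```python
# FeatherCNN F(2x2,3x3) latencies in ms on Huawei D05, copied from the table.
THREADS = [1, 2, 4, 8, 16, 32, 64]
LATENCY_MS = {
    "VGG-16": [1333, 697, 385, 218, 157, 117, 102],
    "GoogleNet": [333, 210, 154, 125, 126, 151, 230],
    "ResNet-50": [573, 356, 187, 117, 104, 65, 194],
    "SqueezeNet": [149, 79, 44, 28, 29, 35, 67],
    "MobileNet": [124, 70, 42, 36, 34, 52, 76],
    "DenseNet-121": [517, 273, 156, 98, 113, 160, 331],
}


def scaling(latencies, threads=THREADS):
    """Return (threads, speedup vs. 1 thread, parallel efficiency) per column."""
    base = latencies[0]
    return [(t, base / ms, base / ms / t) for t, ms in zip(threads, latencies)]


def best_thread_count(latencies, threads=THREADS):
    """Thread count with the lowest latency (smallest count wins ties)."""
    return min(zip(latencies, threads))[1]


if __name__ == "__main__":
    for name, row in LATENCY_MS.items():
        print(name, "fastest at", best_thread_count(row), "threads")
```

Only VGG-16 keeps improving up to 64 threads; the other networks reach their lowest latency between 8 and 32 threads and slow down beyond that.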

1.2 Caffe + OpenBLAS

Network	1	2	4	8	16	32	64	FeatherCNN Speedup
VGG-16	3329	2227	1443	1108	1137	2109	3721	10.86
GoogleNet	1028	929	861	831	822	848	857	13.7
Resnet-50	728	490	347	278	252	346	365	3.88
SqueezeNet	190	127	92	76	74	84	92	1.68
MobileNet	211	166	146	139	137	153	184	4.03
DenseNet-121	865	593	438	373	354	655	856	3.08

1.3 Caffe2 + Eigen

Network	1	2	4	8	16	32	64	FeatherCNN Speedup
VGG-16	3267	2173	1550	1310	1385	1323	1401	12.84
GoogleNet	351	347	267	306	894	2422	3938	4.45
Resnet-50	869	549	374	262	149	355	724	2.29
SqueezeNet	91	65	55	87	221	628	723	1.25
MobileNet	174	139	110	90	110	171	592	2.65
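
The tables do not state how the speedup column is derived. One reading that reproduces the listed values for VGG-16, ResNet-50 and MobileNet is the baseline's best latency divided by FeatherCNN's best latency, each taken at its own best thread count. The sketch below checks that reading against the numbers in sections 1.1 to 1.3. Other rows do not match it, so treat the formula as an assumption; the function name `best_over_best` is hypothetical.

```python
# Latencies in ms on Huawei D05 for 1, 2, 4, 8, 16, 32, 64 threads,
# copied from sections 1.1 (FeatherCNN), 1.2 (Caffe + OpenBLAS)
# and 1.3 (Caffe2 + Eigen).
FEATHERCNN = {
    "VGG-16": [1333, 697, 385, 218, 157, 117, 102],
    "ResNet-50": [573, 356, 187, 117, 104, 65, 194],
    "MobileNet": [124, 70, 42, 36, 34, 52, 76],
}
CAFFE_OPENBLAS = {
    "VGG-16": [3329, 2227, 1443, 1108, 1137, 2109, 3721],
    "ResNet-50": [728, 490, 347, 278, 252, 346, 365],
    "MobileNet": [211, 166, 146, 139, 137, 153, 184],
}
CAFFE2_EIGEN = {
    "VGG-16": [3267, 2173, 1550, 1310, 1385, 1323, 1401],
    "ResNet-50": [869, 549, 374, 262, 149, 355, 724],
    "MobileNet": [174, 139, 110, 90, 110, 171, 592],
}


def best_over_best(baseline_ms, feathercnn_ms):
    """Assumed definition: best baseline latency / best FeatherCNN latency."""
    return min(baseline_ms) / min(feathercnn_ms)


if __name__ == "__main__":
    for name in FEATHERCNN:
        print(
            name,
            round(best_over_best(CAFFE_OPENBLAS[name], FEATHERCNN[name]), 2),
            round(best_over_best(CAFFE2_EIGEN[name], FEATHERCNN[name]), 2),
        )
```

Comparing best against best credits each framework with its most favourable thread count, which avoids penalizing a baseline for poor scaling at 64 threads.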

2. RK3399 (2 big and 4 little cores, big.LITTLE architecture)

Arm's big.LITTLE architecture pairs fast big cores with power-efficient little cores to save energy. To evaluate how well the scheduling algorithm and blocking strategies adapt to this architecture, we test on the RK3399, a widely used embedded development board. The RK3399 has 2 big cores at 1.8GHz and 4 little cores at 1.4GHz.

2.1 FeatherCNN

Network	1 big	2 big	1 little	2 little	4 little	all	Memory (MB)
[VGG16]	2268	1620	6122	3422	2269	1932	904
[GoogleNet]	416	250	927	524	333	294	168
[Resnet-50]	857	517	1834	1009	671	555	466
[squeezenet]	236	144	539	315	210	172	404
[mobilenet]	242	137	487	271	165	153	176
[densenet-121]	842	543	1854	1050	686	543	111

2.2 Caffe + OpenBLAS

2.3 Caffe2 + Eigen

3. Raspberry Pi 3 (4 A53 cores)

3.1 FeatherCNN

Network	1	2	4
[VGG16]	-	-	-
[GoogleNet]	1058	642	809
[Resnet-50]	2107	1255	1540
[squeezenet]	638	399	501
[mobilenet]	451	275	206
[densenet-121]	630	396	459

3.2 Caffe + OpenBLAS

3.3 Caffe2 + Eigen

4. TX2 (2 big and 4 little cores, big.LITTLE architecture)

4.1 FeatherCNN

Network	1 big	2 big	1 little	2 little	4 little	all
[VGG16]	1325	706	2540	1507	1226	844
[GoogleNet]	274	146	366	206	127	105
[Resnet-50]	480	266	759	417	261	215
[squeezenet]	88	115	73	61	204	153
[mobilenet]	156	87	211	116	68	56
[densenet-121]	-	-	-	-	-	-

4.2 Caffe2 + NNPACK

Network	1 big	2 big	1 little	2 little	4 little	all
[VGG16]	x	x	-	-	-	-
[GoogleNet]	x	x	-	-	-	-
[Resnet-50]	x	x	-	-	-	-
[squeezenet]	x	x	x	-	-	-
[mobilenet]	x	x	x	-	-	-
[densenet-121]	-	-	-	-	-	-
