Use 16x unrolling for generic floating point squared euclidean distance computations #32183

vekterli · 2024-08-20T08:57:12Z

@baldersheim please review. Noticed this when looking at the generated ARM NEON machine code.

This lets the auto-vectorized code use all four 32-bit lanes of a 128-bit SIMD register instead of just two, which makes `float` roughly twice as fast as `double`. That is in line with expectations, since a 128-bit register holds four floats but only two doubles.

vekterli · 2024-08-20T09:18:20Z

After some experiments, a 16x unrolling factor seems to be optimal for both float and double. Further increases either have no effect or reduce performance.

Before:

double : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 589
float  : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 570
int8_t : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 82

After:

double : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 121
float  : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 81
int8_t : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 82

Use 4x unrolling for generic float squared euclidean distance computation

0d107ec

This lets the generated auto-vectorized code use all 4 lanes of a 128-bit SIMD register instead of just 2, which doubles performance compared to `double` (which is in line with what could be expected).

vekterli requested a review from baldersheim August 20, 2024 08:57

baldersheim previously approved these changes Aug 20, 2024

Use 16x unrolling for both float and double squared euclidean distance

d34c1f9

vekterli dismissed baldersheim’s stale review via d34c1f9 August 20, 2024 09:14

vekterli changed the title ~~Use 4x unrolling for generic float squared euclidean distance computation~~ Use 16x unrolling for generic floating point squared euclidean distance computations Aug 20, 2024

vekterli requested a review from baldersheim August 20, 2024 09:18

baldersheim merged commit 279c153 into master Aug 20, 2024
2 of 3 checks passed

baldersheim deleted the vekterli/4x-unroll-euclidean-distance branch August 20, 2024 09:24
