Use 16x unrolling for generic floating point squared euclidean distance computations #32183

Merged
merged 2 commits into from
Aug 20, 2024

vekterli

@baldersheim please review. Noticed this when looking at the generated ARM NEON machine code.

This lets the auto-vectorized code use all four 32-bit lanes of a 128-bit SIMD register instead of just two, which doubles `float` performance relative to `double` (in line with what would be expected).

…tation

This lets the generated auto-vectorized code use all 4 lanes of a
128-bit SIMD register instead of just 2, which doubles performance
compared to `double` (which is in line with what could be expected).
@vekterli vekterli requested a review from baldersheim August 20, 2024 08:57
baldersheim
baldersheim previously approved these changes Aug 20, 2024
@vekterli vekterli changed the title Use 4x unrolling for generic float squared euclidean distance computation Use 16x unrolling for generic floating point squared euclidean distance computations Aug 20, 2024
@vekterli

After some experiments, a 16x unrolling factor seems to be optimal for both `float` and `double`. Further increases either have no effect or reduce performance.

Before:

double : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 589
float  : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 570
int8_t : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 82

After:

double : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 121
float  : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 81
int8_t : sum=2610046000000.000000 of N=1000000 and vector length=1000 took 82
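For readers unfamiliar with the technique, the following is a minimal sketch of what a 16x unrolled squared Euclidean distance kernel can look like. It is not the code from this PR: the function name `squared_euclidean_distance`, the `Unroll` template parameter, and the choice to return `double` are all assumptions made for illustration. The idea is that keeping one independent accumulator per unrolled position, in the element type itself, gives the compiler a loop it can map onto SIMD lanes without having to reorder floating point additions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not the actual implementation.
// T is the element type (e.g. float or double), Unroll the unrolling factor.
template <typename T, std::size_t Unroll = 16>
double squared_euclidean_distance(const T* a, const T* b, std::size_t sz) {
    assert(sz == 0 || (a != nullptr && b != nullptr));

    // One independent partial sum per unrolled position, kept in type T so
    // that a 128-bit register can hold four float (or two double) lanes.
    T partial[Unroll] = {};

    std::size_t i = 0;
    for (; i + Unroll <= sz; i += Unroll) {
        // Fixed trip count: a candidate for auto-vectorization.
        for (std::size_t j = 0; j < Unroll; ++j) {
            const T d = a[i + j] - b[i + j];
            partial[j] += d * d;
        }
    }

    // Scalar tail for the remaining (sz % Unroll) elements.
    double sum = 0.0;
    for (; i < sz; ++i) {
        const T d = a[i] - b[i];
        sum += d * d;
    }

    // Combine the partial sums.
    for (std::size_t j = 0; j < Unroll; ++j) {
        sum += partial[j];
    }
    return sum;
}
```

Whether a given compiler actually vectorizes the inner loop depends on the target and optimization flags, so inspecting the generated machine code (as was done for this PR) remains the way to confirm it.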

@vekterli vekterli requested a review from baldersheim August 20, 2024 09:18
@baldersheim baldersheim merged commit 279c153 into master Aug 20, 2024
2 of 3 checks passed
@baldersheim baldersheim deleted the vekterli/4x-unroll-euclidean-distance branch August 20, 2024 09:24