Avx2: Introduce LutAvx2 #52

Merged: 1 commit into master from AndersTrier/avx2_lut, Oct 13, 2024

AndersTrier (Owner) commented on Oct 9, 2024

Introducing LutAvx2 avoids reloading the lookup table on every iteration of the inner loop. In my testing the speedup is 5-20%.

I had actually expected the compiler to take care of this...

Before (note that all the vbroadcasti128 instructions that load the lookup table are inside the inner loop):

$ objdump -C -d target/release/libreed_solomon_simd.rlib
[...]
0000000000000000 <reed_solomon_simd::engine::engine_avx2::Avx2::fft_private_avx2>:
[...]
     570:       c4 c1 7e 6f 4c 38 e0    vmovdqu -0x20(%r8,%rdi,1),%ymm1
     577:       c4 c1 7e 6f 14 38       vmovdqu (%r8,%rdi,1),%ymm2
     57d:       c4 e2 7d 5a 1a          vbroadcasti128 (%rdx),%ymm3
     582:       c5 f5 db e0             vpand  %ymm0,%ymm1,%ymm4
     586:       c4 e2 65 00 dc          vpshufb %ymm4,%ymm3,%ymm3
     58b:       c4 e2 7d 5a 6a 40       vbroadcasti128 0x40(%rdx),%ymm5
     591:       c4 e2 55 00 e4          vpshufb %ymm4,%ymm5,%ymm4
     596:       c5 d5 73 d1 04          vpsrlq $0x4,%ymm1,%ymm5
     59b:       c4 e2 7d 5a 72 10       vbroadcasti128 0x10(%rdx),%ymm6
     5a1:       c5 d5 db e8             vpand  %ymm0,%ymm5,%ymm5
     5a5:       c4 e2 4d 00 f5          vpshufb %ymm5,%ymm6,%ymm6
     5aa:       c4 e2 7d 5a 7a 50       vbroadcasti128 0x50(%rdx),%ymm7
     5b0:       c4 62 7d 5a 42 20       vbroadcasti128 0x20(%rdx),%ymm8
     5b6:       c4 e2 45 00 ed          vpshufb %ymm5,%ymm7,%ymm5
     5bb:       c5 ed db f8             vpand  %ymm0,%ymm2,%ymm7
     5bf:       c4 62 3d 00 c7          vpshufb %ymm7,%ymm8,%ymm8
     5c4:       c5 bd ef f6             vpxor  %ymm6,%ymm8,%ymm6
     5c8:       c4 62 7d 5a 42 60       vbroadcasti128 0x60(%rdx),%ymm8
     5ce:       c4 e2 3d 00 ff          vpshufb %ymm7,%ymm8,%ymm7
     5d3:       c5 d5 ef ef             vpxor  %ymm7,%ymm5,%ymm5
     5d7:       c5 c5 73 d2 04          vpsrlq $0x4,%ymm2,%ymm7
     5dc:       c4 62 7d 5a 42 30       vbroadcasti128 0x30(%rdx),%ymm8
     5e2:       c5 c5 db f8             vpand  %ymm0,%ymm7,%ymm7
     5e6:       c4 62 3d 00 c7          vpshufb %ymm7,%ymm8,%ymm8
     5eb:       c5 bd ef f6             vpxor  %ymm6,%ymm8,%ymm6
     5ef:       c4 62 7d 5a 42 70       vbroadcasti128 0x70(%rdx),%ymm8
     5f5:       c4 e2 3d 00 ff          vpshufb %ymm7,%ymm8,%ymm7
     5fa:       c5 d5 ef ef             vpxor  %ymm7,%ymm5,%ymm5
     5fe:       c5 e5 ef 5c 39 e0       vpxor  -0x20(%rcx,%rdi,1),%ymm3,%ymm3
     604:       c5 e5 ef de             vpxor  %ymm6,%ymm3,%ymm3
     608:       c5 dd ef 24 39          vpxor  (%rcx,%rdi,1),%ymm4,%ymm4
     60d:       c5 dd ef e5             vpxor  %ymm5,%ymm4,%ymm4
     611:       c5 fe 7f 5c 39 e0       vmovdqu %ymm3,-0x20(%rcx,%rdi,1)
     617:       c5 fe 7f 24 39          vmovdqu %ymm4,(%rcx,%rdi,1)
     61c:       c5 e5 ef c9             vpxor  %ymm1,%ymm3,%ymm1
     620:       c5 dd ef d2             vpxor  %ymm2,%ymm4,%ymm2
     624:       c4 c1 7e 7f 4c 38 e0    vmovdqu %ymm1,-0x20(%r8,%rdi,1)
     62b:       c4 c1 7e 7f 14 38       vmovdqu %ymm2,(%r8,%rdi,1)
     631:       48 83 c7 40             add    $0x40,%rdi
     635:       49 ff cb                dec    %r11
     638:       0f 85 32 ff ff ff       jne    570 <reed_solomon_simd::engine::engine_avx2::Avx2::fft_private_avx2+0x570>
[...]

After (all the vbroadcasti128 loads now happen once, before the loop):

$ objdump -C -d target/release/libreed_solomon_simd.rlib
[...]
0000000000000000 <reed_solomon_simd::engine::engine_avx2::Avx2::fft_private_avx2>:
[...]
     575:       c4 e2 7d 5a 0f          vbroadcasti128 (%rdi),%ymm1
     57a:       c4 e2 7d 5a 57 40       vbroadcasti128 0x40(%rdi),%ymm2
     580:       c4 e2 7d 5a 5f 10       vbroadcasti128 0x10(%rdi),%ymm3
     586:       c4 e2 7d 5a 67 50       vbroadcasti128 0x50(%rdi),%ymm4
     58c:       c4 e2 7d 5a 6f 20       vbroadcasti128 0x20(%rdi),%ymm5
     592:       c4 e2 7d 5a 77 60       vbroadcasti128 0x60(%rdi),%ymm6
     598:       c4 e2 7d 5a 7f 30       vbroadcasti128 0x30(%rdi),%ymm7
     59e:       c4 62 7d 5a 47 70       vbroadcasti128 0x70(%rdi),%ymm8
     5a4:       31 ff                   xor    %edi,%edi
     5a6:       49 89 c3                mov    %rax,%r11
     5a9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
     5b0:       c4 41 7e 6f 4c 39 e0    vmovdqu -0x20(%r9,%rdi,1),%ymm9
     5b7:       c4 41 7e 6f 14 39       vmovdqu (%r9,%rdi,1),%ymm10
     5bd:       c4 c1 25 73 d1 04       vpsrlq $0x4,%ymm9,%ymm11
     5c3:       c5 25 db d8             vpand  %ymm0,%ymm11,%ymm11
     5c7:       c4 42 65 00 e3          vpshufb %ymm11,%ymm3,%ymm12
     5cc:       c4 42 5d 00 db          vpshufb %ymm11,%ymm4,%ymm11
     5d1:       c5 2d db e8             vpand  %ymm0,%ymm10,%ymm13
     5d5:       c4 42 55 00 f5          vpshufb %ymm13,%ymm5,%ymm14
     5da:       c4 41 1d ef e6          vpxor  %ymm14,%ymm12,%ymm12
     5df:       c4 42 4d 00 ed          vpshufb %ymm13,%ymm6,%ymm13
     5e4:       c4 41 25 ef dd          vpxor  %ymm13,%ymm11,%ymm11
     5e9:       c4 c1 15 73 d2 04       vpsrlq $0x4,%ymm10,%ymm13
     5ef:       c5 15 db e8             vpand  %ymm0,%ymm13,%ymm13
     5f3:       c4 42 45 00 f5          vpshufb %ymm13,%ymm7,%ymm14
     5f8:       c4 41 1d ef e6          vpxor  %ymm14,%ymm12,%ymm12
     5fd:       c5 35 db f0             vpand  %ymm0,%ymm9,%ymm14
     601:       c4 42 3d 00 ed          vpshufb %ymm13,%ymm8,%ymm13
     606:       c4 41 25 ef dd          vpxor  %ymm13,%ymm11,%ymm11
     60b:       c4 42 75 00 ee          vpshufb %ymm14,%ymm1,%ymm13
     610:       c5 15 ef 6c 3e e0       vpxor  -0x20(%rsi,%rdi,1),%ymm13,%ymm13
     616:       c4 41 15 ef e4          vpxor  %ymm12,%ymm13,%ymm12
     61b:       c4 42 6d 00 ee          vpshufb %ymm14,%ymm2,%ymm13
     620:       c5 15 ef 2c 3e          vpxor  (%rsi,%rdi,1),%ymm13,%ymm13
     625:       c4 41 15 ef db          vpxor  %ymm11,%ymm13,%ymm11
     62a:       c5 7e 7f 64 3e e0       vmovdqu %ymm12,-0x20(%rsi,%rdi,1)
     630:       c4 41 1d ef c9          vpxor  %ymm9,%ymm12,%ymm9
     635:       c5 7e 7f 1c 3e          vmovdqu %ymm11,(%rsi,%rdi,1)
     63a:       c4 41 25 ef d2          vpxor  %ymm10,%ymm11,%ymm10
     63f:       c4 41 7e 7f 4c 39 e0    vmovdqu %ymm9,-0x20(%r9,%rdi,1)
     646:       c4 41 7e 7f 14 39       vmovdqu %ymm10,(%r9,%rdi,1)
     64c:       48 83 c7 40             add    $0x40,%rdi
     650:       49 ff cb                dec    %r11
     653:       0f 85 57 ff ff ff       jne    5b0 <reed_solomon_simd::engine::engine_avx2::Avx2::fft_private_avx2+0x5b0>
[...]
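
The difference between the two listings can be sketched in Rust roughly as follows. This is a minimal illustration, not the crate's actual code: `NibbleLut`, `gf256_mul`, `apply_scalar`, `apply_avx2` and `apply` are hypothetical names, and the sketch uses two 16-byte tables over GF(2^8) rather than the eight tables visible in the disassembly. The point is only that the tables are broadcast into local `__m256i` values once, before the loop, so the loop body contains just the load, `vpshufb`, `vpxor` and store work.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Hypothetical two-table nibble LUT: `lo` is indexed by the low nibble of a
/// byte, `hi` by the high nibble. (Assumption: simplified for illustration.)
pub struct NibbleLut {
    pub lo: [u8; 16],
    pub hi: [u8; 16],
}

/// Shift-and-add multiplication in GF(2^8), reduction polynomial 0x11d.
fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut r = 0u8;
    while b != 0 {
        if b & 1 != 0 {
            r ^= a;
        }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry {
            a ^= 0x1d;
        }
        b >>= 1;
    }
    r
}

impl NibbleLut {
    /// Build the tables for "multiply every byte by the constant `c`".
    pub fn for_constant(c: u8) -> Self {
        let mut lut = NibbleLut { lo: [0; 16], hi: [0; 16] };
        for i in 0..16u8 {
            lut.lo[i as usize] = gf256_mul(c, i);
            lut.hi[i as usize] = gf256_mul(c, i << 4);
        }
        lut
    }
}

/// Scalar reference: one table lookup per nibble, XORed together.
pub fn apply_scalar(lut: &NibbleLut, data: &mut [u8]) {
    for b in data.iter_mut() {
        *b = lut.lo[(*b & 0x0f) as usize] ^ lut.hi[(*b >> 4) as usize];
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn apply_avx2(lut: &NibbleLut, data: &mut [u8]) {
    // Hoisted: broadcast each 16-byte table into both 128-bit lanes once,
    // before the loop, instead of once per iteration.
    let lo = _mm256_broadcastsi128_si256(_mm_loadu_si128(lut.lo.as_ptr() as *const __m128i));
    let hi = _mm256_broadcastsi128_si256(_mm_loadu_si128(lut.hi.as_ptr() as *const __m128i));
    let mask = _mm256_set1_epi8(0x0f);

    let mut chunks = data.chunks_exact_mut(32);
    for chunk in &mut chunks {
        let p = chunk.as_mut_ptr() as *mut __m256i;
        let x = _mm256_loadu_si256(p as *const __m256i);
        let lo_idx = _mm256_and_si256(x, mask);
        let hi_idx = _mm256_and_si256(_mm256_srli_epi64::<4>(x), mask);
        // vpshufb uses the nibbles as indices into the preloaded tables.
        let r = _mm256_xor_si256(
            _mm256_shuffle_epi8(lo, lo_idx),
            _mm256_shuffle_epi8(hi, hi_idx),
        );
        _mm256_storeu_si256(p, r);
    }
    // Tail shorter than 32 bytes: fall back to the scalar path.
    apply_scalar(lut, chunks.into_remainder());
}

/// Use AVX2 when the CPU supports it, otherwise the scalar reference.
pub fn apply(lut: &NibbleLut, data: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            unsafe { apply_avx2(lut, data) };
            return;
        }
    }
    apply_scalar(lut, data);
}
```

Whether the compiler hoists such loads on its own depends on what it can prove about aliasing between the table and the buffers being written, so holding the broadcast tables in explicit locals (or in a small struct passed by value) is one way to make the intent unambiguous.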

AndersTrier self-assigned this on Oct 9, 2024
AndersTrier force-pushed the AndersTrier/avx2_lut branch 4 times, most recently from 38c5eb3 to c3703a0, on October 13, 2024 19:56
This avoids reloading the lookup table on every iteration of the inner
loop.
AndersTrier merged commit c40748c into master on Oct 13, 2024
1 check passed
AndersTrier deleted the AndersTrier/avx2_lut branch on October 13, 2024 20:30
AndersTrier mentioned this pull request on Oct 14, 2024