perf(python): optimize pystr deserialize perf #2007

chaokunyang · 2025-01-14T05:49:27Z

What does this PR do?

This PR implements an optimized version of PyUnicode_FromUCS1/Fury_PyUnicode_FromUCS2 for faster performance by:

Replacing the max char check with a SIMD implementation
Casting the UCS2 array to a UCS1 array with SIMD

Related issues

Does this PR introduce any user-facing change?

Does this PR introduce any public API change?
Does this PR introduce any binary protocol compatibility change?

Benchmark

chaokunyang · 2025-01-15T15:22:04Z

cc @penguin-wwy @pandalee99 @theweipeng

pandalee99

This code is very efficient, very nice!

Maybe we can optimize the repetitive code.

  // Handle remaining elements
  for (; i < length; i++) {
    if (arr[i] > max_sse) {
      max_sse = arr[i];
    }
  }

It's just the way it's written. It's nothing serious.
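
One possible way to remove the repetition mentioned here is to factor the scalar remainder loop into a small shared helper that each SIMD variant calls after its vectorized loop. The sketch below is only an illustration of that idea; `max_of_tail` is a hypothetical name and does not appear in the PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the PR): scans arr[begin, length) and
// returns the larger of current_max and the largest element in that range.
// Each SIMD variant could call this once to finish the leftover elements.
template <typename T>
inline T max_of_tail(const T* arr, size_t begin, size_t length,
                     T current_max) {
  for (size_t i = begin; i < length; i++) {
    if (arr[i] > current_max) {
      current_max = arr[i];
    }
  }
  return current_max;
}
```

With such a helper, the quoted loop would shrink to something like `max_sse = max_of_tail(arr, i, length, max_sse);`, assuming `arr` and `max_sse` share the same element type.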

chaokunyang · 2025-01-15T16:34:08Z

python/pyfury/_util.pyx

        cdef const char * buf = <const char *>(self.c_buffer.get().data() + self.reader_index)
        self.reader_index += size
        cdef uint32_t encoding = header & <uint32_t>0b11
        if encoding == 0:
            # PyUnicode_FromASCII
-            return PyUnicode_DecodeLatin1(buf, size, "strict")
+            return <unicode>Fury_PyUnicode_FromUCS1(buf, size)
+            # return PyUnicode_DecodeLatin1(buf, size, "strict")


If I use PyUnicode_DecodeLatin1 directly here, it's faster on macOS, which is unexpected since my implementation uses SIMD. And if I invoke PyUnicode_DecodeLatin1 directly in PyUnicode_FromUCS1, it's slower too. @penguin-wwy do you have any ideas?
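
For readers unfamiliar with the dispatch in the hunk above: the decoder selects a code path from the low two bits of the string header (`header & 0b11`), and the diff shows that value 0 selects the one-byte (UCS1/Latin-1) path. The sketch below illustrates that split. Note that treating the remaining upper bits as the byte size is an assumption made for this example, not something the hunk shows, and `StringHeader` and `split_header` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical struct for illustration only.
struct StringHeader {
  uint32_t encoding;  // low two bits; 0 selects the UCS1 path in the diff
  uint64_t size;      // ASSUMPTION: upper bits hold the payload size in bytes
};

// Hypothetical helper: splits a header word into its two fields.
inline StringHeader split_header(uint64_t header) {
  return StringHeader{static_cast<uint32_t>(header & 0b11), header >> 2};
}
```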

chaokunyang added 2 commits January 13, 2025 00:33

add get uint16_t array max value util

fcb620c

add SMID copy uint16 array to uint8 array

f68dce4

chaokunyang requested a review from PragmaTwice as a code owner January 14, 2025 05:49

chaokunyang marked this pull request as draft January 14, 2025 05:49

pandalee99 self-requested a review January 14, 2025 15:26

chaokunyang added 11 commits January 15, 2025 01:01

skip avx for python wheel

eb7f7b8

enable avx for cpp test

84e0b0b

implement pyunicode library

9fd56f0

use pyunicode for python ucs1/2 string decoding

77fbec9

remove avx getMaxValue and copyValue

ec2c4d4

rename copyValue to copyArray

a0d74f1

add header and #pragma once

8e2a4b2

add cstdint include

d1d02e7

lint code

221a6f1

add #include <cassert>

4793946

remove array util inline

6f0a64b

chaokunyang force-pushed the optimize_pystr_deserialize_perf branch from 8ba4b1b to 6f0a64b Compare January 15, 2025 14:34

chaokunyang added 9 commits January 15, 2025 22:36

include <stdlib.h>

2ebfbc8

fix include

ad2f28a

add #pragma once

ea206d9

fix include

d2627fb

fix include

28aaf2c

fix include

e326271

add Python.h include

1ef388c

lint code

d4837ff

optimize include

8fe4de7

chaokunyang marked this pull request as ready for review January 15, 2025 15:19

remove comments

a940ba3

pandalee99 reviewed Jan 15, 2025

chaokunyang commented Jan 15, 2025
