# [External] [stdlib] Use SIMD to make `b64encode` 4.7x faster (#49573)
## Dependencies

The following PR should be merged first:

* #3397

## Description of the changes

`b64encode` is the function that encodes bytes to base 64. Base64 encoding is used massively across the industry, be it to write secrets as text or to send data across the internet. Since it is going to be called a lot, we should make sure it is fast. As such, this PR provides a new implementation of `b64encode` around 5 times faster than the current one.

This implementation was taken from the following papers:

Wojciech Muła, Daniel Lemire, Base64 encoding and decoding at almost the speed of a memory copy, Software: Practice and Experience 50 (2), 2020. https://arxiv.org/abs/1910.05109

Wojciech Muła, Daniel Lemire, Faster Base64 Encoding and Decoding using AVX2 Instructions, ACM Transactions on the Web 12 (3), 2018. https://arxiv.org/abs/1704.00605

Note that there are substantial differences between the papers and this implementation. There are two reasons for this:

* We want to avoid using assembly/LLVM intrinsics directly and instead use the functions provided by the stdlib.
* We want to keep the complexity low, so we don't write a slightly different algorithm for each SIMD size and each CPU architecture.

In a nutshell, we decide on a SIMD size, say 32. At each iteration, we load 32 bytes, reshuffle the first 24 bytes, and convert them to base 64, which expands them to 32 bytes; we then store those 32 bytes in the output buffer. The final iteration handles the last, incomplete chunk, where we must not load everything at once or we would read out of bounds. There we use partial loads, partial stores, and masking, but the main SIMD algorithm stays the same.

The reasons for the speedups are similar to the ones provided in #3401.

## API changes

The existing API is

```mojo
fn b64encode(str: String) -> String:
```

and has several limitations:

1) The input of the function is raw bytes; it doesn't have to represent text. Requiring a `String` forces the user to deal with null termination and whatever other invariants `String` imposes on its bytes.
2) It is not possible to write the produced bytes into an existing buffer.
3) It is hard to benchmark, as the signature implies that the function allocates memory on the heap.
4) It assumes the input value owns the underlying data, meaning the function cannot be used when the data is not owned. `Span` would be a better choice here.

This PR keeps the existing signature for backward compatibility and adds new overloads. The signatures are now:

```mojo
fn b64encode(input_bytes: List[UInt8, _], inout result: List[UInt8, _])
fn b64encode(input_bytes: List[UInt8, _]) -> String
fn b64encode(input_string: String) -> String
```

This could be improved further in future PRs: `Span` is currently not easy to use but would be the right fit for the input value, and we could eventually remove `fn b64encode(input_string: String) -> String`. Note that the Python API takes `bytes` as input and returns `bytes`.
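For illustration, here is a minimal sketch of how the new overloads might be called; the values and the buffer-reuse pattern are assumptions based only on the signatures above, not code from this PR:

```mojo
from base64 import b64encode

fn main():
    # Allocating overload: encode raw bytes and get a String back.
    var data = List[UInt8](104, 101, 108, 108, 111)  # the bytes of "hello"
    print(b64encode(data))  # aGVsbG8=

    # Buffer-reuse overload: write the encoded bytes into an existing
    # List, avoiding a fresh heap allocation on every call.
    var result = List[UInt8]()
    b64encode(data, result)
```

Passing an existing `result` buffer lets callers amortize allocations across repeated encodes, which is also what makes the function benchmarkable (limitations 2 and 3 above).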
## Benchmarking

Benchmarking is harder than usual here because the base function allocates memory. To keep the allocation out of the benchmark, we must first modify the original function to add the overloads described above.

With that done, on my system

```
WSL2, Windows 11
Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz
Base speed: 3.80 GHz
Sockets: 1
Cores: 8
Logical processors: 16
Virtualization: Enabled
L1 cache: 512 KB
L2 cache: 2.0 MB
L3 cache: 16.0 MB
```

we get around a 5x speedup. I don't provide the benchmark script here because it won't work out of the box (see the issue mentioned above), but if that's really necessary to get this merged, I'll provide the diff plus the benchmark script.

## Future work

As said before, this PR is not an exact re-implementation of the papers and the state-of-the-art implementation that comes with them, the [simdutf](https://github.com/simdutf/simdutf) library. This keeps the implementation simple and portable: it works on any CPU whose SIMD width is at least 4 bytes and at most 64 bytes. In future PRs, we could provide further speedups by using SIMD algorithms specific to each architecture, at the cost of greatly increasing the complexity of the code. I'll leave this decision to the maintainers.

We could also rewrite `b64decode` using SIMD; speedups are expected there too. That can be the topic of another PR.
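As a side note, to make the 3-bytes-to-4-characters expansion described above concrete, here is a hypothetical scalar sketch of what the SIMD reshuffle-and-lookup computes for each group; the actual PR does this with vector shuffles rather than a per-triplet helper, and `encode_triplet` is an invented name for illustration only:

```mojo
alias b64_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

fn encode_triplet(b0: UInt8, b1: UInt8, b2: UInt8, inout result: List[UInt8]):
    # Hypothetical scalar illustration; the PR vectorizes this step.
    var chars = b64_chars.unsafe_ptr()
    # Pack the three input bytes into one 24-bit group.
    var group = (
        (b0.cast[DType.uint32]() << 16)
        | (b1.cast[DType.uint32]() << 8)
        | b2.cast[DType.uint32]()
    )
    # Cut the group into four 6-bit indices, most significant first,
    # and look each one up in the base64 alphabet.
    result.append(chars[int((group >> 18) & 63)])
    result.append(chars[int((group >> 12) & 63)])
    result.append(chars[int((group >> 6) & 63)])
    result.append(chars[int(group & 63)])
```

This per-triplet view also explains the 24-in, 32-out shape of the SIMD loop: every 24 input bytes hold eight such groups, which become 32 output characters.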
Co-authored-by: Gabriel de Marmiesse <[email protected]>

Closes #3443

MODULAR_ORIG_COMMIT_REV_ID: 0cd01a091ba8cfdaac49dcf43280de22d9c8b299