From a472b25771165b4f7d99403639a74be1492e1b18 Mon Sep 17 00:00:00 2001 From: Ilya Grebnov Date: Sat, 20 Apr 2024 17:05:12 -0700 Subject: [PATCH] Detailed algorithm description --- Benchmarks.md | 49 ++++++++++++++++++++++++++ README.md | 96 ++++++++++++++++++++++++++++----------------------- 2 files changed, 101 insertions(+), 44 deletions(-) diff --git a/Benchmarks.md b/Benchmarks.md index 6492c1d..31e57ba 100644 --- a/Benchmarks.md +++ b/Benchmarks.md @@ -116,3 +116,52 @@ The times are the minimum of five runs measuring **single-threaded (ST)** and ** | fib41 | 267914296 | 8.390 sec ( 31.93 MB/s) | 37.397 sec ( 7.16 MB/s) |**+345.72%**| 3.794 sec ( 70.61 MB/s) | 37.158 sec ( 7.21 MB/s) |**+879.30%**| | rs.13 | 216747218 | 6.425 sec ( 33.74 MB/s) | 28.994 sec ( 7.48 MB/s) |**+351.30%**| 2.807 sec ( 77.21 MB/s) | 31.609 sec ( 6.86 MB/s) |**+1025.97%**| | tm29 | 268435456 | 8.931 sec ( 30.06 MB/s) | 31.267 sec ( 8.59 MB/s) |**+250.09%**| 3.729 sec ( 71.99 MB/s) | 31.384 sec ( 8.55 MB/s) |**+741.69%**| + + + +## Large pages and multi-core systems support + +Large-pages and OpenMP improves the libsais performance. Here is an example of such improvements on Manzini Corpus. + +| file | size | baseline| LP | LP w 2c | LP w 3c | LP w 4c | LP w 5c | LP w 6c | LP w 7c | LP w 8c | +|:---------------:|:---------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:| +| chr22.dna | 34553758 |43.50MB/s|50.18MB/s|61.20MB/s|73.66MB/s|78.91MB/s|81.20MB/s|81.49MB/s|81.52MB/s|80.42MB/s| +| etext99 | 105277340 |32.96MB/s|40.73MB/s|50.19MB/s|59.34MB/s|62.97MB/s|64.06MB/s|62.83MB/s|63.08MB/s|62.49MB/s| +| gcc-3.0.tar | 86630400 |44.32MB/s|50.13MB/s|58.51MB/s|68.85MB/s|73.82MB/s|75.76MB/s|76.14MB/s|75.85MB/s|75.24MB/s| +| howto | 39422105 |42.78MB/s|48.10MB/s|57.38MB/s|67.75MB/s|71.91MB/s|73.67MB/s|73.61MB/s|73.17MB/s|72.38MB/s| +| jdk13c | 69728899 |42.70MB/s|47.77MB/s|54.50MB/s|64.85MB/s|69.63MB/s|71.66MB/s|72.15MB/s|71.96MB/s|71.24MB/s| +| linux-2.4.5.tar | 116254720 |42.46MB/s|48.85MB/s|57.60MB/s|67.92MB/s|72.29MB/s|73.88MB/s|74.11MB/s|73.59MB/s|73.27MB/s| +| rctail96 | 114711151 |36.39MB/s|43.19MB/s|50.96MB/s|60.60MB/s|64.33MB/s|65.43MB/s|65.79MB/s|65.78MB/s|65.18MB/s| +| rfc | 116421901 |39.81MB/s|46.76MB/s|55.92MB/s|66.48MB/s|70.79MB/s|71.68MB/s|72.21MB/s|71.92MB/s|71.06MB/s| +| sprot34.dat | 109617186 |36.09MB/s|45.06MB/s|53.26MB/s|61.60MB/s|59.69MB/s|62.25MB/s|67.20MB/s|66.84MB/s|66.38MB/s| +| w3c2 | 104201579 |42.97MB/s|47.09MB/s|54.01MB/s|63.79MB/s|67.67MB/s|69.84MB/s|69.94MB/s|69.65MB/s|68.86MB/s| + +Note, multi-core scalability is limited by RAM bandwidth and adding more RAM channels improves performance: +![enwik9 BWT throughput in MB/s on Azure DS14 v2 (Intel Xeon Platinum 8171M)](Azure_enwik9_benchmark.png?raw=true "enwik9 BWT throughput in MB/s on Azure DS14 v2 (Intel Xeon Platinum 8171M)") + +## libsais64 for inputs larger than 2GB + +Starting from version 2.2.0 libsais64 could process inputs larger than 2GB. + +The times below are the minimum of five runs measuring **multi-threaded (MT)** performance of suffix array construction on Azure DS14 v2 (Intel Xeon Platinum 8171M). + +| file | size | libsais64 2.2.0 (MT) | divsufsort64 2.0.2 (MT) |speedup (MT)| +|:---------------:|:-----------:|:--------------------------:|:--------------------------:|:----------:| +| english | 2210395553 | 61.499 sec ( 34.28 MB/s) | 435.199 sec ( 4.84 MB/s) |**+607.65%**| +| GRCh38.p13.fa | 3321586957 | 84.068 sec ( 37.68 MB/s) | 782.938 sec ( 4.05 MB/s) |**+831.32%**| +| enwik10 | 10000000000 | 303.542 sec ( 31.42 MB/s) |1927.351 sec ( 4.95 MB/s) |**+534.95%**| + +## Additional memory + +The libsais reuses space allocated for suffix array during construction. Sometimes this free space is not sufficient for most optimal algorithm (this is uncommon) and libsais will need to fallback to less efficient one (libsais has 4 algorithms at different break-points point: 6k, 4k, 2k and 1k; where k is alphabet size). To improve performance for those cases you could allocating additional space at the end of suffix array. + +| file | size | libsais + O(n) (ST) | libsais + O(1) (ST) |speedup (ST)| libsais + O(n) (MT) | libsais + O(1) (ST) |speedup (MT)| +|:---------------:|:-----------:|:--------------------------:|:--------------------------:|:----------:|:--------------------------:|:--------------------------:|:----------:| +| osdb | 10085684 | 0.222 sec ( 45.52 MB/s) | 0.228 sec ( 44.20 MB/s) | **+2.97%**| 0.150 sec ( 67.30 MB/s) | 0.162 sec ( 62.25 MB/s) | **+8.11%**| +| x-ray | 8474240 | 0.190 sec ( 44.52 MB/s) | 0.217 sec ( 39.11 MB/s) | **+13.82%**| 0.122 sec ( 69.46 MB/s) | 0.156 sec ( 54.16 MB/s) | **+28.25%**| +| sao | 7251944 | 0.175 sec ( 41.48 MB/s) | 0.182 sec ( 39.75 MB/s) | **+4.37%**| 0.127 sec ( 57.26 MB/s) | 0.140 sec ( 51.87 MB/s) | **+10.39%**| +| ooffice | 6152192 | 0.113 sec ( 54.55 MB/s) | 0.117 sec ( 52.45 MB/s) | **+4.01%**| 0.081 sec ( 76.38 MB/s) | 0.088 sec ( 70.30 MB/s) | **+8.65%**| +| abac | 200000 | 0.002 sec ( 84.36 MB/s) | 0.003 sec ( 73.63 MB/s) | **+14.56%**| 0.002 sec ( 105.08 MB/s) | 0.002 sec ( 86.64 MB/s) | **+21.27%**| +| test3 | 2097088 | 0.034 sec ( 61.54 MB/s) | 0.037 sec ( 56.45 MB/s) | **+9.03%**| 0.028 sec ( 75.76 MB/s) | 0.032 sec ( 64.93 MB/s) | **+16.68%**| + +> * All other files from [Benchmarks](#benchmarks) above do not suffer from this fallbacks. \ No newline at end of file diff --git a/README.md b/README.md index 132e99b..924529d 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # libsais The libsais is a library for fast (see [Benchmarks](#benchmarks) below) linear time suffix array, longest common prefix array and Burrows-Wheeler transform construction based on induced sorting algorithm described in the following papers: -* Ge Nong, Sen Zhang, Wai Hong Chan *Two Efficient Algorithms for Linear Suffix Array Construction*, 2008 +* Ge Nong, Sen Zhang, Wai Hong Chan *Two Efficient Algorithms for Linear Suffix Array Construction*, 2009 * Juha Karkkainen, Giovanni Manzini, Simon J. Puglisi *Permuted Longest-Common-Prefix Array*, 2009 * Nataliya Timoshevskaya, Wu-chun Feng *SAIS-OPT: On the characterization and optimization of the SA-IS algorithm for suffix array construction*, 2014 * Jing Yi Xie, Ge Nong, Bin Lao, Wentao Xu *Scalable Suffix Sorting on a Multicore Machine*, 2020 @@ -61,7 +61,7 @@ The libsais is released under the [Apache License Version 2.0](LICENSE "Apache l * February 23, 2021 (1.0.0) * Initial release. -## Versions of the libsais library +## Versions of the libsais * [libsais.c](src/libsais.c) (and corresponding [libsais.h](include/libsais.h)) is for suffix array, PLCP, LCP, forward BWT and reverse BWT construction over 8-bit inputs smaller than 2GB (2147483648 bytes). * This version of the library could also be used to construct suffix array of an integer array (with a caveat that input array must be mutable). * [libsais64.c](src/libsais64.c) (and corresponding [libsais64.h](include/libsais64.h)) is optional extension of the library for inputs larger or equlas to 2GB (2147483648 bytes). @@ -122,7 +122,7 @@ The libsais is released under the [Apache License Version 2.0](LICENSE "Apache l CPMAddPackage( NAME libsais GITHUB_REPOSITORY IlyaGrebnov/libsais - GIT_TAG v2.7.3 + GIT_TAG v2.8.1 OPTIONS "LIBSAIS_USE_OPENMP OFF" "LIBSAIS_BUILD_SHARED_LIB OFF" @@ -131,55 +131,63 @@ CPMAddPackage( target_link_libraries( libsais) ``` ---- +# Algorithm description +The libsais uses the SA-IS (Suffix Array Induced Sorting) algorithm to construct both the suffix array and the Burrows-Wheeler transform through recursive decomposition and induced sorting: +* Initially, the algorithm classifies each position in a string as either an S-type or an L-type, based on whether the suffix starting at that position is lexicographically smaller or larger than the suffix at the adjacent right position. Positions identified as S-type, which have an adjacent left L-type position, are further categorized as LMS-type (Leftmost S-type) positions. Next, the algorithm splits the input string into LMS substrings, which start at an LMS-type position and extend up to the next adjacent LMS-type position. These LMS substrings are then lexicographically sorted through induced sorting and subsequently replaced in the input string with their corresponding sorted ranks, thus forming a new, compacted string. This compacted string reduces the problem size, enabling the algorithm to perform a recursive decomposition in which it is reapplied to construct the suffix array for the compacted string. And at the end of the recursive call, the suffix array for the input string is constructed from the suffix array of the compacted string using another round of induced sorting. +* The induced sorting is a core mechanic of the SA-IS algorithm and is employed twice during each recursive call: initially before the recursive call to establish the order of LMS substrings, and subsequently after the recursive call to finalize the order of the suffixes of the string. This process involves two sequential scans: a left-to-right scan that determines the order of L-type positions based on the LMS-type positions, followed by a right-to-left scan that establishes the order of S-type positions based on L-type positions. These scans efficiently extend the ordering from LMS-type positions to all positions in the string. -# Benchmarks - -Full list of benchmarks are moved to own [Benchmarks.md](Benchmarks.md) file. - -## Large pages and multi-core systems support - -Large-pages and OpenMP improves the libsais performance. Here is an example of such improvements on Manzini Corpus. +The SA-IS algorithm is quite elegant, yet implementing it efficiently presents multiple challenges. The primary challenge is that the SA-IS algorithm exhibits random memory access patterns, which can significantly decrease efficiency due to cache misses. Another significant challenge is that the SA-IS algorithm is not a lightweight construction algorithm; it requires additional memory to support positions classification, induced sorting, compacted string representations, and recursive decomposition. To circumvent this, the libsais implements careful optimizations that are worth highlighting: +* The libsais is meticulously designed from the ground up to leverage the capabilities of modern microprocessors, aiming to minimize various stalls and enhance throughput through instruction-level parallelism. The library employs sophisticated techniques such as manual loop unrolling, software prefetching, and branch elimination to achieve this goal. Moreover, it strives to minimize the number of passes over the data by combining multiple operations into a single function. A prime example of these techniques could be observed in the initialization phase of the SA-IS algorithm. In this phase, the entire logic required to classify positions, count symbols into various buckets, and segment the string into LMS substrings is executed through a single, completely branch-less loop: +```c + for (i = m - 1, j = omp_block_start + prefetch_distance + 3; i >= j; i -= 4) + { + libsais_prefetchr(&T[i - 2 * prefetch_distance]); -| file | size | baseline| LP | LP w 2c | LP w 3c | LP w 4c | LP w 5c | LP w 6c | LP w 7c | LP w 8c | -|:---------------:|:---------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:|:-------:| -| chr22.dna | 34553758 |43.50MB/s|50.18MB/s|61.20MB/s|73.66MB/s|78.91MB/s|81.20MB/s|81.49MB/s|81.52MB/s|80.42MB/s| -| etext99 | 105277340 |32.96MB/s|40.73MB/s|50.19MB/s|59.34MB/s|62.97MB/s|64.06MB/s|62.83MB/s|63.08MB/s|62.49MB/s| -| gcc-3.0.tar | 86630400 |44.32MB/s|50.13MB/s|58.51MB/s|68.85MB/s|73.82MB/s|75.76MB/s|76.14MB/s|75.85MB/s|75.24MB/s| -| howto | 39422105 |42.78MB/s|48.10MB/s|57.38MB/s|67.75MB/s|71.91MB/s|73.67MB/s|73.61MB/s|73.17MB/s|72.38MB/s| -| jdk13c | 69728899 |42.70MB/s|47.77MB/s|54.50MB/s|64.85MB/s|69.63MB/s|71.66MB/s|72.15MB/s|71.96MB/s|71.24MB/s| -| linux-2.4.5.tar | 116254720 |42.46MB/s|48.85MB/s|57.60MB/s|67.92MB/s|72.29MB/s|73.88MB/s|74.11MB/s|73.59MB/s|73.27MB/s| -| rctail96 | 114711151 |36.39MB/s|43.19MB/s|50.96MB/s|60.60MB/s|64.33MB/s|65.43MB/s|65.79MB/s|65.78MB/s|65.18MB/s| -| rfc | 116421901 |39.81MB/s|46.76MB/s|55.92MB/s|66.48MB/s|70.79MB/s|71.68MB/s|72.21MB/s|71.92MB/s|71.06MB/s| -| sprot34.dat | 109617186 |36.09MB/s|45.06MB/s|53.26MB/s|61.60MB/s|59.69MB/s|62.25MB/s|67.20MB/s|66.84MB/s|66.38MB/s| -| w3c2 | 104201579 |42.97MB/s|47.09MB/s|54.01MB/s|63.79MB/s|67.67MB/s|69.84MB/s|69.94MB/s|69.65MB/s|68.86MB/s| + libsais_prefetchw(&buckets[BUCKETS_INDEX4(T[i - prefetch_distance - 0], 0)]); + libsais_prefetchw(&buckets[BUCKETS_INDEX4(T[i - prefetch_distance - 1], 0)]); + libsais_prefetchw(&buckets[BUCKETS_INDEX4(T[i - prefetch_distance - 2], 0)]); + libsais_prefetchw(&buckets[BUCKETS_INDEX4(T[i - prefetch_distance - 3], 0)]); -Note, multi-core scalability is limited by RAM bandwidth and adding more RAM channels improves performance: -![enwik9 BWT throughput in MB/s on Azure DS14 v2 (Intel Xeon Platinum 8171M)](Azure_enwik9_benchmark.png?raw=true "enwik9 BWT throughput in MB/s on Azure DS14 v2 (Intel Xeon Platinum 8171M)") + c1 = T[i - 0]; s = (s << 1) + (fast_uint_t)(c1 > (c0 - (fast_sint_t)(s & 1))); SA[m] = (sa_sint_t)(i + 1); m -= ((s & 3) == 1); + buckets[BUCKETS_INDEX4((fast_uint_t)c0, s & 3)]++; -## libsais64 for inputs larger than 2GB + c0 = T[i - 1]; s = (s << 1) + (fast_uint_t)(c0 > (c1 - (fast_sint_t)(s & 1))); SA[m] = (sa_sint_t)(i - 0); m -= ((s & 3) == 1); + buckets[BUCKETS_INDEX4((fast_uint_t)c1, s & 3)]++; -Starting from version 2.2.0 libsais64 could process inputs larger than 2GB. + c1 = T[i - 2]; s = (s << 1) + (fast_uint_t)(c1 > (c0 - (fast_sint_t)(s & 1))); SA[m] = (sa_sint_t)(i - 1); m -= ((s & 3) == 1); + buckets[BUCKETS_INDEX4((fast_uint_t)c0, s & 3)]++; -The times below are the minimum of five runs measuring **multi-threaded (MT)** performance of suffix array construction on Azure DS14 v2 (Intel Xeon Platinum 8171M). + c0 = T[i - 3]; s = (s << 1) + (fast_uint_t)(c0 > (c1 - (fast_sint_t)(s & 1))); SA[m] = (sa_sint_t)(i - 2); m -= ((s & 3) == 1); + buckets[BUCKETS_INDEX4((fast_uint_t)c1, s & 3)]++; + } +``` +* To sort LMS substrings lexicographically and compute their ranks, the libsais algorithm begins by gathering LMS-type positions as they appear in the string, placing them at the end of the suffix array. The library then employs two passes of induced sorting, which concludes with these same LMS-type positions ordered lexicographically at the beginning of the suffix array. Once all LMS-type positions are sorted, the ranks of the LMS substrings are computed by inspecting each pair of adjacent positions to determine if the corresponding LMS substrings are identical. If they are the same, they receive the same rank; otherwise, the rank is incremented by one. + * The first challenge of induced sorting is that, during passes over the suffix array, we need to examine each value to determine if it represents a valid position or an empty space, whether this position is not the beginning of the string (and thus could induce another position), and if the induced position is going to be of the necessary type (for example, during a left-to-right scan, we are only inducing L-type positions). This process can cause branch mispredictions and corresponding microprocessor stalls. To address this challenge, libsais employs following techniques. Firstly, the library uses two pointers per induction bucket, each pointing to different sections of the suffix array depending on the type of positions these positions will be inducing next. This approach allows for the separation of LS-type (meaning S-type, which induces L-type; this is the same as LMS-type) and LL-type positions needed for the left-to-right scan from SL-type and SS-type positions needed for the right-to-left scan. Secondly, by understanding the distribution of symbols based on their position types and the types they induce (i.e., SS, SL, LS, LL), we can pre-calculate pointers for each bucket, leaving no empty spaces. And thirdly, by removing the first LMS position and all positions left of it from the initial gathering and distribution, we eliminate the need to check whether a position is not the beginning of the string. These techniques not only result in a completely branch-less loop for each induction sorting pass but also eliminate redundant scanning and the final gathering of LMS-type positions at the beginning of the suffix array. + * The second challenge arises after induced sorting when we need to compute the ranks of LMS substrings. To accomplish this, we must first calculate and store the lengths of LMS substrings and then inspect each pair of adjacent LMS-type positions to determine if the corresponding LMS substrings are identical. This comparison starts with their lengths, and if they are the same, proceeds to compare the substrings themselves. Such operations exhibits random memory access patterns, which can significantly decrease efficiency due to cache misses. However, libsais avoids this inefficient logic by incorporating the ranking of LMS substrings as part of the induced sorting process itself. The library achieves this by marking the most significant bit (MSB) of positions in the suffix array that start new ranking groups. Each time a position is processed during induced sorting, the library checks the MSB and increments the current rank if the beginning of a new ranking group is encountered. Additionally, for each pointer in an induction bucket, the rank of the previous induced position is maintained. Whenever another position is induced, this previous rank is used to determine whether to mark the newly induced position as the beginning of a new rank group. All the logic to update the ranks and mark the beginnings of new ranking groups is implemented using bit manipulation and is completely branch-less. +```c + for (i = omp_block_start, j = omp_block_start + omp_block_size - 2 * prefetch_distance - 1; i < j; i += 2) + { + libsais_prefetchr(&SA[i + 3 * prefetch_distance]); -| file | size | libsais64 2.2.0 (MT) | divsufsort64 2.0.2 (MT) |speedup (MT)| -|:---------------:|:-----------:|:--------------------------:|:--------------------------:|:----------:| -| english | 2210395553 | 61.499 sec ( 34.28 MB/s) | 435.199 sec ( 4.84 MB/s) |**+607.65%**| -| GRCh38.p13.fa | 3321586957 | 84.068 sec ( 37.68 MB/s) | 782.938 sec ( 4.05 MB/s) |**+831.32%**| -| enwik10 | 10000000000 | 303.542 sec ( 31.42 MB/s) |1927.351 sec ( 4.95 MB/s) |**+534.95%**| + libsais_prefetchr(&T[SA[i + 2 * prefetch_distance + 0] & SAINT_MAX] - 1); + libsais_prefetchr(&T[SA[i + 2 * prefetch_distance + 0] & SAINT_MAX] - 2); + libsais_prefetchr(&T[SA[i + 2 * prefetch_distance + 1] & SAINT_MAX] - 1); + libsais_prefetchr(&T[SA[i + 2 * prefetch_distance + 1] & SAINT_MAX] - 2); -## Additional memory + sa_sint_t p0 = SA[i + prefetch_distance + 0] & SAINT_MAX; sa_sint_t v0 = BUCKETS_INDEX4(T[p0 - (p0 > 0)], 0); libsais_prefetchw(&buckets[v0]); + sa_sint_t p1 = SA[i + prefetch_distance + 1] & SAINT_MAX; sa_sint_t v1 = BUCKETS_INDEX4(T[p1 - (p1 > 0)], 0); libsais_prefetchw(&buckets[v1]); -The libsais reuses space allocated for suffix array during construction. Sometimes this free space is not sufficient for most optimal algorithm (this is uncommon) and libsais will need to fallback to less efficient one (libsais has 4 algorithms at different break-points point: 6k, 4k, 2k and 1k; where k is alphabet size). To improve performance for those cases you could allocating additional space at the end of suffix array. + sa_sint_t p2 = SA[i + 0]; d += (p2 < 0); p2 &= SAINT_MAX; sa_sint_t v2 = BUCKETS_INDEX4(T[p2 - 1], T[p2 - 2] >= T[p2 - 1]); + SA[buckets[v2]++] = (p2 - 1) | ((sa_sint_t)(buckets[2 + v2] != d) << (SAINT_BIT - 1)); buckets[2 + v2] = d; -| file | size | libsais + O(n) (ST) | libsais + O(1) (ST) |speedup (ST)| libsais + O(n) (MT) | libsais + O(1) (ST) |speedup (MT)| -|:---------------:|:-----------:|:--------------------------:|:--------------------------:|:----------:|:--------------------------:|:--------------------------:|:----------:| -| osdb | 10085684 | 0.222 sec ( 45.52 MB/s) | 0.228 sec ( 44.20 MB/s) | **+2.97%**| 0.150 sec ( 67.30 MB/s) | 0.162 sec ( 62.25 MB/s) | **+8.11%**| -| x-ray | 8474240 | 0.190 sec ( 44.52 MB/s) | 0.217 sec ( 39.11 MB/s) | **+13.82%**| 0.122 sec ( 69.46 MB/s) | 0.156 sec ( 54.16 MB/s) | **+28.25%**| -| sao | 7251944 | 0.175 sec ( 41.48 MB/s) | 0.182 sec ( 39.75 MB/s) | **+4.37%**| 0.127 sec ( 57.26 MB/s) | 0.140 sec ( 51.87 MB/s) | **+10.39%**| -| ooffice | 6152192 | 0.113 sec ( 54.55 MB/s) | 0.117 sec ( 52.45 MB/s) | **+4.01%**| 0.081 sec ( 76.38 MB/s) | 0.088 sec ( 70.30 MB/s) | **+8.65%**| -| abac | 200000 | 0.002 sec ( 84.36 MB/s) | 0.003 sec ( 73.63 MB/s) | **+14.56%**| 0.002 sec ( 105.08 MB/s) | 0.002 sec ( 86.64 MB/s) | **+21.27%**| -| test3 | 2097088 | 0.034 sec ( 61.54 MB/s) | 0.037 sec ( 56.45 MB/s) | **+9.03%**| 0.028 sec ( 75.76 MB/s) | 0.032 sec ( 64.93 MB/s) | **+16.68%**| + sa_sint_t p3 = SA[i + 1]; d += (p3 < 0); p3 &= SAINT_MAX; sa_sint_t v3 = BUCKETS_INDEX4(T[p3 - 1], T[p3 - 2] >= T[p3 - 1]); + SA[buckets[v3]++] = (p3 - 1) | ((sa_sint_t)(buckets[2 + v3] != d) << (SAINT_BIT - 1)); buckets[2 + v3] = d; + } +``` +* In the SA-IS algorithm, after induced sorting, the ranks of LMS substrings are computed in suffix order. These ranks then need to be scattered to reorder them in string order before being gathered again to form the compacted string for recursion. At this point, some LMS substrings may be unique, meaning they don't share their rank with any other LMS substring. Being unique, these substrings are essentially already sorted, and their position relative to other LMS substrings is already determined. However, these unique LMS substrings may still be necessary for sorting other, non-unique LMS substrings during recursion—unless a unique LMS substring is immediately followed by another unique LMS substring in the string. In such cases, the rank of any subsequent unique LMS substrings becomes redundant in the compacted string, as it will not be utilized. Leveraging this insight, libsais employs a strategy to further reduce the size of the compacted string by omitting such redundant LMS substring ranks. This process involves a few steps. First, unique LMS substrings are identified by looking ahead while scanning LMS-positions in the suffix array during the ranking and scattering phase. When scattering LMS substring ranks to form the compacted string, the most significant bit (MSB) of the rank is used to mark that this rank is unique. Next, as the library scans the ranks in string order and detects tandems of unique ranks using the MSB, it then recalculates the MSB for ranks which are redundant, thus markign them for removal from the compacted string. Subsequently, the libsais rescans the LMS-positions in suffix order to recompute the ranks, now focusing only on the ranks of the remaining LMS substrings. The library also uses MSB of first symbol of LMS substrings to mark that LMS substring is removed from the compacted string. Finally, the library builds the compacted string based on the newly recalculated ranks for the remaining LMS substrings, while also saving the final positions for the removed LMS substrings before proceeding with recursion. This reduction process not only further decreases the size of the compacted string but also reduces the alphabet size of the reduced string and creates additional free space in the suffix array, which can be utilized during recursion. +* The SA-IS algorithm, while robust for suffix array construction, is not considered lightweight due to its need for additional memory for tasks such as position classification, induced sorting, the creation of compacted string representations, and recursive decomposition. To mitigate this, libsais optimizes memory usage by not storing position classifications and striving to reuse the memory space allocated for the suffix array for induced sorting, compacted string representations, and recursive decomposition processes. Since position classifications are not stored, the library recalculates them as needed, typically involving checks of adjacent symbols for a given position. Although this approach may seem straightforward, it introduces the challenge of random memory access. Nevertheless, libsais manages these accesses in a manner that either avoids unnecessary memory fetches or minimizes cache penalties. In situations where avoiding cache penalties is unfeasible, the library leverages the most significant bit (MSB) bits for computations, as branch mispredictions on modern microprocessors generally incur lower penalties than cache misses. Memory reuse for the suffix array, despite appearing straightforward, also presents hidden challenges related to implementation complexity. In certain cases, the available space in the suffix array may not suffice for the most optimal algorithm implementation mentioned above. Although such instances are rare, the library aims to deliver optimal performance without additional memory allocation by resorting to a less efficient variant of induced sorting. To accommodate various scenarios, libsais includes four distinct implementations tailored to different breakpoints based on alphabet size (denoted by 'k'): 6k, 4k, 2k, and 1k, with each implementation optimized to ensure performance efficiency. Extensive efforts have been dedicated to refining these implementations, including significant time invested in using various sanitizers to confirm the correctness of the algorithms. Ultimately, while there are specific inputs under which libsais might require additional memory—most of which tend to be synthetic tests designed specifically to challenge the SA-IS algorithm—such instances are relatively rare. In these exceptional cases, the library is designed to allocate only the minimum necessary amount of memory while still delivering the best possible performance. +* The libsais library, initially was developed for constructing suffix arrays, but has broadened its scope to include the calculation of the longest common prefix (LCP) and both the forward and inverse Burrows-Wheeler Transform (BWT) with considerable efforts has been dedicated to refining these algorithms to ensure they deliver maximum performance and maintain the correctness. An illustrative example is the forward BWT, which performance is nearly identical to that of its suffix array construction which is achieved by integrating a modified version of the induced sorting implementation within the final stage of the SA-IS algorithm. Rather than inducing suffix positions at this stage, the library induces the Burrows-Wheeler Transform directly. This approach also supports in-place transformation, maintaining a memory usage of 5n, making it an sutable for data compression applications. Similarly, the inverse BWT is fine-tuned to operate in-place, adhering to the same memory efficiency of 5n with an additional optimization of a bi-gram LF-mapping technique, which allows for the decoding of two symbols simultaneously effectively reduces the number of cache misses during the inversion of the Burrows-Wheeler Transform. + +# Benchmarks -> * All other files from [Benchmarks](#benchmarks) above do not suffer from this fallbacks. +Full list of benchmarks are moved to own [Benchmarks.md](Benchmarks.md) file.