Optimise initial bounds for matching peptides in Suffix Array #25

tibvdm · 2024-08-27T08:31:31Z

This PR introduces the "k-mer bounds table", a small lookup table that maps every short peptide sequence directly to its range in the suffix array. For longer peptides, it provides tighter initial values for the minimum and maximum bounds of the suffix array search.

See this detailed explanation for how this works exactly:

When searching for a peptide in the suffix array, we binary search over all suffixes of the proteins in UniProtKB. The 2024_04 release of UniProtKB contains more than 87 billion amino acids, so each bound requires log2(87 * 10^9) ≈ 36.34, rounded up to 37, suffix comparisons. Because we are looking for the range of suffixes that start with our peptide, two binary searches are required, one for the minimum bound and one for the maximum bound, resulting in a total of about 74 comparisons per peptide. These comparisons are computationally costly and should be minimized. To reduce the number of comparisons, we precompute and store the initial bounds for all possible peptides of length k and shorter. A query then starts with a constant-time lookup, followed by a much shorter binary search.
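The two binary searches described above can be sketched in Python on a toy text. This is a minimal illustration, not the code from this PR: the function names (`build_suffix_array`, `find_bound`, `search`, `comparisons_per_bound`) are hypothetical, the suffix array is built naively, and the maximum bound is assumed to be exclusive.

```python
import math


def build_suffix_array(text):
    # Naive construction by sorting all suffixes; only suitable for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_bound(text, sa, peptide, upper):
    """Binary search for the min bound (upper=False) or the exclusive
    max bound (upper=True) of the suffixes that start with `peptide`.
    Also returns the number of suffix comparisons performed."""
    lo, hi = 0, len(sa)
    comparisons = 0
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first len(peptide) characters of the suffix.
        prefix = text[sa[mid]:sa[mid] + len(peptide)]
        comparisons += 1
        if prefix < peptide or (upper and prefix == peptide):
            lo = mid + 1
        else:
            hi = mid
    return lo, comparisons


def search(text, sa, peptide):
    # Two independent binary searches, one per bound.
    min_bound, c_min = find_bound(text, sa, peptide, upper=False)
    max_bound, c_max = find_bound(text, sa, peptide, upper=True)
    return min_bound, max_bound, c_min + c_max


def comparisons_per_bound(n):
    # Number of halving steps needed to narrow n suffixes down to one.
    return math.ceil(math.log2(n))
```

For a text of 87 * 10^9 amino acids, `comparisons_per_bound` evaluates to 37, which matches the estimate in the paragraph above.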

For a k-mer table with parameter k, there are 20^k different k-mers. Conceptually, the table divides the space of 87 billion suffixes (rounded up to 2^37) into 20^k compartments, each of size 2^37 / 20^k on average. Taking the logarithm, log2(2^37 / 20^k) = 37 - k * log2(20), gives the number of suffix comparisons we still need to perform after consulting the k-mer bounds table. For k = 5, this is 37 - 5 * log2(20) ≈ 15.4, so we need only about 15 comparisons per bound instead of 37, resulting in significantly less work per search.
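As a rough illustration of the idea, the sketch below precomputes the bounds for every peptide of length k and shorter and then restricts the binary search to the looked-up range. It rests on several assumptions: the alphabet is taken to be the 20 standard amino acids, a plain dictionary stands in for whatever indexing scheme the PR actually uses, the maximum bound is exclusive, and all names (`prefix_bounds`, `build_kmer_table`, `search`, `remaining_comparisons`) are hypothetical.

```python
import itertools
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumption: the 20 standard amino acids


def build_suffix_array(text):
    # Naive construction by sorting all suffixes; only suitable for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def prefix_bounds(text, sa, peptide, lo, hi):
    """Half-open range within sa[lo:hi] of suffixes starting with `peptide`."""
    m = len(peptide)
    a, b = lo, hi
    while a < b:  # binary search for the minimum bound
        mid = (a + b) // 2
        if text[sa[mid]:sa[mid] + m] < peptide:
            a = mid + 1
        else:
            b = mid
    start, b = a, hi
    while a < b:  # binary search for the (exclusive) maximum bound
        mid = (a + b) // 2
        if text[sa[mid]:sa[mid] + m] <= peptide:
            a = mid + 1
        else:
            b = mid
    return start, a


def build_kmer_table(text, sa, k):
    # Precompute bounds for every peptide of length 1..k that occurs in the text.
    table = {}
    for length in range(1, k + 1):
        for kmer in map("".join, itertools.product(ALPHABET, repeat=length)):
            lo, hi = prefix_bounds(text, sa, kmer, 0, len(sa))
            if lo < hi:
                table[kmer] = (lo, hi)
    return table


def search(text, sa, table, k, peptide):
    if not peptide:
        return 0, len(sa)  # the empty peptide matches every suffix
    bounds = table.get(peptide[:k])
    if bounds is None:
        return 0, 0  # the prefix never occurs, so the peptide cannot match
    if len(peptide) <= k:
        return bounds  # short peptides are answered by the lookup alone
    return prefix_bounds(text, sa, peptide, *bounds)


def remaining_comparisons(k, total_bits=37):
    # Expected comparisons per bound left after the table lookup.
    return total_bits - k * math.log2(20)
```

With `k = 5`, `remaining_comparisons` gives roughly 15.4. In this sketch the full table would hold up to 20 + 20^2 + ... + 20^5 = 3,368,420 entries, which is why a small k already pays off.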

tibvdm added 26 commits August 26, 2024 11:51

first bad implementation for a kmer lookup cache

b95b9f9

Fix indexing issues

4b522b7

fix empty string and add TODO for IL equality

fc8551e

Set K to 5 for testing purposes

1047e43

Don't search for peptides where the k-mer has no matches

252d9c8

Try to get more hits

6d8c54e

Do more lookups, but try to avoid the largests bounds this way

f7c66a2

Don't search te entire search string again

6aab11b

Move while loop to the comment

32c8a83

remove shady lcp hack

deccd8a

Optimize the index functions

53c9490

Do not store parameter K as a constant, but a field.

54c5ef4

offset array for efficiency + renaming bounds_cache file

cdb434d

set L to I

031a609

fix tests

b6ff10d

add print information to track filling the cache

a26ce06

add timings to track filling the cache

2d0b2a4

min bound

ed2d746

add - to the alphabet

bfc358f

remove equate

428c63b

division by 0 fix

3db59a0

division by 0 fix

a1c3cdd

division by 0 fix

c3000ff

fix the amount of iterations to calculate the k-mers

99cc2a5

debug bounds

e87e0cf

fix the print code

352bcb7

pverscha changed the title ~~Feature/sa bounds optimization~~ Optimise initial bounds for matching peptides in Suffix Array Sep 19, 2024


tibvdm commented Aug 27, 2024 • edited by pverscha
