Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust LCA #32

Merged
merged 17 commits into from
Jan 10, 2024
Merged

Rust LCA #32

merged 17 commits into from
Jan 10, 2024

Conversation

stijndcl
Copy link
Contributor

Re-wrote the LCA Java tool in Rust, taking into account all comments on my previous PR.

@stijndcl stijndcl added the enhancement New feature or request label Nov 14, 2023
@stijndcl stijndcl self-assigned this Nov 14, 2023
@ninewise
Copy link

ninewise commented Nov 15, 2023 via email

@pdawyndt
Copy link
Member

pdawyndt commented Nov 15, 2023

@stijndcl: @BramDevlaminck has already used @ninewise's Rust implementations of the LCA (and alternatives such as LCA*) in his attempts for an improved Unipept index; feel free to check with him

@ninewise: FYI: @BramDevlaminck is implementing alternative index structures in Rust (suffix tree, (compressed/sparse) sufix array, (bi-directional) FM-index) to see if we could replace the current Unipept index such that i) non-tryptic peptides can be found (including tryptic peptides that suffer from miscleavages) and ii) peptides can be found through inexact matching instead of exact matching, without compromising speed (or even improving speed) for lookups and keeping memory within limits of our machines. We still precompute aggregations (currently LCA* as we did for umgap) at all internal nodes of the index structures for performance reasons. We already finished implementation and benchmarking of suffix trees for SwissProt, with extremly good (fast) results. Now we need to scale-up x500 in size (where memory will become the bottleneck), so this is why were moving on to the other more memory-friendly index structures.

@stijndcl
Copy link
Contributor Author

Out of curiosity: if you're porting this to rust as well, why don't just
call umgap joinkmers taxons.tsv (cutting out the right columns from
the TSV input)? I've replaced the java command with this before (to try
out non-LCA indexes) and it works quite well.

Because I didn't know that existed :) I'll give it a shot later this week and see if it outperforms mine.

@BramDevlaminck
Copy link

@pdawyndt I'm actually using LCA and not LCA*.
I discussed this with @pverscha.
The reason for this is that applying LCA* to the leaves vs applying it directly on the descendants gave a different result.

Here is a minimal example that shows where a difference occurs (this is part of the first version of my thesis):
image

@ninewise
Copy link

ninewise commented Nov 19, 2023 via email

@BramDevlaminck
Copy link

BramDevlaminck commented Nov 20, 2023

@ninewise

@BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was 9606 10566 9606, but then why are they structured in a tree like this? Or is this the tree of operations, something like lca*(lca*(9606, 10566), 9606)?

The original input are the values of the leaves, so indeed 9606, 10566 and 9606

Figure (a) performs the LCA* on the leaves of the subtree. So for the smaller left subtree it computes lca*(9606, 10566), and for the root of the tree lca*(9606, 10566, 9606).

Figure (b) performs the LCA* on the values on the tree of operations, so indeed lca*(lca*(9606, 10566), 9606) for the root, and lca*(9606, 10566) for the left subtree.

@pverscha
Copy link
Member

@ninewise @stijndcl I just also thought about the fact that the filter that's present in the current database construction pipeline should still be taken into account. I don't think that UMGAP follows the same exclusion rules as Unipept does at this point? I'm talking about the filtering that takes place in the validate function: https://github.com/unipept/unipept-database/blob/fb0c554bcb442bbb061d593e09106e053e3d3b46/scripts/helper_scripts/parser/src/taxons/TaxonList.java#L81C5-L116C1

@ninewise
Copy link

ninewise commented Nov 20, 2023 via email

@stijndcl
Copy link
Contributor Author

@ninewise I gave umgap a go and joinkmers needs 86 seconds whereas my implementation only needs 22. The difference is quite substantial so I'd propose we use mine instead.

I believe part of the difference could be that umgap operates on the taxons file, but my tool uses the lineages that we build in an earlier stage of the pipeline.

@ninewise
Copy link

ninewise commented Nov 27, 2023 via email

@stijndcl stijndcl merged commit ac980f2 into master Jan 10, 2024
5 checks passed
@stijndcl stijndcl deleted the feature/rust-lca branch April 30, 2024 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants