Rust LCA #32
Out of curiosity: if you're porting this to Rust as well, why not just
call `umgap joinkmers taxons.tsv` (`cut`ting out the right columns from
the TSV input)? I've replaced the Java command with this before (to try
out non-LCA indexes) and it works quite well.
@stijndcl: @BramDevlaminck has already used @ninewise's Rust implementations of the LCA (and alternatives such as LCA*) in his attempts at an improved Unipept index; feel free to check with him.

@ninewise: FYI: @BramDevlaminck is implementing alternative index structures in Rust (suffix tree, (compressed/sparse) suffix array, (bi-directional) FM-index) to see if we could replace the current Unipept index such that i) non-tryptic peptides can be found (including tryptic peptides that suffer from miscleavages) and ii) peptides can be found through inexact matching instead of exact matching, without compromising speed (or even improving speed) for lookups and keeping memory within the limits of our machines. We still precompute aggregations (currently LCA*, as we did for umgap) at all internal nodes of the index structures for performance reasons. We have already finished the implementation and benchmarking of suffix trees for SwissProt, with extremely good (fast) results. Now we need to scale up x500 in size (where memory will become the bottleneck), which is why we're moving on to the other, more memory-friendly index structures.
Because I didn't know that existed :) I'll give it a shot later this week and see if it outperforms mine.
@pdawyndt I'm actually using LCA and not LCA*. Here is a minimal example that shows where a difference occurs (this is part of the first version of my thesis):
Thanks for the FYI, @pdawyndt; I'll be asking for a copy to read through at the end of the year, if that's OK.
@BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was `9606 10566 9606`, but then why are they structured in a tree like this? Or is this the tree of operations, something like `lca*(lca*(9606, 10566), 9606)`?
Bram Devlaminck ***@***.***> wrote:
> @BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was `9606 10566 9606`, but then why are they structured in a tree like this? Or is this the tree of operations, something like `lca*(lca*(9606, 10566), 9606)`?
The original input are the values of the leaves, so indeed 9606, 10566 and 9606
Figure (a) performs the LCA* on the leaves of the subtree. So for the smaller left subtree it computes `lca*(9606, 10566)`, and for the root of the tree `lca*(9606, 10566, 9606)`.
Figure (b) performs the LCA* on the values on the tree of operations, so indeed `lca*(lca*(9606, 10566), 9606)` for the root, and `lca*(9606, 10566)` for the left subtree.
Right, makes more sense now. LCA* does indeed not work for a simple pairwise aggregation, unless you additionally store the "depth" of the last merge. You can check this in code [here](https://github.com/unipept/umgap/blob/master/src/rmq/lca.rs#L58-L89) or read the discussion of this algorithm in Tom's master thesis (I don't have a copy here, but Peter and Bart probably do).
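The difference between the two figures can be reproduced with a minimal sketch. Everything in it is an assumption made for illustration: the `Taxonomy` type, its toy three-node tree (9606 and 10566 hung directly under root 1, with all intermediate ranks omitted), and the reading of LCA* as "drop every input taxon that is an ancestor of another input taxon, then take the plain LCA". It is not the umgap implementation linked above.

```rust
use std::collections::HashMap;

type TaxonId = u32;

/// Hypothetical toy taxonomy stored as child -> parent; the root is its own parent.
struct Taxonomy {
    parent: HashMap<TaxonId, TaxonId>,
}

impl Taxonomy {
    /// Assumption: heavily simplified lineages, intermediate ranks omitted.
    fn toy() -> Self {
        let parent = HashMap::from([
            (1, 1),     // root
            (9606, 1),  // placed directly under the root for this sketch
            (10566, 1), // placed directly under the root for this sketch
        ]);
        Taxonomy { parent }
    }

    /// Path from the root down to `taxon`, inclusive.
    fn lineage(&self, taxon: TaxonId) -> Vec<TaxonId> {
        let mut path = vec![taxon];
        let mut current = taxon;
        while self.parent[&current] != current {
            current = self.parent[&current];
            path.push(current);
        }
        path.reverse();
        path
    }

    /// Plain LCA: the deepest taxon shared by all lineages.
    fn lca(&self, taxa: &[TaxonId]) -> Option<TaxonId> {
        let lineages: Vec<Vec<TaxonId>> = taxa.iter().map(|&t| self.lineage(t)).collect();
        let first = lineages.first()?;
        let mut result = None;
        for (depth, &candidate) in first.iter().enumerate() {
            if lineages.iter().all(|l| l.get(depth) == Some(&candidate)) {
                result = Some(candidate);
            } else {
                break;
            }
        }
        result
    }

    /// LCA* as assumed here: ignore taxa that are strict ancestors of
    /// another input taxon, then take the plain LCA of what remains.
    fn lca_star(&self, taxa: &[TaxonId]) -> Option<TaxonId> {
        let lineages: Vec<Vec<TaxonId>> = taxa.iter().map(|&t| self.lineage(t)).collect();
        let kept: Vec<TaxonId> = taxa
            .iter()
            .copied()
            .filter(|&t| !lineages.iter().any(|l| l.last() != Some(&t) && l.contains(&t)))
            .collect();
        self.lca(&kept)
    }
}
```

On this toy tree, `lca_star(&[9606, 10566, 9606])` gives the root (figure (a)), while feeding the intermediate result back in, `lca_star(&[lca_star(&[9606, 10566]).unwrap(), 9606])`, gives 9606 (figure (b)): the root produced by the first merge is an ancestor of 9606 and is therefore discarded. Plain `lca` does not have this problem, since merging with an ancestor always yields that ancestor.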
Pieter Verschaffelt ***@***.***> wrote:
@ninewise @stijndcl I just also thought about the fact that the filter that's present in the current database construction pipeline should still be taken into account. I don't think that UMGAP follows the same exclusion rules as Unipept does at this point? I'm talking about the filtering that takes place in the `validate` function: https://github.com/unipept/unipept-database/blob/fb0c554bcb442bbb061d593e09106e053e3d3b46/scripts/helper_scripts/parser/src/taxons/TaxonList.java#L81C5-L116C1
It should: it takes the taxons.tsv table as input and uses the "valid" column therein.
@ninewise I gave I believe part of the difference could be that
Well, just using the taxons file probably wouldn't be that much slower
(I think), but `joinkmers` also does a tree-based LCA* (actually
hybrid 95%) rather than a table-based LCA*, which apparently makes
quite a difference in performance. I definitely agree with going with
your implementation.
Rust Taxon Dump Parser
Re-wrote the LCA Java tool in Rust, taking into account all comments on my previous PR.