Rust LCA #32

stijndcl · 2023-11-14T08:45:44Z

Re-wrote the LCA Java tool in Rust, taking into account all comments on my previous PR.

ninewise · 2023-11-15T07:21:37Z

Out of curiosity: if you're porting this to Rust as well, why not just call `umgap joinkmers taxons.tsv` (`cut`ting out the right columns from the TSV input)? I've replaced the Java command with this before (to try out non-LCA indexes) and it works quite well.

pdawyndt · 2023-11-15T07:50:37Z

@stijndcl: @BramDevlaminck has already used @ninewise's Rust implementations of the LCA (and alternatives such as LCA*) in his attempts for an improved Unipept index; feel free to check with him

@ninewise: FYI: @BramDevlaminck is implementing alternative index structures in Rust (suffix tree, (compressed/sparse) suffix array, (bi-directional) FM-index) to see if we could replace the current Unipept index such that i) non-tryptic peptides can be found (including tryptic peptides that suffer from miscleavages) and ii) peptides can be found through inexact matching instead of exact matching, without compromising speed (or even improving speed) for lookups and keeping memory within the limits of our machines. We still precompute aggregations (currently LCA*, as we did for umgap) at all internal nodes of the index structures for performance reasons. We already finished the implementation and benchmarking of suffix trees for SwissProt, with extremely good (fast) results. Now we need to scale up x500 in size (where memory will become the bottleneck), which is why we're moving on to the other, more memory-friendly index structures.

stijndcl · 2023-11-15T08:18:59Z

> Out of curiosity: if you're porting this to Rust as well, why not just call `umgap joinkmers taxons.tsv` (`cut`ting out the right columns from the TSV input)? I've replaced the Java command with this before (to try out non-LCA indexes) and it works quite well.

Because I didn't know that existed :) I'll give it a shot later this week and see if it outperforms mine.

BramDevlaminck · 2023-11-15T09:04:12Z

@pdawyndt I'm actually using LCA and not LCA*.
I discussed this with @pverscha.
The reason for this is that applying LCA* to the leaves vs applying it directly on the descendants gave a different result.

Here is a minimal example that shows where a difference occurs (this is part of the first version of my thesis):

ninewise · 2023-11-19T22:25:32Z

Thanks for the FYI, @pdawyndt; I'll be asking for a copy to read through at the end of the year, if that's OK. @BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was `9606 10566 9606`, but then why are they structured in a tree like this? Or is this the tree of operations, something like `lca*(lca*(9606, 10566), 9606)`?

BramDevlaminck · 2023-11-20T09:42:13Z

@ninewise

> @BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was `9606 10566 9606`, but then why are they structured in a tree like this? Or is this the tree of operations, something like `lca*(lca*(9606, 10566), 9606)`?

The original input consists of the values of the leaves, so indeed 9606, 10566 and 9606.

Figure (a) performs the LCA* on the leaves of the subtree. So for the smaller left subtree it computes lca*(9606, 10566), and for the root of the tree lca*(9606, 10566, 9606).

Figure (b) performs the LCA* on the values on the tree of operations, so indeed lca*(lca*(9606, 10566), 9606) for the root, and lca*(9606, 10566) for the left subtree.
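The difference between figures (a) and (b) can be reproduced with a small sketch. It assumes a toy taxonomy in which 9606 and 10566 share only the root (taxon 1); the intermediate nodes 100 and 200, the `toy_parents` helper and the brute-force `lca_star` are made up for illustration and are not the code from this PR or from UMGAP. LCA* is taken here to mean the deepest taxon that is an ancestor or a descendant of every input.

```rust
use std::collections::HashMap;

type Taxon = u32;

/// Hypothetical toy taxonomy (child -> parent); taxon 1 is the root.
fn toy_parents() -> HashMap<Taxon, Taxon> {
    HashMap::from([(100, 1), (9606, 100), (200, 1), (10566, 200)])
}

/// Path from the root down to `taxon`, inclusive.
fn lineage(parents: &HashMap<Taxon, Taxon>, taxon: Taxon) -> Vec<Taxon> {
    let mut path = vec![taxon];
    let mut current = taxon;
    while let Some(&parent) = parents.get(&current) {
        path.push(parent);
        current = parent;
    }
    path.reverse();
    path
}

/// True if one taxon lies on the lineage of the other (or they are equal).
fn related(parents: &HashMap<Taxon, Taxon>, a: Taxon, b: Taxon) -> bool {
    lineage(parents, b).contains(&a) || lineage(parents, a).contains(&b)
}

/// Brute-force LCA*: the deepest candidate that is an ancestor or a
/// descendant of every input. Written for clarity, not for speed.
fn lca_star(parents: &HashMap<Taxon, Taxon>, taxa: &[Taxon]) -> Option<Taxon> {
    taxa.iter()
        .flat_map(|&t| lineage(parents, t))
        .filter(|&c| taxa.iter().all(|&t| related(parents, c, t)))
        .max_by_key(|&c| lineage(parents, c).len())
}
```

With this toy tree, `lca_star(&parents, &[9606, 10566, 9606])` yields `Some(1)`, as in figure (a). Computing `lca_star(&parents, &[9606, 10566])` first also gives `Some(1)`, but feeding that result back in as `lca_star(&parents, &[1, 9606])` yields `Some(9606)`, because the root is an ancestor of 9606 and is therefore ignored. That is the mismatch figure (b) runs into.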

pverscha · 2023-11-20T12:59:20Z

@ninewise @stijndcl I just also thought about the fact that the filter that's present in the current database construction pipeline should still be taken into account. I don't think that UMGAP follows the same exclusion rules as Unipept does at this point? I'm talking about the filtering that takes place in the validate function: https://github.com/unipept/unipept-database/blob/fb0c554bcb442bbb061d593e09106e053e3d3b46/scripts/helper_scripts/parser/src/taxons/TaxonList.java#L81C5-L116C1

ninewise · 2023-11-20T16:18:13Z

Bram Devlaminck wrote:

> > @BramDevlaminck I don't quite understand the image. What is the original input here? I'd assume it was `9606 10566 9606`, but then why are they structured in a tree like this? Or is this the tree of operations, something like `lca*(lca*(9606, 10566), 9606)`?
>
> The original input are the values of the leaves, so indeed 9606, 10566 and 9606
>
> Figure (a) performs the LCA* on the leaves of the subtree. So for the smaller left subtree it computes `lca*(9606, 10566)`, and for the root of the tree `lca*(9606, 10566, 9606)`.
>
> Figure (b) performs the LCA* on the values on the tree of operations, so indeed `lca*(lca*(9606, 10566), 9606)` for the root, and `lca*(9606, 10566)` for the left subtree.

Right, makes more sense now. LCA* does indeed not work for a simple pairwise aggregation, unless you additionally store the "depth" of the last merge. You can check this in code [here](https://github.com/unipept/umgap/blob/master/src/rmq/lca.rs#L58-L89) or read the discussion of this algorithm in Tom's master thesis (I don't have a copy here, but Peter and Bart probably do).

Pieter Verschaffelt wrote:

> @ninewise @stijndcl I just also thought about the fact that the filter that's present in the current database construction pipeline should still be taken into account. I don't think that UMGAP follows the same exclusion rules as Unipept does at this point? I'm talking about the filtering that takes place in the `validate` function: https://github.com/unipept/unipept-database/blob/fb0c554bcb442bbb061d593e09106e053e3d3b46/scripts/helper_scripts/parser/src/taxons/TaxonList.java#L81C5-L116C1

It should: it takes the taxons.tsv table as input and uses the "valid" column therein.
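On the earlier point about pairwise aggregation: one possible way to make it work is to carry an extra depth along with the taxon, so that a later merge cannot "forget" an earlier split. The sketch below is only an illustration of that idea under stated assumptions; it is not the linked UMGAP code. The `Agg` struct, the helper names and the toy taxonomy (9606 and 10566 sharing only root 1, with made-up intermediate nodes 100 and 200) are hypothetical, and all taxa are assumed to share a single root.

```rust
use std::collections::HashMap;

type Taxon = u32;

/// Pairwise aggregate: a representative taxon plus the deepest depth
/// (0 = root) at which all merged inputs still agree, if a split was seen.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Agg {
    taxon: Taxon,
    cap: Option<usize>,
}

/// Hypothetical toy taxonomy (child -> parent); taxon 1 is the root.
fn toy_parents() -> HashMap<Taxon, Taxon> {
    HashMap::from([(100, 1), (9606, 100), (200, 1), (10566, 200)])
}

/// Path from the root down to `taxon`, inclusive.
fn lineage(parents: &HashMap<Taxon, Taxon>, taxon: Taxon) -> Vec<Taxon> {
    let mut path = vec![taxon];
    let mut current = taxon;
    while let Some(&parent) = parents.get(&current) {
        path.push(parent);
        current = parent;
    }
    path.reverse();
    path
}

fn leaf(taxon: Taxon) -> Agg {
    Agg { taxon, cap: None }
}

fn merge(parents: &HashMap<Taxon, Taxon>, a: Agg, b: Agg) -> Agg {
    let (la, lb) = (lineage(parents, a.taxon), lineage(parents, b.taxon));
    let shared = la.iter().zip(&lb).take_while(|(x, y)| x == y).count();
    // Keep the shallowest split seen so far (None means "no split yet").
    let mut cap = match (a.cap, b.cap) {
        (Some(x), Some(y)) => Some(x.min(y)),
        (x, y) => x.or(y),
    };
    let taxon = if shared == la.len().min(lb.len()) {
        // One lineage is a prefix of the other: keep the deeper taxon.
        if la.len() >= lb.len() { a.taxon } else { b.taxon }
    } else {
        // The lineages diverge right below depth `shared - 1`.
        let split = shared.saturating_sub(1);
        cap = Some(cap.map_or(split, |c| c.min(split)));
        a.taxon
    };
    Agg { taxon, cap }
}

/// Turn an aggregate into the final taxon.
fn resolve(parents: &HashMap<Taxon, Taxon>, agg: Agg) -> Taxon {
    let path = lineage(parents, agg.taxon);
    let deepest = path.len() - 1;
    path[agg.cap.map_or(deepest, |c| c.min(deepest))]
}
```

In this sketch, merging `leaf(9606)` with `leaf(10566)` records a split at depth 0, and merging that aggregate with `leaf(9606)` again keeps the cap, so `resolve` returns 1 for the root of the operation tree instead of 9606.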


stijndcl · 2023-11-27T14:07:41Z

@ninewise I gave umgap a go and joinkmers needs 86 seconds whereas my implementation only needs 22. The difference is quite substantial so I'd propose we use mine instead.

I believe part of the difference could be that umgap operates on the taxons file, but my tool uses the lineages that we build in an earlier stage of the pipeline.

ninewise · 2023-11-27T22:53:38Z

Well, just using the taxons file probably wouldn't be that much slower (I think), but `joinkmers` also does a tree-based LCA* (actually hybrid 95%) rather than a table-based LCA*, which apparently makes quite a difference in performance. I definitely agree with going with your implementation.
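For readers unfamiliar with the distinction, a table-based LCA* can be sketched as a column scan over precomputed lineage rows. This is a minimal illustration, not the code in this PR: the row layout (one `Option<Taxon>` per rank, root first), the function name and the choice to skip ranks that are missing in every row are all assumptions.

```rust
type Taxon = u32;

/// Scan the lineage table rank by rank. Rows that have no taxon at a rank
/// are ignored for that rank; the scan stops at the first rank where two
/// rows disagree and returns the last taxon all present rows agreed on.
fn table_lca_star(rows: &[Vec<Option<Taxon>>]) -> Option<Taxon> {
    let width = rows.iter().map(Vec::len).max().unwrap_or(0);
    let mut result = None;
    for rank in 0..width {
        let mut present = rows
            .iter()
            .filter_map(|row| row.get(rank).copied().flatten());
        // No row has a taxon at this rank: skip it.
        let Some(first) = present.next() else { continue };
        if present.any(|t| t != first) {
            break;
        }
        result = Some(first);
    }
    result
}
```

Because every row is already a flat lineage, no tree traversal is needed per lookup; the work is a single pass over the columns.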

Rust Taxon Dump Parser

stijndcl added 8 commits November 14, 2023 09:34

First compiling version of rust LCAs

358481c

Linting

5f0523d

Update error handling

0f17c9d

Remove merge artifact

eeb1172

Fix error handling

0565ab3

Parsing and invalidating taxon dumps

adf56e1

First version of validation

c1b20e3

Bugfixes

4f68219

stijndcl added the enhancement New feature or request label Nov 14, 2023

stijndcl requested review from ninewise, rien, pverscha and tibvdm November 14, 2023 08:45

stijndcl self-assigned this Nov 14, 2023

stijndcl and others added 6 commits November 14, 2023 09:48

Fix linting

f4cbb7f

Fix linting

1bdd24f

Fix linting

77b9d7c

Merge branch 'feature/rust-lca' into feature/rust-taxons-lineages

9f77f9b

Formatting

a04f409

Formatting

ae56c71

rien reviewed Nov 22, 2023

scripts/helper_scripts/unipept-database-rs/src/calculate_lcas/taxonomy.rs (outdated, resolved)

rien reviewed Nov 22, 2023

stijndcl added 2 commits November 28, 2023 10:27

Apply suggestions by Rien

eeeb0e1

Revert change

0675be3

tibvdm approved these changes Dec 1, 2023

Merge pull request #33 from unipept/feature/rust-taxons-lineages

4984b88

Rust Taxon Dump Parser

stijndcl merged commit ac980f2 into master Jan 10, 2024
5 checks passed

stijndcl deleted the feature/rust-lca branch April 30, 2024 14:24