
[Question] Is it possible to use the VEBA Microeukaryotic database with MMSEQS2 taxonomy? #35

Open
franlat opened this issue Nov 16, 2023 · 19 comments


@franlat

franlat commented Nov 16, 2023

Hello, it's me again.

I wondered if the Microeukaryotic DB can be used with the MMSEQS2 taxonomy module. I am interested in using this database to assign taxonomy at the contig level rather than at the whole-genome level. However, MMSEQS2 requires taxdump-like information for the database. Since the Microeukaryotic DB has a taxonomy format similar to GTDB's, do you think it can be done? Perhaps you already have the nodes.dmp and names.dmp files?

@jolespin
Owner

Haha wow, I'm literally trying to solve this problem at this EXACT moment.

Ok, so maybe you can help me out with this, because what you mentioned is the exact use case I'm trying to build it for right now. First, do we need a taxdump file to build an MMSEQS2 database?

If so, I think I can use this for help: shenwei356/taxonkit#56 (comment)

If not, what do we need to build it?

@franlat
Author

franlat commented Dec 15, 2023

For the MMSEQS2 taxonomy database you need the taxdump files (names.dmp, nodes.dmp, delnodes.dmp & merged.dmp) plus a taxid.map mapping file.

A few weeks ago I tried creating the taxdump files with taxonkit's create-taxdump command (see the taxonkit manual).

I modified the humann_uniref50_annotations.tsv.gz file to match the input format the program requires, something like this:

e66f926af8418c8873d264785662420e        c__Acantharea   o__Arthracanthida       f__Acanthometridae      g__Amphilonche  s__Amphilonche elongata
38c6e6957eabc901ba3b9c708a5e2eee        c__Acantharea   o__Arthracanthida       f__Acanthometridae      g__Amphilonche  s__Amphilonche elongata
083441fc54d5eb9ecc0cd83490139b14        c__Acantharea   o__Arthracanthida       f__Phyllostauridae      g__Phyllostaurus        s__Phyllostaurus siculus
5df8df7ddea8c8316cfdfc02e9ceb988        c__Acantharea   o__Arthracanthida       f__Phyllostauridae      g__Phyllostaurus        s__Phyllostaurus siculu

It worked fine, but the manual says the order has to be fixed for it to work with MMSEQS2. When I tested some sequences against the Microeukaryotic taxonomy DB, I only obtained unassigned taxonomy for the contigs, which shouldn't be the case, because I am 100% sure some of my sequences are in the database. I think it has to do with the "ordering" step when fixing the taxonomy files, but I haven't tested it further; I had other projects to work on. I might come back to this if you are making progress and need help!

@jolespin
Owner

The humann_uniref50_annotations.tsv.gz file only has a subset of the data; you would need to use the source_taxonomy.tsv.gz file. What is the MMSEQS2 command you run once you have the taxdump files? I can look into this today and test it on a few sets.

@franlat
Author

franlat commented Dec 15, 2023

First, you create the database from the reference fasta: mmseqs createdb sequence.fasta sequenceDB

Then:
mmseqs createtaxdb sequenceDB tmp --ncbi-tax-dump ncbi-taxdump --tax-mapping-file taxidmapping

ncbi-taxdump and taxidmapping are the outputs from taxonkit (the taxdump directory and the taxid.map file). It did not raise any error for me, but since you say the file I used was only a subset, maybe that is where the problem is!
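Putting the two steps together, the whole build is roughly this (a sketch; the fasta name is a placeholder, and I'm assuming taxonkit wrote its output to a directory called taxdump that also contains taxid.map):

# reference fasta and output names are placeholders
mmseqs createdb MicroEuk.faa sequenceDB
mmseqs createtaxdb sequenceDB tmp \
    --ncbi-tax-dump taxdump \
    --tax-mapping-file taxdump/taxid.map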

@jolespin
Owner

jolespin commented Dec 16, 2023

@franlat
Excellent. This will be next on my to-do list on Monday.

e66f926af8418c8873d264785662420e        c__Acantharea   o__Arthracanthida       f__Acanthometridae      g__Amphilonche  s__Amphilonche elongata
38c6e6957eabc901ba3b9c708a5e2eee        c__Acantharea   o__Arthracanthida       f__Acanthometridae      g__Amphilonche  s__Amphilonche elongata
083441fc54d5eb9ecc0cd83490139b14        c__Acantharea   o__Arthracanthida       f__Phyllostauridae      g__Phyllostaurus        s__Phyllostaurus siculus
5df8df7ddea8c8316cfdfc02e9ceb988        c__Acantharea   o__Arthracanthida       f__Phyllostauridae      g__Phyllostaurus        s__Phyllostaurus siculu

When you used this file, was your taxonkit command the following?

cat humann_uniref50_annotations.tsv.gz \
    | gzip -d \
    | csvtk fix -t \
    | csvtk cut -t -f id_source,class,order,family,genus,species \
    | taxonkit create-taxdump -A 1 -O taxdump --rank-names class,order,family,genus,species

My understanding is that the above command would generate the ncbi-taxdump directory.

What is the format for the taxidmapping file?

@franlat
Author

franlat commented Dec 16, 2023

Yes, exactly like that! The taxidmapping file is the taxid.map that taxonkit creates. It looks like this:

e66f926af8418c8873d264785662420e        236192150
38c6e6957eabc901ba3b9c708a5e2eee        236192150
083441fc54d5eb9ecc0cd83490139b14        1443417319
5df8df7ddea8c8316cfdfc02e9ceb988        1443417319
84c8d9b87b512f3fabe68675d6b12bec        1443417319
413a5978d14b93c1b640172214bee751        1443417319
15e1cf336625fec63e2a77d06b107e11        1443417319
839345a945933f604310b98bd9b3aa35        1443417319
84f696d32ebc6504e7f34801e7b2174c        1296643208

@jolespin
Owner

For these:

e66f926af8418c8873d264785662420e        236192150
38c6e6957eabc901ba3b9c708a5e2eee        236192150
083441fc54d5eb9ecc0cd83490139b14        1443417319
5df8df7ddea8c8316cfdfc02e9ceb988        1443417319
84c8d9b87b512f3fabe68675d6b12bec        1443417319
413a5978d14b93c1b640172214bee751        1443417319
15e1cf336625fec63e2a77d06b107e11        1443417319
839345a945933f604310b98bd9b3aa35        1443417319
84f696d32ebc6504e7f34801e7b2174c        1296643208

Can the values in the second column be source identifiers instead of NCBI taxids? Unfortunately, not all of the organisms have an ncbi_taxon identifier, but they all have unique source identifiers.

@franlat
Author

franlat commented Dec 16, 2023

These are not NCBI taxids; they are IDs that taxonkit creates automatically, so you don't have to worry about it, I think.

@jolespin
Owner

Ok, I built it. Our servers are undergoing maintenance today, but I'll try testing it out today or tomorrow once they are available again.

@jolespin
Owner

Just a heads up: I've been able to build it, but I'm still trying to figure out exactly how many resources I need to run mmseqs easy-taxonomy, because it keeps failing even when I set --split-memory-limit to 75% of the memory I'm allocating. Circling back to this here and there.

@franlat
Author

franlat commented Jan 2, 2024

I usually run just 'mmseqs taxonomy'. Were you able to make it work?
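For reference, the modular route looks roughly like this (a sketch; DB names are placeholders, and the last two commands are just the standard MMseqs2 steps to turn the result into readable files):

# queryDB holds the contigs, sequenceDB is the taxonomy-aware reference DB
mmseqs createdb contigs.fasta queryDB
mmseqs taxonomy queryDB sequenceDB taxResultDB tmp
mmseqs createtsv queryDB taxResultDB contig_taxonomy.tsv
mmseqs taxonomyreport sequenceDB taxResultDB contig_taxonomy.report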

@jolespin
Owner

jolespin commented Jan 4, 2024

No, not yet. I even tried it with a much smaller subset but still got the same error. Here's the MMseqs2 thread: soedinglab/MMseqs2#779

@jolespin
Owner

There might be another option:

(VEBA-binning-eukaryotic_env) [jespinoz@login02 output]$ metaeuk taxtocontig -h | head
usage: metaeuk taxtocontig <i:contigsDB> <i:predictionsFasta> <i:predictionsFasta.headersMap.tsv> <i:taxAnnotTargetDb> <o:taxResult> <tmpDir> [options]
 By Eli Levy Karin <[email protected]>
options: prefilter:
 --comp-bias-corr INT             Correct for locally biased amino acid composition (range 0-1) [1]
 --add-self-matches BOOL          Artificially add entries of queries with themselves (for clustering) [0]
 --seed-sub-mat TWIN              Substitution matrix file for k-mer generation [nucl:nucleotide.out,aa:VTML80.out]
 -s FLOAT                         Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]
 -k INT                           k-mer length (0: automatically set to optimum) [0]
 --k-score INT                    k-mer threshold for generating similar k-mer lists [2147483647]
 --alph-size TWIN                 Alphabet size (range 2-21) [nucl:5,aa:21]
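
Following the usage line above, a call would look roughly like this (an untested sketch; the prediction fasta and headers map are the files a MetaEuk prediction run produces, and the extra flags are assumptions based on the MetaEuk docs):

# contigsDB = MMseqs2 DB of the assembled contigs, microeukTaxDB = taxonomy-annotated target DB
mmseqs createdb contigs.fasta contigsDB
metaeuk taxtocontig contigsDB \
    predictions.fas predictions.headersMap.tsv \
    microeukTaxDB taxResult tmp \
    --majority 0.5 --tax-lineage 1 --lca-mode 2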

@franlat
Author

franlat commented Feb 15, 2024

Hi! I'm glad to see you are still working on it; I am still very interested! I just checked metaeuk taxtocontig, but I thought it used MMseqs2-style databases, so wouldn't it run into the same problem when building the database?

I recently built an MMSEQS2 database for the new marine DB called MarFERReT. It is very similar to EukProt, but designed to be more specific at the population and species level. The advantage over EukProt is that it includes NCBI taxids, so building the taxonomy for MMseqs2 was much easier, because you can use the NCBI taxonomy files directly. Perhaps one solution is to give each entry in the VEBA Microeukaryotic database an NCBI taxid, although I'm not sure that is feasible, since it might be a lot of work.
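For comparison, with NCBI taxids the whole build reduces to something like this (a sketch; file names are placeholders, and the mapping file is just a two-column sequence ID to NCBI taxid table):

# standard NCBI taxdump + accession-to-taxid mapping
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir ncbi-taxdump && tar -xzf taxdump.tar.gz -C ncbi-taxdump
mmseqs createdb marferret_proteins.faa marferretDB
mmseqs createtaxdb marferretDB tmp \
    --ncbi-tax-dump ncbi-taxdump \
    --tax-mapping-file marferret.taxid.map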

@jolespin
Owner

I've thought about that, but the problem is that some of the source databases I'm pulling from unfortunately do not have NCBI taxids for their entries, so it was the classic strict vs. relaxed scenario. From my understanding, the way MetaEuk and MMseqs2 classify contigs differs a little in terms of the database that is used, but I'll definitely check it out.

In the meantime, here's the current source taxonomy table for MicroEuk100_v3:
source_taxonomy.tsv.gz

@franlat
Author

franlat commented Feb 24, 2024

I think I made it work? I used the source taxonomy file that you provided in the last comment and ran taxonkit to generate the taxdump files. The taxonkit manual says that you have to order it, but I didn't this time. I ran some tests and it turned out okay: many contigs were still unclassified, but some were assigned correctly to what I was expecting. The biggest issue was memory usage; for a 2.6 GB fasta file, MMseqs2 needed 180G of memory.

The files are too big for me to upload here, but I can try a WeTransfer if you need them. Here's what the file I gave taxonkit as input looks like:

df72f5b70f98518d320317439c41c753        Dothideomycetes Pleosporales    Pleosporaceae   Alternaria
ec750542edad93ce3f0770b338d8b761        Dothideomycetes Pleosporales    Pleosporaceae   Alternaria
336c7b7529ee5ddc5f30f79c7f2e79b9        Dothideomycetes Pleosporales    Pleosporaceae   Alternaria
ef3d0ea5c2258d0b5e467b53ea1078aa        Dothideomycetes Pleosporales    Pleosporaceae   Alternaria
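
The taxonkit call was essentially the same one you posted above, pointed at the full source taxonomy (a sketch; the column names are assumptions, so adjust -f and --rank-names to whatever ranks the file actually contains):

gzip -dc source_taxonomy.tsv.gz \
    | csvtk cut -t -f id_source,class,order,family,genus \
    | taxonkit create-taxdump -A 1 -O taxdump --rank-names class,order,family,genus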

@jolespin
Owner

jolespin commented Feb 26, 2024

This is a great development. One parameter you can use to get around the memory issue is --split-memory-limit:

 --split-memory-limit BYTE        Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

Some potential issues around this (that I think have been resolved): soedinglab/MMseqs2#338

I typically allocate 36G to the split memory limit, and when I request memory from the servers I ask for 48G in case it goes over a little bit.
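On a SLURM-style scheduler that looks something like this (a sketch; the scheduler flags depend on your cluster, and the DB names are placeholders):

# request slightly more memory from the scheduler than the MMseqs2 split limit
srun --mem=48G --cpus-per-task=16 \
    mmseqs taxonomy queryDB sequenceDB taxResultDB tmp \
    --split-memory-limit 36G --threads 16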

The taxonkit manual says that you have to order it, but I didn't this time.

How did you order it?

@franlat
Author

franlat commented Feb 26, 2024

The taxonkit manual says that MMseqs2 databases require the taxdump to be ordered, and it provides a script to do so, but the script doesn't work as intended. This time I built the database directly with the taxid mapping file that taxonkit creates. Indeed, I didn't use the whole source taxonomy last time, so that might have been the issue there. In any case, I will keep using this version and see whether anything looks strange.

@franlat
Author

franlat commented Jun 5, 2024

Hello! Lately I've been testing contig-level taxonomic assignment with MMseqs2 easy-taxonomy and the VEBA MicroEuk v3 database (built as I described in my previous comments). I've been comparing it with the MarFERReT DB using the same approach, and the results are very similar, so it works great. I guess I'll have to build it for the new database version!
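Roughly the command I've been running (a sketch; names and memory values are placeholders, and I believe easy-taxonomy writes the <prefix>_lca.tsv, <prefix>_report and <prefix>_tophit_report files that I then summarize):

# contig-level classification straight from the assembly fasta
mmseqs easy-taxonomy contigs.fasta microeukDB results/contigs tmp \
    --split-memory-limit 150G --tax-lineage 1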

I also compared running the eukaryotic classify environment (which is still using the first version of the database, not v3) against running mmseqs easy-taxonomy and summarizing the output manually. MMseqs2 turned out to be more accurate, but perhaps that has to do with the database versions being different. I guess this could also be used to assess "contamination", too.

I would like to know your insights on this matter!
