[Question] Is it possible to use the VEBA Microeukaryotic database with MMSEQS2 taxonomy? #35
Comments
Haha wow, I'm literally trying to solve this problem at this EXACT moment. Ok, so maybe you can help me out with this, because what you mentioned is the exact use case I'm trying to build it for right now. First, do we need a taxdump file to build an MMSEQS2 database? If so, I think I can use this for help: shenwei356/taxonkit#56 (comment). If not, what do we need to build it? |
For the taxonomy database of MMSEQS2 you need the taxdump files: names.dmp, nodes.dmp, taxid.map, delnodes.dmp & merged.dmp. A few weeks ago I tried creating the taxdump files using taxonkit's function: I modified the humann_uniref50_annotations.tsv.gz file to have the input format the program requires, something like this:
It worked fine, but the manual says that the order has to be fixed for it to work with MMSEQS2. When I tested some sequences against the Microeukaryotic taxonomy DB, I only got unassigned taxonomy for the contigs (which shouldn't be the case, because I am 100% sure that I have sequences that are in the database). I think it has to do with "ordering" the taxonomy files when doing the fixing, but I haven't tested it any further because I had other projects to work on. I might come back to this if you are making advances and need help! |
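One way to narrow down an "everything is unassigned" result is to check that every taxid referenced in taxid.map actually has a row in nodes.dmp. The sketch below is only an illustration, not something taken from this thread: it assumes the standard NCBI dump layout (fields separated by a tab, a pipe, and a tab) and a two-column, tab-separated taxid.map of accession and taxid. The function names and the sample rows are hypothetical.

```python
import io


def read_dmp_ids(handle):
    """Collect the first-column taxids from an NCBI-style .dmp file."""
    ids = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Rows look like "1<TAB>|<TAB>1<TAB>|<TAB>no rank<TAB>|"
        ids.add(line.split("\t|")[0].strip())
    return ids


def unmapped_accessions(taxid_map_handle, node_ids):
    """Return (accession, taxid) pairs whose taxid is absent from nodes.dmp."""
    missing = []
    for line in taxid_map_handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        accession, taxid = parts[0], parts[1]
        if taxid not in node_ids:
            missing.append((accession, taxid))
    return missing


# Hypothetical in-memory stand-ins for nodes.dmp and taxid.map
nodes_text = "1\t|\t1\t|\tno rank\t|\n100\t|\t1\t|\tspecies\t|\n"
taxid_map_text = "seqA\t100\nseqB\t999\n"

node_ids = read_dmp_ids(io.StringIO(nodes_text))
missing = unmapped_accessions(io.StringIO(taxid_map_text), node_ids)
print(missing)
```

With real files, the same two functions can be called on `open("nodes.dmp")` and `open("taxid.map")`. An empty result would suggest the mapping itself is consistent and the problem lies elsewhere.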
The |
First, you create the database as Then:
|
@franlat
When you used this file, was your taxonkit command the following?
My understanding is that the above command would generate the What is the format for the |
Yes, exactly like that! The taxidmapping file is the taxid.map that taxonkit creates. It looks like this:
|
For these:
Can the values in the second column be source identifiers instead of NCBI taxids? Not all of the organisms have a |
These are not NCBI taxids, they are IDs that taxonkit creates automatically. So you don't have to worry about it, I think. |
Ok, I built it. Our servers are undergoing maintenance today but I'll try testing it out today or tomorrow when they are available. |
Just a heads up, I've been able to build it but I'm trying to figure out exactly how many resources I need to run |
I usually run just 'mmseqs taxonomy'. Were you able to make it work? |
No not yet. I even tried it with a much smaller subset but still same error. Here's the MMSEQS2 chat: soedinglab/MMseqs2#779 |
There might be another option:
|
Hi! I'm glad to see you are still working on it. I am still very interested! I just checked I recently built a MMSEQS2 database for the new marine DB called MarFERReT. It is very similar to EukProt, but designed to be more specific with populations and species. The good thing compared to EukProt is that it includes NCBI taxids, and building the taxonomy for mmseqs2 was so much easier, as you can use the NCBI taxonomic files directly. Perhaps one solution is to try to give the VEBA microeukaryotic database a taxid for each entry. I'm not sure it is feasible since it might be a lot of work. |
I've thought about that, but the problem is that some of the databases I'm pulling from unfortunately do not have NCBI taxids for their entries. So it was the classic strict vs. relaxed scenario. From my understanding, the way MetaEuk and MMseqs2 classify contigs is a little different in terms of the database that is used, but I'll definitely check it out. In the meantime, here's the current source taxonomy table for MicroEuk100_v3: |
I think I made it work? I used the source taxonomy file that you provided in the last comment and used taxonkit to generate the taxdump files. The taxonkit manual says that you have to order it, but I didn't this time. I ran some tests, and it turned out okay. Still, many contigs were The files are too big for me to upload them here, but I can try a wetransfer if you need. Here's what the file that I gave taxonkit as input looks like:
|
This is a great development. One parameter that you can use to get around this memory issue is using
Some potential issues around this (that I think have been resolved): soedinglab/MMseqs2#338. I typically allocate 36G to the split memory limit, and then when I request memory from the servers I ask for 48G just in case it goes over a little bit.
How did you order it? |
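The memory settings mentioned above can be sketched as a small command builder. This assumes the parameter in question is MMseqs2's `--split-memory-limit` option and that the call is `mmseqs taxonomy`. The database names, the helper functions, and the fixed 12G of headroom (mirroring the 36G versus 48G figures) are hypothetical placeholders.

```python
import shlex


def taxonomy_command(query_db, target_db, result_db, tmp_dir,
                     split_memory_limit="36G"):
    """Build an `mmseqs taxonomy` argument list with a memory cap."""
    return [
        "mmseqs", "taxonomy",
        query_db, target_db, result_db, tmp_dir,
        "--split-memory-limit", split_memory_limit,
    ]


def scheduler_request_gb(split_limit_gb, headroom_gb=12):
    """Memory to request from the scheduler: the split limit plus headroom."""
    return split_limit_gb + headroom_gb


# Hypothetical database and directory names
cmd = taxonomy_command("queryDB", "microeukDB", "taxResult", "tmp")
print(shlex.join(cmd))
print(f"request {scheduler_request_gb(36)}G from the scheduler")
```

The argument list can be passed to `subprocess.run(cmd, check=True)` on a machine where `mmseqs` is installed. Keeping the scheduler request above the split limit leaves room for memory used outside the prefilter split.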
The taxonkit manual says that mmseqs2 databases require the taxdump to be ordered, and provides a script to do it, but it doesn't work as intended. This time, I built the database with the taxid mapping file that taxonkit creates directly. Indeed, I didn't use the whole source taxonomy the last time, so that might have been the issue there. In any case, I will continue using this version and see whether anything strange turns up. |
Hello! Lately I've been testing contig-level taxonomic assignment with MMSEQS easy-taxonomy and the VEBA microeuk v3 database (built as I mentioned in the last comments). I've been comparing it with the MarFERReT db using the same approach, and the results are very similar. So, it works great. I guess I'll have to build it for the new database version! I tested running the eukaryotic classify environment (it's still running the first version of the database, not v3) vs using mmseqs2 easy-taxonomy and summarizing the output manually. Using mmseqs2 turned out to be more accurate, but perhaps that has to do with the versions of the database being different. I guess this can also be used to assess "contamination", too. I would like to know your insights on this matter! |
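Summarizing the easy-taxonomy output manually could look something like the sketch below. It is a guess at the workflow, not the script used in this thread. It assumes a tab-separated per-contig LCA table whose first four columns are query, taxid, rank, and name, and that unclassified contigs carry taxid 0. The function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def summarize_lca(handle):
    """Count contigs per assigned taxon name in a tab-separated LCA table."""
    counts = Counter()
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or truncated rows
        _query, taxid, _rank, name = row[:4]
        # Assumption: taxid 0 marks a contig with no assignment
        label = "unassigned" if taxid == "0" else name
        counts[label] += 1
    return counts


# Hypothetical rows standing in for a real LCA report
lca_text = (
    "contig1\t100\tspecies\tSpecies A\n"
    "contig2\t0\tno rank\tunclassified\n"
    "contig3\t100\tspecies\tSpecies A\n"
)

counts = summarize_lca(io.StringIO(lca_text))
for label, n in counts.most_common():
    print(label, n)
```

Comparing the "unassigned" count between two databases run on the same contigs gives a quick sense of how much each database covers.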
Hello, it's me again.
I wondered if the Microeukaryotic DB can be used with the MMSEQS2 taxonomy module. I am interested in using this database to assign taxonomy at the contig level rather than the whole genome. However, MMSEQS2 requires taxdump-like information from the database. Since the Microeukaryotic DB has a similar taxonomy format to GTDB, do you think it can be done? Perhaps you already have the nodes.dmp and names.dmp files?