sourmash tax question - genus level matches? #3497

Open
jodyphelan opened this issue Jan 17, 2025 · 3 comments

@jodyphelan

If a hypothetical new species were sequenced that had at most 85% sequence identity to every other species in its genus, would sourmash be able to identify it as belonging to that genus?

To give a more concrete example, this is the output from gather with gtdb-rs207.genomic-reps.dna.k31.zip:

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
97.0 kbp       1.5%    1.6%      38.3    GCF_000266905.1 Mycolicibacterium chubuense NBB4 strain=NBB4, ASM26690v1
@ctb
Contributor

ctb commented Jan 18, 2025

hi @jodyphelan, some hot takes per @bluegenes (who wrote this part of tax):

That's complicated. Basically yes, if we believe we have the "right" ANI/AAI cutoff for genus-level matching.

You might also consider using k=21 to get better genus-level matching, and/or using a protein molecule type/sketch (but we don't provide standardized protein databases yet, so you'd have to sketch your own matching genomes).
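To illustrate why a smaller k can help at this divergence, here is a rough back-of-the-envelope sketch. It assumes a very simple model in which every base matches independently with probability equal to the ANI, so a whole k-mer is shared with probability ANI^k. The function name `expected_containment` is hypothetical and is not part of sourmash; real genomes will deviate from this model.

```python
def expected_containment(ani: float, k: int) -> float:
    """Expected fraction of shared k-mers, assuming each base matches
    independently with probability `ani` (a simplifying assumption)."""
    if not 0.0 <= ani <= 1.0:
        raise ValueError("ani must be between 0 and 1")
    # A k-mer is shared only if all k of its bases match.
    return ani ** k


# Compare the two k-mer sizes discussed in this thread at 85% identity.
for k in (21, 31):
    frac = expected_containment(0.85, k)
    print(f"k={k}: {frac:.2%} of k-mers expected to match")
```

Under this model, an 85% match leaves roughly 3.3% of k-mers shared at k=21 but only about 0.65% at k=31, so the k=21 signal is several times stronger.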

@jodyphelan
Author

jodyphelan commented Jan 20, 2025

Thanks @ctb

I've managed to boost the proportion in Mycobacterium up to 11% by using k=21:

sample name    proportion   cANI   lineage
-----------    ----------   ----   -------
WMW1132           88.9%     -      unclassified
WMW1132           11.1%     87.8%  d__Bacteria;p__Actinomycetota;c__Actinomycetes;o__Mycobacteriales;f__Mycobacteriaceae;g__Mycobacterium

@ctb changed the title from "sourmash tax question" to "sourmash tax question - genus level matches?" on Jan 20, 2025
@ctb
Contributor

ctb commented Jan 20, 2025

@jodyphelan this looks great! I will say that an 11.1% overlap in k=21 space between microbial genomes is actually surprisingly stringent (despite looking "low") and would match ~family or genus level, as the cANI suggests.
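As a rough illustration of why a "low" overlap can still imply a fairly high ANI, the sketch below inverts the simple independent-mutation model, giving the point estimate ANI ≈ containment^(1/k). This is an assumption for illustration only: `containment_to_ani` is a hypothetical helper, not sourmash's API, and sourmash's reported cANI need not match it exactly, since it may be computed from a different containment value (for example, one not weighted by abundance) and with a more careful estimator.

```python
def containment_to_ani(containment: float, k: int) -> float:
    """Point estimate of ANI from k-mer containment, assuming each base
    mutates independently (a simplifying assumption)."""
    if not 0.0 < containment <= 1.0:
        raise ValueError("containment must be in (0, 1]")
    # Invert containment = ani ** k.
    return containment ** (1.0 / k)


# Values quoted in this thread: 11.1% at k=21 and 1.5% at k=31.
for containment, k in ((0.111, 21), (0.015, 31)):
    ani = containment_to_ani(containment, k)
    print(f"containment={containment:.1%}, k={k}: ANI ~ {ani:.1%}")
```

With these inputs the estimate is about 90% for the k=21 result and about 87% for the k=31 gather match, both in the range usually associated with family- or genus-level relatedness rather than species-level.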
