eukaryotic genome databases on farm (and soon, publicly available) #3504

Open
ctb opened this issue Jan 22, 2025 · 0 comments
Labels
fyi Information that is interesting or useful

ctb commented Jan 22, 2025

eukaryotic genome databases are now available on farm 🎉. These contain (almost) every reference genome under NCBI taxonomy node 2759 (Eukaryota) - a total of 19,215 lineages.

location: /group/ctbrowngrp5/sourmash-db/genbank-euks-2024.01/

-rw-rw-r-- 1 ctbrown datalabgrp 4.0G Jan 21 11:03 vertebrates.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 1.7G Jan 21 09:05 bilateria-minus-vertebrates.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 1.3G Jan 21 08:39 plants.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 165M Jan 21 08:03 fungi.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp  81M Jan 21 08:06 metazoa-minus-bilateria.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp  56M Jan 21 08:08 eukaryotes-other.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 3.9M Jan 21 06:57 eukaryotes.lineages.csv
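
One possible way to search against these files is `sourmash gather`. The sketch below only assembles the command line; it assumes sourmash is installed, that `query.sig.zip` (a hypothetical file name) was sketched with k=51, and that `-k` and `--scaled` behave as described in `sourmash gather --help` for your version.

```python
from pathlib import PurePosixPath
import shlex

# Directory and file names are taken from the listing above.
DB_DIR = PurePosixPath("/group/ctbrowngrp5/sourmash-db/genbank-euks-2024.01")

DATABASES = [
    "vertebrates.k51.sig.zip",
    "bilateria-minus-vertebrates.k51.sig.zip",
    "plants.k51.sig.zip",
    "fungi.k51.sig.zip",
    "metazoa-minus-bilateria.k51.sig.zip",
    "eukaryotes-other.k51.sig.zip",
]


def gather_command(query, databases=DATABASES, ksize=51, scaled=10_000):
    """Build (but do not run) a `sourmash gather` command as an argument list."""
    cmd = ["sourmash", "gather", str(query)]
    cmd += [str(DB_DIR / name) for name in databases]
    cmd += ["-k", str(ksize), "--scaled", str(scaled)]
    return cmd


if __name__ == "__main__":
    # "query.sig.zip" is a placeholder; substitute your own sketch file.
    print(shlex.join(gather_command("query.sig.zip")))
```

The printed command can be pasted into a shell on farm, or the list can be passed to `subprocess.run`. Restricting `databases` to one subset (for example only `fungi.k51.sig.zip`) keeps the search small.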

they are available for download here:

bilateria-minus-vertebrates.k51.sig.zip
eukaryotes-other.k51.sig.zip
eukaryotes.lineages.csv
fungi.k51.sig.zip
metazoa-minus-bilateria.k51.sig.zip
plants.k51.sig.zip
vertebrates.k51.sig.zip

missing genomes

there are 26 reference genomes missing; they don't seem to be available on GenBank. current list is in this file:

/home/ctbrown/scratch3/2025-sourmash-eukaryotic-databases/collections/eukaryotes-missing.links.csv

build repos and scripts

for sketching, I used the code in ctb/2025-ncbi-rest-api#1 to get a list of all euk genomes (in various subsets) and sketch them with directsketch at k=21, k=31, and k=51, with scaled=1000.

the code at https://github.com/sourmash-bio/2025-sourmash-eukaryotic-databases was used to take those sketches and build comprehensive subsets + a lineage CSV at k=51 and scaled=10_000. It would be easy enough to add k=21 and k=31.
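
Going from scaled=1000 to scaled=10_000 is a downsampling step. In a FracMinHash sketch, a hash is kept when it falls below roughly 2^64 / scaled, so the scaled=10_000 hashes are a subset of the scaled=1000 ones and can be derived without re-reading the genomes. A minimal illustration of that idea in plain Python follows; it is not the sourmash implementation, and the random integers are stand-ins for real k-mer hashes.

```python
import random

HASH_SPACE = 2**64  # size of the 64-bit hash space


def max_hash_for_scaled(scaled):
    """Approximate cutoff: keep about 1/scaled of the hash space."""
    return HASH_SPACE // scaled


def downsample(hashes, scaled):
    """Keep only the hashes that fall below the cutoff for `scaled`."""
    cutoff = max_hash_for_scaled(scaled)
    return {h for h in hashes if h < cutoff}


# Stand-in data: random 64-bit integers instead of real k-mer hashes.
rng = random.Random(42)
all_hashes = {rng.getrandbits(64) for _ in range(200_000)}

at_1000 = downsample(all_hashes, 1000)
# Downsampling the scaled=1000 sketch gives the same result as
# sketching directly at scaled=10_000.
at_10000 = downsample(at_1000, 10_000)

print(len(at_1000), len(at_10000))
```

The same property is why a larger scaled value yields a smaller database: about a tenth as many hashes are retained per genome.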

I'm building a RocksDB index (k=51, scaled=10_000) using the scripts here: https://github.com/ctb/2025-make-rocksdb-entire/

content summary

a summary of the manifest going into the RocksDB:

** loading from 'entire-2025-01-21.mf.csv'
path filetype: StandaloneManifestIndex
location: entire-2025-01-21.mf.csv
is database? yes
has manifest? yes
num signatures: 616184
** examining manifest...
total hashes: 3158566951
summary of sketches:
   19556 sketches with DNA, k=51, scaled=10000        1102231815 total hashes
   596628 sketches with DNA, k=51, scaled=1000, abund 2056335136 total hashes

where the 19,556 sketches are the eukaryotic ones, and the 596,628 are from GTDB.
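
As a quick consistency check, the two per-group lines in the summary should add up to the reported totals (616184 signatures, 3158566951 hashes). A small sketch that parses the two lines as pasted above and checks the sums; the regular expression is written for this specific output format and is an assumption, not part of sourmash:

```python
import re

# The two summary lines, copied from the output above.
SUMMARY = """\
   19556 sketches with DNA, k=51, scaled=10000        1102231815 total hashes
   596628 sketches with DNA, k=51, scaled=1000, abund 2056335136 total hashes
"""

# Reported totals from the same output.
REPORTED_SIGNATURES = 616184
REPORTED_HASHES = 3158566951

LINE_RE = re.compile(r"^\s*(\d+) sketches with .*?(\d+) total hashes\s*$")


def parse_summary(text):
    """Return a list of (num_sketches, total_hashes) pairs, one per line."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append((int(match.group(1)), int(match.group(2))))
    return rows


rows = parse_summary(SUMMARY)
total_sketches = sum(n for n, _ in rows)
total_hashes = sum(h for _, h in rows)

print(total_sketches == REPORTED_SIGNATURES, total_hashes == REPORTED_HASHES)
```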

ctb added the fyi label Jan 22, 2025