eukaryotic genome databases are now available on farm 🎉. These contain (almost) every reference genome under NCBI taxonomy node 2759, a total of 19,215 lineages.
** loading from 'entire-2025-01-21.mf.csv'
path filetype: StandaloneManifestIndex
location: entire-2025-01-21.mf.csv
is database? yes
has manifest? yes
num signatures: 616184
** examining manifest...
total hashes: 3158566951
summary of sketches:
19556 sketches with DNA, k=51, scaled=10000 1102231815 total hashes
596628 sketches with DNA, k=51, scaled=1000, abund 2056335136 total hashes
where the 19,556 sketches are the eukaryotic ones, and the 596,628 are from GTDB.
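As a quick sanity check, the per-group counts in the summary above add up to the reported totals. A minimal sketch in Python, with the numbers copied by hand from that output (the dictionary layout and variable names are only for illustration):

```python
# Counts copied by hand from the manifest summary above.
groups = {
    "eukaryotes (k=51, scaled=10000)": {"sketches": 19_556, "hashes": 1_102_231_815},
    "GTDB (k=51, scaled=1000, abund)": {"sketches": 596_628, "hashes": 2_056_335_136},
}
reported_signatures = 616_184
reported_total_hashes = 3_158_566_951

total_sketches = sum(g["sketches"] for g in groups.values())
total_hashes = sum(g["hashes"] for g in groups.values())

# Both comparisons should hold if the summary is internally consistent.
print("signature counts match:", total_sketches == reported_signatures)
print("hash counts match:", total_hashes == reported_total_hashes)

# Average sketch size per group, a rough view of how much each contributes.
for name, g in groups.items():
    print(f"{name}: ~{g['hashes'] / g['sketches']:,.0f} hashes per sketch")
```

The eukaryotic sketches are far fewer in number but much larger on average, which is why they still account for roughly a third of the hashes despite the coarser scaled value.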
location:
/group/ctbrowngrp5/sourmash-db/genbank-euks-2024.01/
they are available for download here:
bilateria-minus-vertebrates.k51.sig.zip
eukaryotes-other.k51.sig.zip
eukaryotes.lineages.csv
fungi.k51.sig.zip
metazoa-minus-bilateria.k51.sig.zip
plants.k51.sig.zip
vertebrates.k51.sig.zip
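After downloading one of these, it can be useful to check what is inside without loading any sketches. The following is a rough sketch using only the Python standard library. It assumes the zip file carries a manifest named `SOURMASH-MANIFEST.csv` with `ksize` and `scaled` columns and a leading `#` version line, which is how sourmash zip collections are normally laid out; the function name `summarize_zip_manifest` is hypothetical, not part of sourmash.

```python
import csv
import io
import zipfile
from collections import Counter


def summarize_zip_manifest(zip_path):
    """Count sketches per (ksize, scaled) listed in a .sig.zip manifest."""
    with zipfile.ZipFile(zip_path) as zf:
        # Assumption: the manifest is stored under this name in the archive.
        with zf.open("SOURMASH-MANIFEST.csv") as fp:
            text = io.TextIOWrapper(fp, encoding="utf-8", newline="")
            # Skip leading comment lines such as the manifest version header.
            rows = (line for line in text if not line.startswith("#"))
            reader = csv.DictReader(rows)
            return Counter((int(r["ksize"]), int(r["scaled"])) for r in reader)
```

For example, `summarize_zip_manifest("fungi.k51.sig.zip")` should return a counter keyed by `(ksize, scaled)` pairs. In practice `sourmash sig summarize` on the same file gives a fuller report.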
missing genomes
there are 26 reference genomes missing; they don't seem to be available on GenBank. current list is in this file:
/home/ctbrown/scratch3/2025-sourmash-eukaryotic-databases/collections/eukaryotes-missing.links.csv
build repos and scripts
for sketching, I used the code in ctb/2025-ncbi-rest-api#1 to get a list of all euk genomes (in various subsets) and sketched them with directsketch at k=21, k=31, and k=51, with scaled=1000.
the code at https://github.com/sourmash-bio/2025-sourmash-eukaryotic-databases was then used to take those sketches and build comprehensive subsets + a lineage CSV at k=51 and scaled=10_000. It would be easy enough to add k=21 and k=31.
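Going from scaled=1000 to scaled=10_000 does not require re-reading the genomes, because a scaled (FracMinHash) sketch keeps every hash below a cutoff of roughly 2^64 / scaled, so a coarser sketch is a subset of a finer one. The following is a simplified pure-Python illustration of that idea, not the code from the repository above; the function names are hypothetical, and the exact rounding and boundary handling in sourmash may differ.

```python
MAX_HASH = 2**64  # size of the 64-bit hash space


def max_hash_for_scaled(scaled):
    """Approximate cutoff: keep about 1/scaled of the hash space."""
    return MAX_HASH // scaled


def downsample(hashes, new_scaled):
    """Keep only the hashes that fall at or below the cutoff for new_scaled."""
    cutoff = max_hash_for_scaled(new_scaled)
    return {h for h in hashes if h <= cutoff}


# Toy example: hashes that would all be retained at scaled=1000.
fine = {1, MAX_HASH // 10_000, MAX_HASH // 10_000 + 1, MAX_HASH // 1000}
coarse = downsample(fine, 10_000)
print(len(fine), "->", len(coarse))
```

The reverse is not possible: a scaled=10_000 sketch cannot be turned back into a scaled=1000 one, which is one reason to sketch at the finer value first.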
I'm building a RocksDB index (k=51, scaled=10_000) using the scripts here: https://github.com/ctb/2025-make-rocksdb-entire/
content summary
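a summary of the manifest going into the RocksDB, entire-2025-01-21.mf.csv: 616,184 signatures and 3,158,566,951 total hashes. The 19,556 sketches (k=51, scaled=10000) are the eukaryotic ones, and the 596,628 (k=51, scaled=1000, with abundance) are from GTDB.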