eukaryotic genome databases on farm (and soon, publicly available) #3504

ctb · 2025-01-22T04:06:19Z

eukaryotic genome databases are now available on farm 🎉. These contain (almost) every reference genome under NCBI taxonomy node 2759 (Eukaryota), for a total of 19,215 lineages.

location: /group/ctbrowngrp5/sourmash-db/genbank-euks-2024.01/

-rw-rw-r-- 1 ctbrown datalabgrp 4.0G Jan 21 11:03 vertebrates.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 1.7G Jan 21 09:05 bilateria-minus-vertebrates.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 1.3G Jan 21 08:39 plants.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 165M Jan 21 08:03 fungi.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp  81M Jan 21 08:06 metazoa-minus-bilateria.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp  56M Jan 21 08:08 eukaryotes-other.k51.sig.zip
-rw-rw-r-- 1 ctbrown datalabgrp 3.9M Jan 21 06:57 eukaryotes.lineages.csv

they are available for download here:

bilateria-minus-vertebrates.k51.sig.zip
eukaryotes-other.k51.sig.zip
eukaryotes.lineages.csv
fungi.k51.sig.zip
metazoa-minus-bilateria.k51.sig.zip
plants.k51.sig.zip
vertebrates.k51.sig.zip
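
For a quick look at what the lineage file covers, here is a minimal sketch that counts rows per value of one rank column. It assumes eukaryotes.lineages.csv follows the usual sourmash taxonomy layout (an `ident` column followed by rank columns such as `superkingdom` and `phylum`). The exact header of this particular file is an assumption, and the sample rows below are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_rank(csv_file, rank="phylum"):
    """Count rows per value of one rank column in a lineage CSV."""
    reader = csv.DictReader(csv_file)
    if rank not in (reader.fieldnames or []):
        raise KeyError(f"column {rank!r} not found; have {reader.fieldnames}")
    # Empty cells are grouped under a placeholder so they stay visible.
    return Counter((row[rank] or "(unassigned)") for row in reader)


# Made-up rows for illustration only; not taken from the real file.
example = io.StringIO(
    "ident,superkingdom,phylum\n"
    "GCA_000000001.1,Eukaryota,Chordata\n"
    "GCA_000000002.1,Eukaryota,Chordata\n"
    "GCA_000000003.1,Eukaryota,Ascomycota\n"
    "GCA_000000004.1,Eukaryota,\n"
)
counts = count_by_rank(example, rank="phylum")
for name, n in counts.most_common():
    print(name, n)
```

To run it against the real file, replace the `io.StringIO` object with `open(path, newline="")` where `path` points at eukaryotes.lineages.csv in the directory listed above, and adjust `rank` to whichever column names the header actually contains.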

missing genomes

there are 26 reference genomes missing; they don't seem to be available on GenBank. current list is in this file:

/home/ctbrown/scratch3/2025-sourmash-eukaryotic-databases/collections/eukaryotes-missing.links.csv

build repos and scripts

for sketching, I used the code in ctb/2025-ncbi-rest-api#1 to get a list of all euk genomes (in various subsets) and sketch them with directsketch at k=21, k=31, and k=51, with scaled=1000.

the code at https://github.com/sourmash-bio/2025-sourmash-eukaryotic-databases was used to take those sketches and build comprehensive subsets plus a lineage CSV at k=51 and scaled=10_000. It would be easy enough to add k=21 and k=31.
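
Going from scaled=1000 sketches to scaled=10_000 databases does not need re-sketching: a scaled sketch keeps only hashes below a threshold of roughly 2^64 / scaled, so raising scaled just discards the hashes above the new, lower threshold. The toy illustration below shows that idea in plain Python. It is not sourmash's own code, the function names are made up, and the exact rounding of the threshold is an assumption.

```python
def max_hash_for_scaled(scaled):
    """Approximate largest 64-bit hash kept at this scaled value."""
    if scaled < 1:
        raise ValueError("scaled must be >= 1")
    return 2**64 // scaled


def downsample(hashes, from_scaled, to_scaled):
    """Keep only the hashes that would survive at a larger scaled value."""
    if to_scaled < from_scaled:
        raise ValueError("cannot downsample to a smaller scaled value")
    limit = max_hash_for_scaled(to_scaled)
    return {h for h in hashes if h <= limit}


# Made-up hash values, all below the scaled=1000 threshold (~1.8e16).
sketch_1000 = {10**15, 5 * 10**15, 10**16}
sketch_10000 = downsample(sketch_1000, from_scaled=1000, to_scaled=10_000)
print(sorted(sketch_10000))
```

Only the hash below the scaled=10_000 threshold (about 1.8e15) survives, which is why the downsampled sketches are about a tenth of the size. The reverse direction is not possible: hashes dropped at scaled=10_000 cannot be recovered without the original sequence or a lower-scaled sketch.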

I'm building a RocksDB index (k=51, scaled=10_000) using the scripts here: https://github.com/ctb/2025-make-rocksdb-entire/

content summary

a summary of the manifest going into the RocksDB:

** loading from 'entire-2025-01-21.mf.csv'
path filetype: StandaloneManifestIndex
location: entire-2025-01-21.mf.csv
is database? yes
has manifest? yes
num signatures: 616184
** examining manifest...
total hashes: 3158566951
summary of sketches:
   19556 sketches with DNA, k=51, scaled=10000        1102231815 total hashes
   596628 sketches with DNA, k=51, scaled=1000, abund 2056335136 total hashes

where the 19,556 sketches are the eukaryotic ones, and the 596,628 are from GTDB.
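
As a sanity check on the summary above, the two per-category lines can be parsed and added up to confirm they account for the reported totals. This is a small sketch using only the numbers quoted in this post; the regular expression is written for these two lines and is not a general parser for sourmash output.

```python
import re

# The two "summary of sketches" lines quoted above, copied verbatim.
summary = """\
   19556 sketches with DNA, k=51, scaled=10000        1102231815 total hashes
   596628 sketches with DNA, k=51, scaled=1000, abund 2056335136 total hashes
"""

pattern = re.compile(
    r"^\s*(\d+) sketches with DNA, k=(\d+), scaled=(\d+)"
    r"(?:, abund)?\s+(\d+) total hashes$"
)

rows = []
for line in summary.splitlines():
    m = pattern.match(line)
    if m:
        n_sketches, ksize, scaled, n_hashes = map(int, m.groups())
        rows.append((n_sketches, ksize, scaled, n_hashes))

total_sketches = sum(r[0] for r in rows)
total_hashes = sum(r[3] for r in rows)
print(total_sketches, total_hashes)

for n_sketches, ksize, scaled, n_hashes in rows:
    mean_hashes = n_hashes / n_sketches
    # hashes * scaled is only a rough estimate of distinct k-mers per genome.
    print(f"scaled={scaled}: ~{mean_hashes:,.0f} hashes/sketch, "
          f"~{mean_hashes * scaled:,.0f} distinct k-mers/genome")
```

The sums match the `num signatures: 616184` and `total hashes: 3158566951` lines. The per-sketch means also show that, even at a 10x coarser scaled value, the eukaryotic sketches are on average much larger than the GTDB ones.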


ctb added the fyi label ("Information that is interesting or useful") on Jan 22, 2025
