Author: [email protected]
Snakemake Pipeline to check the requirements for a prokaryotic assembly to be included in the SeqCode initiative.
The requirements are outlined in APPENDIX I of the SeqCode.
See the usage instructions in the Snakemake workflow catalog for details.
Here is a rough overview:
- Install conda (mamba or miniconda is fine).
- Install snakemake with:
conda install -c conda-forge -c bioconda snakemake
- Download the CheckM2 database with:
wget https://zenodo.org/api/files/fd3bc532-cd84-4907-b078-2e05a1e46803/checkm2_database.tar.gz
- Download the GTDB-Tk database with:
wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz
- Download the latest release from this repo and cd into it.
- Edit config/config.yaml to provide the paths to your results and logs directories, the paths to the databases you downloaded, and any parameters you want to change.
- Edit config/sampleData.csv with the specific details for each assembly you want to check. The pipeline automatically adjusts which steps it runs based on what you enter here.
- Open a terminal in the main directory and start a dry run of the pipeline with the following command. This will download and install all the dependencies for the pipeline (this step may take some time) and show you whether you set up the paths correctly:
snakemake --sdm conda -n --cores
- Run the pipeline with
snakemake --sdm conda --cores
Planned improvements:
- Add a 16S rRNA gene truncation check.
- Add automatic switches for kingdom-specific modes of some tools.
- Automate the CheckM2 and GTDB-Tk database downloads.
- Add checks that the config file and the sample file are filled in correctly.
Tools used by the pipeline:
- Taxonomy
  - GTDB-Tk v2.4.0 - Toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes. Used to get the full-genome taxonomic classification.
  - Infernal v1.1.5 - RNA secondary structure/sequence profiles for homology search and alignment. Used to find and extract rRNA genes in the genomes.
  - DECIPHER v2.30.0 - Tools for curating, analyzing, and manipulating biological sequences. Used to get the 16S rRNA gene taxonomic classification by comparing against the SILVA database.
  - SILVA r138 - rRNA database. Used as the source of rRNA gene taxonomy.
- Contamination and completeness
  - CheckM2 v1.0.1 - Assessing the quality of metagenome-derived genome bins using machine learning. Used to get completeness and contamination stats. Unlike CheckM1 (one of the most popular tools for completeness and contamination prediction), CheckM2 applies universally trained machine learning models regardless of taxonomic lineage. This allows it to work better with organisms that have only a few known representative genomes.
- tRNA gene occurrence
  - tRNAscan-SE v2.0.12 - An improved tool for transfer RNA detection. Used to find tRNA genes in the genomes.
- General stats, file manipulation, alignment, and reporting
  - seqkit v2.8.2 - Ultrafast toolkit for FASTA/Q file manipulation. Used for quick and easy general stat gathering and sequence concatenation.
  - minimap2 v2.28 - Versatile pairwise aligner for genomic and spliced nucleotide sequences. Used to align sequencing reads to the assembly to get coverage stats.
  - samtools v1.20 - Tools for manipulating next-generation sequencing data. Used to calculate coverage stats.
  - tidyverse v2.0.0 - R packages for data science. Used for general data manipulation for reporting.
  - fs v1.6.4 - Cross-platform file operations. Used for file manipulation for reporting.
  - tinytable v0.4.0 - Simple and customizable tables. Used to generate the final report.
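To illustrate what the coverage stats from the minimap2 and samtools step might boil down to, here is a minimal Python sketch that computes the mean depth per contig. It assumes input in the three-column, tab-separated layout that `samtools depth` prints (contig, position, depth); the function name `mean_depth_per_contig` and the example values are made up for illustration, and this is not necessarily how the pipeline itself derives its numbers.

```python
from collections import defaultdict


def mean_depth_per_contig(lines):
    """Return {contig: mean depth} from 'contig<TAB>position<TAB>depth' lines."""
    totals = defaultdict(int)  # summed depth per contig
    counts = defaultdict(int)  # number of positions seen per contig
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        contig, _position, depth = line.split("\t")[:3]
        totals[contig] += int(depth)
        counts[contig] += 1
    return {contig: totals[contig] / counts[contig] for contig in totals}


# Made-up example values (hypothetical contig names, not real pipeline output).
example = [
    "contig_1\t1\t10\n",
    "contig_1\t2\t20\n",
    "contig_2\t1\t0\n",
    "contig_2\t2\t4\n",
]
print(mean_depth_per_contig(example))
```

Note that the mean is taken over the positions present in the input, so positions that are missing from the input (for example, zero-depth positions that were not reported) do not lower the average.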
Test data:
- data/GCF_000007305.1_ASM730v1_genomic.fna: the reference genome of Pyrococcus furiosus, which fits the SeqCode criteria. It was acquired from the RefSeq database.
- data/GCA_015662175.1_ASM1566217v1_genomic.fna: the assembly of Thermococcus paralvinellae, which does not fit the SeqCode criteria. It was acquired from the GenBank database.
- data/SRR8767914_subsampled.fastq.gz: a DNA-Seq dataset of Pyrococcus furiosus DSM 3638 that was subsampled for quicker testing via:
zcat SRR8767914.fastq.gz | seqkit sample --rand-seed 42 -p 0.1 -o SRR8767914_subsampled.fastq.gz
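The idea behind the seeded subsampling of the test reads can be sketched in Python. This assumes that proportion-based sampling keeps each read independently with the given probability and that a fixed seed makes the selection repeatable. The function `sample_fastq_records` and the synthetic reads are hypothetical, and the sketch will not select the same reads as seqkit, since the random number generators differ.

```python
import random


def sample_fastq_records(lines, proportion, seed=42):
    """Keep each 4-line FASTQ record with probability `proportion`."""
    rng = random.Random(seed)  # fixed seed makes the selection repeatable
    kept = []
    # A FASTQ record is four lines: header, sequence, '+', qualities.
    # An incomplete trailing record is ignored.
    for start in range(0, len(lines) - 3, 4):
        if rng.random() < proportion:
            kept.extend(lines[start:start + 4])
    return kept


# Synthetic reads for illustration only (not real sequencing data).
reads = []
for i in range(1000):
    reads += [f"@read_{i}", "ACGT", "+", "IIII"]

subset = sample_fastq_records(reads, proportion=0.1)
print(len(subset) // 4, "of", len(reads) // 4, "reads kept")
```

Running the function again with the same seed returns the same subset, which is what makes a subsampled test file reproducible.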
Copyright Richard Stöckl 2024.
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE or copy at
https://www.boost.org/LICENSE_1_0.txt)