Input&output v2.1
Snakemake takes configuration files in YAML format, where key-value pairs are separated by a colon and strings do not need to be quoted:
######################
# General settings
######################
PROJECT: IBD_tutorial
working_dir: /mypath/IBD_tutorial
omics: metaG
minto_dir: /mypath/MIntO
METADATA: /mypath/IBD_tutorial/tutorial_metadata.txt
raw_reads_dir: /mypath/IBD_tutorial_raw/metaG
When running QC_0.smk or QC_1.smk, the corresponding yaml file needs to be filled in with the necessary paths and values. See the tutorial for details.
In the subsequent modules of the pipeline, these values are inherited. Likewise, most parameters are set to their defaults in the yaml files generated by the pipeline, but users should review these settings each time, as some computationally intensive features are turned off by default.
MIntO requires sequencing data and an accompanying metadata sheet to run.
Sample identifiers should be unique, contain only [A-Za-z0-9_] characters, and agree between the two inputs and also between metaG and metaT.
Sample identifiers containing only numbers and underscores need to be in single quotes, e.g. '1_1_1' and '1234'.
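The identifier rules above can be checked before a run. The following is a minimal sketch, not part of MIntO: the function name check_sample_ids and the example identifiers are made up for illustration.

```python
import re

# Allowed characters per the rule above: letters, digits and underscores only.
VALID_ID = re.compile(r"[A-Za-z0-9_]+")
# Identifiers made up solely of digits and underscores need single quotes in YAML.
NEEDS_QUOTES = re.compile(r"[0-9_]+")


def check_sample_ids(sample_ids):
    """Return (invalid, duplicated, needs_quotes) lists for the given identifiers."""
    seen = set()
    invalid, duplicated, needs_quotes = [], [], []
    for sid in sample_ids:
        if not VALID_ID.fullmatch(sid):
            invalid.append(sid)
        if sid in seen:
            duplicated.append(sid)
        seen.add(sid)
        if NEEDS_QUOTES.fullmatch(sid):
            needs_quotes.append(sid)
    return invalid, duplicated, needs_quotes


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    ids = ["sample1", "1_1_1", "1234", "bad-id", "sample1"]
    invalid, duplicated, needs_quotes = check_sample_ids(ids)
    print("invalid:", invalid)
    print("duplicated:", duplicated)
    print("quote in YAML:", needs_quotes)
```

Running the same check on the identifiers from the metadata sheet and on those derived from the sequencing data helps confirm that the two inputs agree.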
The metadata sheet is a text file with columns separated by tabs, and its location has to be specified in METADATA in the Snakemake configuration file QC_0.yaml or QC_1.yaml.
Required columns: sample, sample_alias
One time attribute (sampling time) and up to two categorical attributes can be used for data visualization, for example to create PCoA diagrams and cladograms in QC_2.smk.
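A quick way to confirm that a metadata sheet carries the required columns is sketched below. It is an illustration only and not MIntO code: the function name missing_metadata_columns and the example columns week and group are hypothetical.

```python
import csv
import io

# Required columns as listed above.
REQUIRED_COLUMNS = ("sample", "sample_alias")


def missing_metadata_columns(handle):
    """Return the required columns that are absent from a tab-separated sheet."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [col for col in REQUIRED_COLUMNS if col not in header]


if __name__ == "__main__":
    # Hypothetical sheet content; 'week' and 'group' are made-up attribute columns.
    sheet = "sample\tsample_alias\tweek\tgroup\nS1\tpatient_A\t0\tcontrol\n"
    print(missing_metadata_columns(io.StringIO(sheet)))
```

For a real sheet, pass an open file handle instead of the in-memory string, e.g. open(path, newline="").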
MIntO takes both short, paired-end sequencing reads and long sequencing reads as input, and expects them to be in gzipped fastq format.
The pipeline is capable of hybrid assembly of Nanopore and Illumina reads, but the main application is processing short reads.
All fastq files must have the same suffix, which is by default 1.fq.gz for the forward and 2.fq.gz for the reverse file. Thus, the filenames would be <something>.1.fq.gz and <something>.2.fq.gz, where the <something> string has to agree for the pair of files.
The suffix strings can be customized in QC_0.yaml or QC_1.yaml via the ILLUMINA_suffix parameter.
See the repository-version of the QC_0.yaml file.
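The pairing rule can be illustrated with a short sketch that groups filenames by the part preceding the suffix. This is not how MIntO itself matches files; the function name pair_fastq_files is hypothetical, and the default suffixes are the ones stated above.

```python
def pair_fastq_files(filenames, fwd_suffix="1.fq.gz", rev_suffix="2.fq.gz"):
    """Pair forward and reverse files that share the same prefix.

    Returns (pairs, unpaired), where pairs maps each prefix (everything before
    the suffix, including the separator) to a (forward, reverse) tuple.
    """
    forward, reverse = {}, {}
    for name in filenames:
        if name.endswith(fwd_suffix):
            forward[name[: -len(fwd_suffix)]] = name
        elif name.endswith(rev_suffix):
            reverse[name[: -len(rev_suffix)]] = name
    pairs = {p: (forward[p], reverse[p]) for p in forward if p in reverse}
    unpaired = sorted(
        [forward[p] for p in forward if p not in reverse]
        + [reverse[p] for p in reverse if p not in forward]
    )
    return pairs, unpaired


if __name__ == "__main__":
    # Hypothetical filenames; the last two lack a mate.
    files = ["lib1.1.fq.gz", "lib1.2.fq.gz", "lib3_1.fq.gz", "lib4.2.fq.gz"]
    pairs, unpaired = pair_fastq_files(files)
    print(pairs)
    print(unpaired)
```

Any file reported as unpaired points to a mismatch in the <something> part or in the suffix.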
The location of the input sequencing data has to be specified in raw_reads_dir in QC_0.yaml or QC_1.yaml.
The data can be organized in two ways inside this directory:
Samples can be organized into directories, one directory per sample, where the directory name is the same as the sample identifier given in the metadata sheet's sample column. E.g., /somewhere/coolstudy/sample1, /somewhere/coolstudy/sample2, /somewhere/coolstudy/sample3, etc.
Sometimes, we have multiple sequencing runs for a sample. This is also allowed, as long as all the runs are placed within the same sample directory. A sample with three sequencing runs might look like this:
coolstudy
|
---sample1
   |
   --- lib1.1.fq.gz
   --- lib1.2.fq.gz
   --- lib2.1.fq.gz
   --- lib2.2.fq.gz
   --- lib3_1.fq.gz
   --- lib3_2.fq.gz
---sample2
   |
   --- sample2_1.fq.gz
   --- sample2_2.fq.gz
Samples that are to be processed need to be listed in the yaml file under ILLUMINA with their sample identifiers.
See the repository-version of the QC_1.yaml file.
Alternatively, the files of all samples can be placed directly in raw_reads_dir:
coolstudy
|
--- sample1.1.fq.gz
--- sample1.2.fq.gz
--- sample2.1.fq.gz
--- sample2.2.fq.gz
--- sample3_1.fq.gz
--- sample3_2.fq.gz
Either a) the filenames contain the sample identifiers (e.g. sample1.1.fq.gz) and no sample has multiple runs, in which case sample identifiers are collected from the metadata sheet's sample column, as set in QC_0.yaml or QC_1.yaml:
# 'column_name': use all samples from the column 'column_name' from the metadata file
ILLUMINA: sample
Or b) there are multiple runs per sample, in which case the samples and their corresponding run identifiers need to be given in an additional tab-separated text file with the columns sample and run, and the ILLUMINA parameter in QC_0.yaml or QC_1.yaml contains its location:
# Input data
# ILLUMINA section:
ILLUMINA: /path/to/runs_sheet.tsv
See an example of the runs sheet file.
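To make the runs sheet format concrete, the sketch below reads a tab-separated file with the columns sample and run and groups the runs by sample. It is illustrative only: the function name read_runs_sheet and the example rows are assumptions, not taken from MIntO.

```python
import csv
import io
from collections import defaultdict


def read_runs_sheet(handle):
    """Map each sample identifier to the list of its run identifiers."""
    runs_by_sample = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        runs_by_sample[row["sample"]].append(row["run"])
    return dict(runs_by_sample)


if __name__ == "__main__":
    # Hypothetical runs sheet: sample1 has two runs, sample2 has one.
    sheet = "sample\trun\nsample1\tlib1\nsample1\tlib2\nsample2\tlib3\n"
    print(read_runs_sheet(io.StringIO(sheet)))
```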
The main outputs are placed into the output folder under the working directory. Software and database versions are printed to text files in output/versions.
Here follows a non-exhaustive list of outputs, with placeholders for various variables, e.g. metaG/metaT and sample identifiers.
Histogram of the read lengths after trimming:
output/1-trimmed/{omics}_cumulative_read_length_cutoff.pdf
Trimmed files:
{omics}/1-trimmed/{sample}
Taxonomic profiling from reads:
Abundance tables:
output/6-taxa_profile/{omics}.{taxonomy}.{version}.merged_abundance_table.txt
output/6-taxa_profile/{omics}.{taxonomy}.{version}.merged_abundance_table_species.txt
Filtered to taxa that are present in at least one sample:
output/6-taxa_profile/{omics}.{taxonomy}.{version}.tsv
Phyloseq object:
output/6-taxa_profile/{omics}.{taxonomy}.{version}_phyloseq.rds
Plots:
output/6-taxa_profile/{omics}.{taxonomy}.{version}.PCoA.Bray_Curtis.pdf
output/6-taxa_profile/{omics}.{taxonomy}.{version}.Top15genera.pdf
output/6-taxa_profile/{omics}.{taxonomy}.{version}.richness.pdf
Sourmash fracminhash clusters:
Barplots of the clustered samples with their top 15 genera:
output/6-1-smash/{omics}.{taxonomy}.{version}.clusters.pdf
UMAP and cladogram of sample dissimilarities:
output/6-1-smash/{omics}.sourmash_plots.pdf
Cleaned metagenomic reads:
metaG/4-hostfree/{sample}
Cleaned meta-transcriptomic reads:
metaT/5-1-sortmerna/{sample}
Illumina single-sample assemblies:
{omics}/7-assembly/{sample}
Illumina co-assemblies:
{omics}/7-assembly/{coassembly}/{assembly_preset}
where {coassembly} is from the MAIN_factor attribute, and {assembly_preset} is e.g. meta-large, meta-sensitive.
Metabat-like depth files over scaffolds/contigs:
{omics}/8-1-binning/depth_{scaf_type}/combined.{min_length}.depth.txt
Unique MAG collection:
{omics}/8-1-binning/mags_generation_pipeline/unique_genomes
Checkm2 and sequence scores:
{omics}/8-1-binning/mags_generation_pipeline/coverm_unique_cluster_scored.tsv
Taxonomic annotation:
{omics}/8-1-binning/mags_generation_pipeline/taxonomy.phylophlan.SGB.Jun23.tsv
{omics}/8-1-binning/mags_generation_pipeline/taxonomy.gtdb.r220.tsv
Locus id for each genome:
DB/{MINTO_MODE}/1-prokka/locus_id_list.txt
Predicted CDS:
DB/{MINTO_MODE}/2-postprocessed/
Functional annotation of genomes:
DB/{MINTO_MODE}/4-annotations/{ANNOTATION}.tsv
Read mapping statistics for each sample:
{omics}/9-mapping-profiles/{MINTO_MODE}/mapping.p{identity}.maprate.txt
{omics}/9-mapping-profiles/{MINTO_MODE}/mapping.p{identity}.mapstats.txt
{omics}/9-mapping-profiles/{MINTO_MODE}/mapping.p{identity}.multimap.txt
Normalized abundances:
MAG or refgenome:
{omics}/9-mapping-profiles/{MINTO_MODE}/gene_abundances.p{identity}.MG.tsv
{omics}/9-mapping-profiles/{MINTO_MODE}/gene_abundances.p{identity}.TPM.tsv
{omics}/9-mapping-profiles/{MINTO_MODE}/genome_abundances.p{identity}.profile.abund.prop.genome.txt
{omics}/9-mapping-profiles/{MINTO_MODE}/genome_abundances.p{identity}.profile.abund.prop.txt
{omics}/9-mapping-profiles/{MINTO_MODE}/genome_abundances.p{identity}.profile.relabund.prop.genome.txt
Catalog:
{omics}/9-mapping-profiles/catalog/gene_abundances.p{identity}.TPM.tsv
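The placeholders in the paths above can be filled in programmatically to locate a specific output. The sketch below uses plain string formatting on two of the listed path patterns; the helper name expand_output_path and the values MAG and 95 are illustrative assumptions, so substitute the settings of your own run.

```python
# Two of the path patterns listed above, keyed by a short made-up label.
TEMPLATES = {
    "maprate": "{omics}/9-mapping-profiles/{MINTO_MODE}/mapping.p{identity}.maprate.txt",
    "gene_tpm": "{omics}/9-mapping-profiles/{MINTO_MODE}/gene_abundances.p{identity}.TPM.tsv",
}


def expand_output_path(label, **values):
    """Fill the placeholders of the chosen template with the given values."""
    return TEMPLATES[label].format(**values)


if __name__ == "__main__":
    # Assumed example values, not defaults of the pipeline.
    print(expand_output_path("maprate", omics="metaG", MINTO_MODE="MAG", identity=95))
    print(expand_output_path("gene_tpm", omics="metaT", MINTO_MODE="MAG", identity=95))
```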
All outputs are in sub-directories for each combination of the omics, mapping identity threshold and normalization parameters under /output/data_integration/{MINTO_MODE}-genes.
Depending on the omics, the analysis produces:
- Gene profiles and normalised (MG and/or TPM) gene abundance (GA.tsv), transcript (GT.tsv) or expression (GE.tsv). Expression is the ratio of transcript to abundance.
- Function profiles per database and category ({functional_category}), including the function IDs, function description and function abundance (FA.{functional_category}.tsv), transcript (FT.{functional_category}.tsv) or expression (FE.{functional_category}.tsv) normalized counts.
- Phyloseq objects for the gene and function profiles in the phyloseq_obj subfolder.
- PCoA plots of gene and function profiles and a summary of genes and features in the plots subfolder.
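The definition of expression as the ratio of transcript to abundance can be sketched for a handful of genes as follows. This is not the pipeline's implementation: the function name expression_ratio and the example values are made up, and skipping genes with zero or missing abundance is an assumption, since the text does not say how such genes are handled.

```python
def expression_ratio(transcript, abundance):
    """Return transcript/abundance per gene.

    Genes with zero or missing abundance are skipped (an assumption made
    here to avoid division by zero).
    """
    ratios = {}
    for gene, transcript_value in transcript.items():
        abundance_value = abundance.get(gene, 0.0)
        if abundance_value > 0:
            ratios[gene] = transcript_value / abundance_value
    return ratios


if __name__ == "__main__":
    # Hypothetical normalized values for three genes in one sample.
    gt = {"gene1": 10.0, "gene2": 4.0, "gene3": 1.0}
    ga = {"gene1": 5.0, "gene2": 0.0}
    print(expression_ratio(gt, ga))
```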