Input&output v2.1

jszarvas edited this page Oct 1, 2024 · 1 revision

0. Configuration file

Snakemake takes configuration files in YAML format, where each key and its value are separated by a colon and strings do not need to be quoted:

######################
# General settings
######################

PROJECT: IBD_tutorial
working_dir: /mypath/IBD_tutorial
omics: metaG
minto_dir: /mypath/MIntO
METADATA: /mypath/IBD_tutorial/tutorial_metadata.txt
raw_reads_dir: /mypath/IBD_tutorial_raw/metaG

When running QC_0.smk or QC_1.smk, fill in the corresponding yaml file with the required paths and values. See the tutorial for details.
Subsequent modules of the pipeline inherit these values. Most other parameters are set to their defaults in the yaml files generated by the pipeline, but users should review these settings each time, because some of the more computationally intensive features are turned off by default.
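
As a quick sanity check before launching a run, the general settings can be read back and tested for empty entries. The sketch below is only an illustration: it handles flat key: value lines rather than full YAML (Snakemake parses the real file itself), and the helper names and the list of required keys are assumptions based on the example above, not part of MIntO.

```python
# Keys taken from the "General settings" example; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = ["PROJECT", "working_dir", "omics", "minto_dir", "METADATA", "raw_reads_dir"]


def parse_flat_config(text):
    """Parse flat 'key: value' lines, skipping blank lines and '#' comments.

    This is not a YAML parser; nested structures and lists are not supported.
    """
    config = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        key, sep, value = stripped.partition(":")
        if not sep:
            continue  # not a key: value line
        config[key.strip()] = value.strip()
    return config


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or have an empty value."""
    return [key for key in required if not config.get(key)]


EXAMPLE = """\
######################
# General settings
######################

PROJECT: IBD_tutorial
working_dir: /mypath/IBD_tutorial
omics: metaG
minto_dir: /mypath/MIntO
METADATA: /mypath/IBD_tutorial/tutorial_metadata.txt
raw_reads_dir: /mypath/IBD_tutorial_raw/metaG
"""

config = parse_flat_config(EXAMPLE)
print(missing_keys(config))  # prints []
```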

1. Input data format and organization

MIntO requires sequencing data and an accompanying metadata sheet to run.
Sample identifiers should be unique, contain only [A-Za-z0-9_] characters, and agree between the two inputs and also between metaG and metaT.

Sample identifiers containing only numbers and underscores need to be enclosed in single quotes, e.g. '1_1_1' and '1234'.
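
The two identifier rules above can be checked with a pair of regular expressions. This is a minimal sketch; the function names are hypothetical and MIntO's own validation may differ.

```python
import re

# Allowed characters, as stated above.
VALID_ID = re.compile(r"[A-Za-z0-9_]+")
# Identifiers made up only of digits and underscores must be quoted in the yaml file.
NEEDS_QUOTES = re.compile(r"[0-9_]+")


def is_valid_sample_id(sample_id):
    """True if the identifier contains only [A-Za-z0-9_] characters."""
    return VALID_ID.fullmatch(sample_id) is not None


def yaml_form(sample_id):
    """Return the identifier as it should be written in the yaml file."""
    if not is_valid_sample_id(sample_id):
        raise ValueError(f"invalid sample identifier: {sample_id!r}")
    if NEEDS_QUOTES.fullmatch(sample_id):
        return f"'{sample_id}'"
    return sample_id


for sample_id in ["sample1", "1_1_1", "1234"]:
    print(yaml_form(sample_id))
```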

1.1. Metadata sheet

The metadata sheet is a text file with columns separated by tabs and its location has to be specified in METADATA in the Snakemake configuration file QC_0.yaml or QC_1.yaml.

Required columns: sample, sample_alias

One time attribute (sampling time) and up to two categorical attributes can be used for data visualization, for example to create PCoA diagrams and cladograms in QC_2.smk.
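
A metadata sheet can be checked for the required columns before starting the pipeline. In this sketch the day and group columns and all row values are made up for illustration; only sample and sample_alias come from the requirements above, and the uniqueness check follows the rule on sample identifiers.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "sample_alias")


def read_metadata(handle):
    """Read a tab-separated metadata sheet and check columns and sample uniqueness."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError("missing required columns: " + ", ".join(missing))
    rows = list(reader)
    samples = [row["sample"] for row in rows]
    duplicates = sorted({s for s in samples if samples.count(s) > 1})
    if duplicates:
        raise ValueError("duplicated sample identifiers: " + ", ".join(duplicates))
    return rows


# Hypothetical sheet: 'day' and 'group' stand in for a time and a categorical attribute.
SHEET = (
    "sample\tsample_alias\tday\tgroup\n"
    "S1\tpatient1_d0\t0\tcontrol\n"
    "S2\tpatient2_d0\t0\tIBD\n"
)

rows = read_metadata(io.StringIO(SHEET))
print([row["sample"] for row in rows])  # prints ['S1', 'S2']
```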

1.2. Sequence data

MIntO takes both short, paired-end sequencing reads and long sequencing reads as input, and expects them to be in gzipped fastq format.
The pipeline is capable of hybrid assembly of Nanopore and Illumina reads, but the main application is processing short reads.

1.2.1. Sequence data file naming convention

All fastq files must use the same pair of suffixes, which by default are 1.fq.gz for the forward file and 2.fq.gz for the reverse file. Thus, the filenames would be <something>.1.fq.gz and <something>.2.fq.gz, where the <something> string has to agree for the pair of files.
The suffix strings can be customized in QC_0.yaml or QC_1.yaml via the ILLUMINA_suffix parameter.

See the repository-version of the QC_0.yaml file.
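
The naming convention can be illustrated by pairing files on their suffix. This is a sketch under stated assumptions: the function and its arguments are hypothetical, the defaults mirror the default suffixes named above, and the way MIntO itself matches files may differ.

```python
def pair_reads(filenames, forward_suffix="1.fq.gz", reverse_suffix="2.fq.gz"):
    """Pair forward and reverse files that share the same prefix before the suffix.

    Returns (paired, unpaired): a sorted list of (forward, reverse) tuples and a
    sorted list of files that carry a read suffix but have no mate.
    """
    forward, reverse = {}, {}
    for name in filenames:
        if name.endswith(forward_suffix):
            forward[name[: -len(forward_suffix)]] = name
        elif name.endswith(reverse_suffix):
            reverse[name[: -len(reverse_suffix)]] = name
    paired = sorted((forward[p], reverse[p]) for p in forward if p in reverse)
    unpaired = sorted(
        [forward[p] for p in forward if p not in reverse]
        + [reverse[p] for p in reverse if p not in forward]
    )
    return paired, unpaired


files = ["lib1.1.fq.gz", "lib1.2.fq.gz", "lib2.1.fq.gz"]
paired, unpaired = pair_reads(files)
print(paired)    # prints [('lib1.1.fq.gz', 'lib1.2.fq.gz')]
print(unpaired)  # prints ['lib2.1.fq.gz']
```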

1.2.2. Sequence data organization

The location of the input sequencing data has to be specified in raw_reads_dir in QC_0.yaml or QC_1.yaml. The data can be organized in two ways inside this directory:

Samples in individual directories

Samples can be organized into directories, one directory per sample, where the directory name is the same as the sample identifier given in the metadata sheet's sample column, e.g. /somewhere/coolstudy/sample1, /somewhere/coolstudy/sample2, /somewhere/coolstudy/sample3, etc. Multiple sequencing runs for a sample are also allowed, as long as all the runs are placed within the same sample directory. A sample with three sequencing runs might look like this:

coolstudy
|
---sample1
   |
   --- lib1.1.fq.gz
   --- lib1.2.fq.gz
   --- lib2.1.fq.gz
   --- lib2.2.fq.gz
   --- lib3_1.fq.gz
   --- lib3_2.fq.gz
---sample2
   |
   --- sample2_1.fq.gz
   --- sample2_2.fq.gz

Samples that are to be processed need to be listed in the yaml file under ILLUMINA with their sample identifiers.

See the repository-version of the QC_1.yaml file.
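
To check that every listed sample has its own directory, the layout can be walked with pathlib. The sketch below builds a throwaway copy of part of the tree shown above in a temporary directory; the function name is hypothetical, and counting runs by their forward files is an assumption made for this illustration.

```python
import tempfile
from pathlib import Path


def runs_per_sample(raw_reads_dir, sample_ids, forward_suffix="1.fq.gz"):
    """Map each sample identifier to the forward read files found in its directory."""
    found = {}
    for sample in sample_ids:
        sample_dir = Path(raw_reads_dir) / sample
        if not sample_dir.is_dir():
            raise FileNotFoundError(f"no directory for sample {sample!r}")
        found[sample] = sorted(
            path.name for path in sample_dir.iterdir() if path.name.endswith(forward_suffix)
        )
    return found


# Recreate a small version of the example layout with empty placeholder files.
layout = {
    "sample1": ["lib1.1.fq.gz", "lib1.2.fq.gz", "lib2.1.fq.gz", "lib2.2.fq.gz"],
    "sample2": ["sample2_1.fq.gz", "sample2_2.fq.gz"],
}

with tempfile.TemporaryDirectory() as tmp:
    for sample, files in layout.items():
        (Path(tmp) / sample).mkdir()
        for name in files:
            (Path(tmp) / sample / name).touch()
    result = runs_per_sample(tmp, ["sample1", "sample2"])

print(result)
```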

Samples in bulk

In this case, all samples are in raw_reads_dir:

coolstudy
|
--- sample1.1.fq.gz
--- sample1.2.fq.gz
--- sample2.1.fq.gz
--- sample2.2.fq.gz
--- sample3_1.fq.gz
--- sample3_2.fq.gz

Either a) the filenames contain the sample identifiers (e.g. sample1.1.fq.gz) and no sample has multiple runs, in which case the sample identifiers are collected from the metadata sheet's sample column by giving the column name in the ILLUMINA parameter in QC_0.yaml or QC_1.yaml:

# 'column_name': use all samples from the column 'column_name' from the metadata file
ILLUMINA: sample

Or b) there are multiple runs for samples, in which case the samples and their corresponding run identifiers need to be given in an additional tab-separated text file with the columns sample and run, and the ILLUMINA parameter in QC_0.yaml or QC_1.yaml must contain the location of this file:

# Input data
# ILLUMINA section:
ILLUMINA: /path/to/runs_sheet.tsv

See an example of the runs sheet file.
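
A runs sheet with the two columns described above can be read and grouped by sample as follows. This is a sketch only: the run identifiers in the example are made up, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_runs_sheet(handle):
    """Read a tab-separated runs sheet and group run identifiers by sample."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    for column in ("sample", "run"):
        if column not in header:
            raise ValueError(f"runs sheet lacks the '{column}' column")
    runs = defaultdict(list)
    for row in reader:
        runs[row["sample"]].append(row["run"])
    return dict(runs)


# Hypothetical content: two runs for sample1, one for sample2.
RUNS_SHEET = "sample\trun\nsample1\tlib1\nsample1\tlib2\nsample2\trunA\n"

runs = read_runs_sheet(io.StringIO(RUNS_SHEET))
print(runs)  # prints {'sample1': ['lib1', 'lib2'], 'sample2': ['runA']}
```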

2. Outputs

The main outputs are placed into the output folder under the working directory. Software and database versions are printed to text files in output/versions.
Here follows a non-exhaustive list of outputs, with placeholders for various variables, e.g. metaG/metaT and sample identifiers.

2.1. QC_1 or QC_0

Histogram of the read lengths after trimming:
output/1-trimmed/{omics}_cumulative_read_length_cutoff.pdf

Trimmed files:
{omics}/1-trimmed/{sample}
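
The placeholders in these paths behave like Python format fields, so a concrete location can be obtained by filling them in. The sketch below uses the two QC templates above with omics and sample values taken from earlier examples on this page; the helper names are hypothetical.

```python
import string

TEMPLATES = [
    "output/1-trimmed/{omics}_cumulative_read_length_cutoff.pdf",
    "{omics}/1-trimmed/{sample}",
]


def placeholders(template):
    """List the placeholder names that appear in a path template."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]


def expand(template, **values):
    """Fill in the placeholders of a path template; unused values are ignored."""
    return template.format(**values)


for template in TEMPLATES:
    print(placeholders(template), expand(template, omics="metaG", sample="sample1"))
```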

2.2. QC_2

Taxonomic profiling from reads:
  Abundance tables:
output/6-taxa_profile/{omics}.{taxonomy}.{version}.merged_abundance_table.txt
output/6-taxa_profile/{omics}.{taxonomy}.{version}.merged_abundance_table_species.txt
  Filtered to taxa that are present in at least one sample:
output/6-taxa_profile/{omics}.{taxonomy}.{version}.tsv
  Phyloseq object:
output/6-taxa_profile/{omics}.{taxonomy}.{version}_phyloseq.rds
  Plots:
output/6-taxa_profile/{omics}.{taxonomy}.{version}.PCoA.Bray_Curtis.pdf
output/6-taxa_profile/{omics}.{taxonomy}.{version}.Top15genera.pdf
output/6-taxa_profile/{omics}.{taxonomy}.{version}.richness.pdf

Sourmash FracMinHash clusters:
  Barplots of the clustered samples with their top 15 genera:
output/6-1-smash/{omics}.{taxonomy}.{version}.clusters.pdf
  UMAP and cladogram of sample dissimilarities:
output/6-1-smash/{omics}.sourmash_plots.pdf

Cleaned metagenomic reads:
metaG/4-hostfree/{sample}

Cleaned meta-transcriptomic reads:
metaT/5-1-sortmerna/{sample}

2.3. assembly

Illumina single-sample assemblies:
{omics}/7-assembly/{sample}

Illumina co-assemblies:
{omics}/7-assembly/{coassembly}/{assembly_preset}
where {coassembly} is taken from the MAIN_factor attribute and {assembly_preset} is, for example, meta-large or meta-sensitive.

2.4. binning_preparation

Metabat-like depth files over scaffolds/contigs:
{omics}/8-1-binning/depth_{scaf_type}/combined.{min_length}.depth.txt

2.5. mags_generation

Unique MAG collection:
{omics}/8-1-binning/mags_generation_pipeline/unique_genomes

CheckM2 and sequence scores:
{omics}/8-1-binning/mags_generation_pipeline/coverm_unique_cluster_scored.tsv

Taxonomic annotation:
{omics}/8-1-binning/mags_generation_pipeline/taxonomy.phylophlan.SGB.Jun23.tsv
{omics}/8-1-binning/mags_generation_pipeline/taxonomy.gtdb.r220.tsv

2.6. gene_annotation (MAG or refgenome)

Locus id for each genome:
DB/{MINTO_MODE}/1-prokka/locus_id_list.txt

Predicted CDS:
DB/{MINTO_MODE}/2-postprocessed/

Functional annotation of genomes:
DB/{MINTO_MODE}/4-annotations/{ANNOTATION}.tsv

2.7. gene_abundance

Read mapping statistics for each sample:
{omics}/9-mapping-profiles/{MINTO_MODE}/mapping.p{identity}.maprate.txt
{omics}/9-mapping-profiles/{MINTO_MODE}/mapping.p{identity}.mapstats.txt
{omics}/9-mapping-profiles/{MINTO_MODE}/mapping.p{identity}.multimap.txt

Normalized abundances:
  MAG or refgenome:
{omics}/9-mapping-profiles/{MINTO_MODE}/gene_abundances.p{identity}.MG.tsv
{omics}/9-mapping-profiles/{MINTO_MODE}/gene_abundances.p{identity}.TPM.tsv
{omics}/9-mapping-profiles/{MINTO_MODE}/genome_abundances.p{identity}.profile.abund.prop.genome.txt
{omics}/9-mapping-profiles/{MINTO_MODE}/genome_abundances.p{identity}.profile.abund.prop.txt
{omics}/9-mapping-profiles/{MINTO_MODE}/genome_abundances.p{identity}.profile.relabund.prop.genome.txt

  Catalog:
{omics}/9-mapping-profiles/catalog/gene_abundances.p{identity}.TPM.tsv

2.8. data_integration

All outputs are in sub-directories for each combination of the omics, mapping identity threshold and normalization parameters under /output/data_integration/{MINTO_MODE}-genes.

Depending on the omics, the analysis produces:

  • Gene profiles with normalised (MG and/or TPM) gene abundance (GA.tsv), transcript (GT.tsv) or expression (GE.tsv) values. Expression is the ratio of transcript to abundance.
  • Function profiles per database and category ({functional_category}), including the function IDs, function descriptions and normalized counts of function abundance (FA.{functional_category}.tsv), transcript (FT.{functional_category}.tsv) or expression (FE.{functional_category}.tsv).
  • Phyloseq objects for the gene and function profiles, in the phyloseq_obj subfolder.
  • PCoA plots of gene and function profiles and a summary of genes and features, in the plots subfolder.
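
The expression ratio mentioned in the list above can be sketched as a per-gene, per-sample division of transcript by abundance values. All gene identifiers and numbers below are made up, and returning None where the abundance is zero or missing is an assumption of this sketch; how MIntO handles such cases is not stated here.

```python
def expression(transcript, abundance):
    """Compute transcript / abundance for every gene and sample.

    Both inputs map gene -> {sample -> normalized value}. Entries with zero or
    missing abundance yield None (an assumption made for this illustration).
    """
    result = {}
    for gene, per_sample in transcript.items():
        result[gene] = {}
        for sample, gt_value in per_sample.items():
            ga_value = abundance.get(gene, {}).get(sample)
            result[gene][sample] = gt_value / ga_value if ga_value else None
    return result


# Hypothetical normalized values for two genes in two samples.
GT = {"geneA": {"S1": 10.0, "S2": 1.0}, "geneB": {"S1": 3.0, "S2": 8.0}}
GA = {"geneA": {"S1": 5.0, "S2": 2.0}, "geneB": {"S1": 0.0, "S2": 2.0}}

GE = expression(GT, GA)
print(GE["geneA"]["S1"])  # prints 2.0
print(GE["geneB"]["S1"])  # prints None
```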