MIntO Tutorial
We assume that MIntO has been installed at `/server/apps/MIntO` and the pipeline dependencies have been downloaded using `dependencies.smk` (take a look at installation & dependencies). We will denote this installation directory as `$MINTO_DIR` in this tutorial.

Our working directory for the tutorial will be `/mypath`. Raw data will be located in `/mypath/IBD_tutorial_raw`, and data processing will happen in `/mypath/IBD_tutorial`.
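In shell terms, and assuming the installation path above, this corresponds to:

```bash
export MINTO_DIR=/server/apps/MIntO
```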
For this tutorial, we are going to use a subset of 10 gut microbiome samples, with matching Illumina metagenomic (metaG) and metatranscriptomic (metaT) data, to illustrate the use of MIntO.
- metaG samples comprise 7.7 million inserts in total (between 628,436 and 834,494 inserts per sample).
- metaT samples comprise 11.5 million inserts in total (between 466,724 and 1,602,400 inserts per sample).

The subset of 10 gut microbiome samples, comprising the raw reads needed to run this tutorial, is available at IBD_tutorial_raw_v2.0.0.tar.gz.
We benchmarked MIntO using a 64-bit Linux server with 2x AMD EPYC 7742 64-core processors and 2 TB of memory. MIntO v2 has been heavily optimized for speed to enable analyzing large datasets. Here's a comparison of runtimes between MIntO v1 and MIntO v2, when restricting snakemake to use 700 GB of memory and 96 CPUs:
| Step | MIntO v1 | MIntO v2 (metaG) | MIntO v2 (metaT) |
|---|---|---|---|
| QC_1.smk | 4 min | 1 min | 1 min |
| QC_2.smk | 20 min | 5 min | 8 min |
| assembly.smk | 30 min | 25 min | NA |
| binning_preparation.smk | 7 min | 3 min | NA |
| mags_generation.smk | 180 min | 13 min | NA |
| gene_annotation.smk | 15 min | 11 min | NA |
| gene_abundance.smk | 7 min | 6 min | 3 min |
| integration.smk | 5 min | 1 min | NA |
| **Total** | 268 min | 65 min | 22 min |
MIntO v2 finished the analysis in just over 1 hour, whereas MIntO v1 needed more than 4 hours.

Our recommendation is to use a 64-bit Linux server with at least 8 CPUs and 64 GB RAM. More CPUs and memory will shorten the wall-clock execution time.
Let us download the data to our working directory `/mypath`. The following commands will prepare the data for you.

```bash
cd /mypath
wget https://zenodo.org/record/8320216/files/IBD_tutorial_raw_v2.0.0.tar.gz
tar xfz IBD_tutorial_raw_v2.0.0.tar.gz
```
The output looks like this:
```
[user@server /]$ cd /mypath
[user@server mypath]$ wget --no-verbose https://zenodo.org/record/8320216/files/IBD_tutorial_raw_v2.0.0.tar.gz
--2023-09-05 23:14:13--  https://zenodo.org/record/8320216/files/IBD_tutorial_raw_v2.0.0.tar.gz
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2175851966 (2.0G) [application/octet-stream]
Saving to: ‘IBD_tutorial_raw_v2.0.0.tar.gz’

IBD_tutorial_raw_v2.0.0.tar.gz 100%[===============================================>]   2.03G  68.9MB/s    in 4m 15s

2023-09-05 23:18:29 (8.13 MB/s) - ‘IBD_tutorial_raw_v2.0.0.tar.gz’ saved [2175851966/2175851966]

[user@server mypath]$ tar xfz IBD_tutorial_raw_v2.0.0.tar.gz
[user@server mypath]$ cd IBD_tutorial_raw
[user@server IBD_tutorial_raw]$ ls
metaG  metaT
[user@server IBD_tutorial_raw]$ ls *
metaG:
CD136  CD138  CD140  CD142  CD146  CD237  CD238  CD240  CD242  CD244

metaT:
CD136  CD138  CD140  CD142  CD146  CD237  CD238  CD240  CD242  CD244
[user@server IBD_tutorial_raw]$ ls metaG/*/*
metaG/CD136/run_aa.1.fq.gz  metaG/CD138/run_ab_1.fq.gz  metaG/CD142/run1.1.fq.gz   metaG/CD240/CD240.1.fq.gz
metaG/CD136/run_aa.2.fq.gz  metaG/CD138/run_ab_2.fq.gz  metaG/CD142/run1.2.fq.gz   metaG/CD240/CD240.2.fq.gz
metaG/CD136/run_ab.1.fq.gz  metaG/CD138/run_ac_1.fq.gz  metaG/CD146/CD146.1.fq.gz  metaG/CD242/CD242.1.fq.gz
metaG/CD136/run_ab.2.fq.gz  metaG/CD138/run_ac_2.fq.gz  metaG/CD146/CD146.2.fq.gz  metaG/CD242/CD242.2.fq.gz
metaG/CD136/run_ac.1.fq.gz  metaG/CD138/run_ad_1.fq.gz  metaG/CD237/CD237.1.fq.gz  metaG/CD244/CD244.1.fq.gz
metaG/CD136/run_ac.2.fq.gz  metaG/CD138/run_ad_2.fq.gz  metaG/CD237/CD237.2.fq.gz  metaG/CD244/CD244.2.fq.gz
metaG/CD138/run_aa_1.fq.gz  metaG/CD140/CD140.1.fq.gz   metaG/CD238/CD238.1.fq.gz
metaG/CD138/run_aa_2.fq.gz  metaG/CD140/CD140.2.fq.gz   metaG/CD238/CD238.2.fq.gz
[user@server IBD_tutorial_raw]$ ls metaT/*/*
metaT/CD136/CD136.1.fq.gz  metaT/CD140/CD140.2.fq.gz  metaT/CD237/CD237.1.fq.gz  metaT/CD240/CD240.2.fq.gz
metaT/CD136/CD136.2.fq.gz  metaT/CD142/CD142.1.fq.gz  metaT/CD237/CD237.2.fq.gz  metaT/CD242/CD242.1.fq.gz
metaT/CD138/CD138.1.fq.gz  metaT/CD142/CD142.2.fq.gz  metaT/CD238/CD238.1.fq.gz  metaT/CD242/CD242.2.fq.gz
metaT/CD138/CD138.2.fq.gz  metaT/CD146/CD146.1.fq.gz  metaT/CD238/CD238.2.fq.gz  metaT/CD244/CD244.1.fq.gz
metaT/CD140/CD140.1.fq.gz  metaT/CD146/CD146.2.fq.gz  metaT/CD240/CD240.1.fq.gz  metaT/CD244/CD244.2.fq.gz
```
The files listed in Section 2.1 come from a subset of 10 samples from the Inflammatory Bowel Disease Multi’omics Database (IBDMDB), corresponding to two participants with Crohn’s disease (`CD1` and `CD2`).

The files are organized into sub-directories: each sample has its own directory, sometimes containing multiple runs of the same sequencing library. Paired files have the same name prefix and the extensions `.1.fq.gz` and `.2.fq.gz` for forward and reverse reads, respectively.

With the sample identifiers (IDs) listed in the configuration file, MIntO is capable of collecting and processing the input from this setup. For other input options, see input & output.
```
metaG/CD136/run_aa.1.fq.gz
metaG/CD136/run_aa.2.fq.gz
metaG/CD136/run_ab.1.fq.gz
metaG/CD136/run_ab.2.fq.gz
metaG/CD136/run_ac.1.fq.gz
metaG/CD136/run_ac.2.fq.gz
metaG/CD138/run_aa_1.fq.gz
metaG/CD138/run_aa_2.fq.gz
metaG/CD138/run_ab_1.fq.gz
metaG/CD138/run_ab_2.fq.gz
metaG/CD138/run_ac_1.fq.gz
metaG/CD138/run_ac_2.fq.gz
metaG/CD138/run_ad_1.fq.gz
metaG/CD138/run_ad_2.fq.gz
metaG/CD140/CD140.1.fq.gz
metaG/CD140/CD140.2.fq.gz
metaG/CD142/run1.1.fq.gz
metaG/CD142/run1.2.fq.gz
metaG/CD146/CD146.1.fq.gz
metaG/CD146/CD146.2.fq.gz
```
Original files from `CD1` were derived from just one sequencing run each. We split them into multiple runs to test MIntO's ability to handle multiple runs. We also renamed the `CD138` files as `*_1.fq.gz` and `*_2.fq.gz` to test MIntO's ability to handle different file-naming conventions.
```
metaG/CD237/CD237.1.fq.gz
metaG/CD237/CD237.2.fq.gz
metaG/CD238/CD238.1.fq.gz
metaG/CD238/CD238.2.fq.gz
metaG/CD240/CD240.1.fq.gz
metaG/CD240/CD240.2.fq.gz
metaG/CD242/CD242.1.fq.gz
metaG/CD242/CD242.2.fq.gz
metaG/CD244/CD244.1.fq.gz
metaG/CD244/CD244.2.fq.gz
metaT/CD136/CD136.1.fq.gz
metaT/CD136/CD136.2.fq.gz
metaT/CD138/CD138.1.fq.gz
metaT/CD138/CD138.2.fq.gz
metaT/CD140/CD140.1.fq.gz
metaT/CD140/CD140.2.fq.gz
metaT/CD142/CD142.1.fq.gz
metaT/CD142/CD142.2.fq.gz
metaT/CD146/CD146.1.fq.gz
metaT/CD146/CD146.2.fq.gz
metaT/CD237/CD237.1.fq.gz
metaT/CD237/CD237.2.fq.gz
metaT/CD238/CD238.1.fq.gz
metaT/CD238/CD238.2.fq.gz
metaT/CD240/CD240.1.fq.gz
metaT/CD240/CD240.2.fq.gz
metaT/CD242/CD242.1.fq.gz
metaT/CD242/CD242.2.fq.gz
metaT/CD244/CD244.1.fq.gz
metaT/CD244/CD244.2.fq.gz
```
MIntO expects input as gzipped fastq files. Pairs from the same run should have matching name prefixes. The default extensions are `.1.fq.gz` and `.2.fq.gz` for the forward and reverse reads, respectively. MIntO can collect each sample from its own directory (named by sample ID), but other input options are also available. MIntO expects matching sample IDs between metaG and metaT reads in order to pair the data and generate gene and function expression profiles.
Let us copy metadata from the MIntO repository.

```
[user@server mypath]$ mkdir /mypath/IBD_tutorial
[user@server mypath]$ cp $MINTO_DIR/tutorial/metadata/tutorial_metadata.txt /mypath/IBD_tutorial
```
The samples above have been filtered from the original ones and only include reads matching 5 genomes:
- Faecalibacterium prausnitzii
- Parabacteroides distasonis
- Bacteroides fragilis
- Ruminococcus bromii
- Bacteroides uniformis

Here, we are going to retrieve these MAGs from metaG using the genome-based assembly-dependent mode, and then integrate metaG and metaT for matching samples.
Let's go!
In this tutorial, we provide commands to perform the analysis on a standalone server without a queuing system. If you have `slurm` on your cluster, you can replace `--cores 8` with `--cluster "sbatch -J {name} --mem={resources.mem}G -c {threads} -e slurm-%x.e%A -o slurm-%x.o%A" --local-cores 8` in all `snakemake` commands to let `slurm` manage your jobs. This will create output and error files for each snakemake rule run via slurm as `slurm-<rulename>.o<JobID>` and `slurm-<rulename>.e<JobID>`, respectively.
You can take a closer look at how to launch MIntO with different cluster queuing systems in the snakemake documentation.
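As a sketch, a QC_1 run submitted through slurm would then look like this (flags taken from the substitution above; adjust the snakefile and config file to the step you are running):

```bash
# Sketch: let slurm schedule the individual snakemake rules
snakemake --use-conda --keep-going --latency-wait 60 --jobs 4 \
    --cluster "sbatch -J {name} --mem={resources.mem}G -c {threads} -e slurm-%x.e%A -o slurm-%x.o%A" \
    --local-cores 8 \
    --snakefile $MINTO_DIR/smk/QC_1.smk --configfile QC_1.yaml
```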
You can add `--dry-run -p` to the snakemake command line for a dry run that lists what will happen.
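For instance, to preview the QC_1 step without executing anything (using the `$SNAKE_PARAMS` variable defined below; any snakemake command in this tutorial can be previewed the same way):

```bash
# Dry run: print the jobs and shell commands that would be executed
snakemake $SNAKE_PARAMS --dry-run -p \
    --snakefile $MINTO_DIR/smk/QC_1.smk --configfile QC_1.yaml
```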
For runtime benchmarking in this tutorial, we used the following arguments to `snakemake`:

```
--use-conda --restart-times 1 --keep-going --latency-wait 60 --jobs 20 --cores 96 --resources mem=700 --conda-prefix $MINTO_DIR/conda_env --shadow-prefix /scratch/johndoe
```

Here's the explanation for these parameters. The options `--use-conda --restart-times 1 --keep-going --latency-wait 60` tell `snakemake` to:
- use `conda` environments (MIntO uses them a lot!),
- try re-running a failed rule a second time (just in case!),
- continue running the workflow even if some rules fail (imagine going home for the night and coming back the next day to find that the workflow quit 10 minutes after you left?),
- wait for 60 seconds before giving up on an output file being produced (this could be due to NFS delays).
For the tutorial, we also use:
- `--jobs 4` to ask `snakemake` to run a maximum of 4 jobs at a time,
- `--cores 16` to ask `snakemake` to use a maximum of 16 CPUs at any time (this limit applies across the entire workflow at any given time),
- `--conda-prefix` to specify where to install the `conda` environments.
We also use the shadow directive from snakemake, which is a very elegant way to run I/O-intensive tasks on the local disk of an execution server (see the snakemake shadow rule documentation). For this to work, the user should provide a temporary location on the local disk with enough space using the `--shadow-prefix` option. Many compute clusters provide a fast disk in `/scratch`. Snakemake will use a unique directory for each rule. The advantage of using the shadow directive is that you do not need to clean up after you are done: the entire directory will be deleted after the rule completes.
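Assuming your cluster provides a writable `/scratch` (the location used throughout this tutorial), make sure the shadow location exists before launching:

```bash
mkdir -p /scratch/johndoe
```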
For ease of use, we will create a variable `$SNAKE_PARAMS` with these options and use it in the snakemake commands.

```
[user@server mypath]$ SNAKE_PARAMS="--use-conda --restart-times 1 --keep-going --latency-wait 60 --jobs 4 --cores 16 --resources mem=120 --conda-prefix $MINTO_DIR/conda_env --shadow-prefix /scratch/johndoe"
```
QC_1 stands for Quality Control Part 1, which first removes sequencing adapters and low-quality bases. `QC_1.smk` requires the configuration file `QC_1.yaml`, which has to be filled out by the user. You can find an example configuration file at `$MINTO_DIR/configuration/QC_1.yaml` and the one used to run this tutorial here!

We will take a look at the `# General settings` and `# ILLUMINA` sections; please find out more about the other parameters included in `QC_1.yaml` here.

Here's the `# General settings` section for metaG analysis.
```yaml
######################
# General settings
######################
PROJECT: IBD_tutorial
working_dir: /mypath/IBD_tutorial
omics: metaG
minto_dir: /server/apps/MIntO
METADATA: /mypath/IBD_tutorial/tutorial_metadata.txt
raw_reads_dir: /mypath/IBD_tutorial_raw/metaG
```
And here's the `# General settings` section for metaT analysis.

```yaml
######################
# General settings
######################
PROJECT: IBD_tutorial
working_dir: /mypath/IBD_tutorial
omics: metaT
minto_dir: /server/apps/MIntO
METADATA: /mypath/IBD_tutorial/tutorial_metadata.txt
raw_reads_dir: /mypath/IBD_tutorial_raw/metaT
```
The different variables or fields are described below.

| Field | Description |
|---|---|
| PROJECT | Name of the project. Use the same project name for a project with both metaG and metaT. |
| working_dir | Path to the project working directory, where MIntO will generate the outputs. Use the same location for a project with both metaG and metaT. |
| omics | `metaG` or `metaT` |
| minto_dir | Path to the directory where you have installed MIntO; in this case our `$MINTO_DIR` |
| METADATA | Path to the file with sample metadata |
| raw_reads_dir | Path to the directory with metaG or metaT raw reads |
Don't forget to replace `/mypath/` with your path!

Use the same `working_dir` for both metaG and metaT from the same study, so that they can be integrated into gene expression profiles.
This is a list of sample identifiers in `yaml` format. Each sample ID is given on its own line, prepended by `-`.

```yaml
ILLUMINA:
- CD136
- CD138
- CD140
- CD142
- CD146
- CD237
- CD238
- CD240
- CD242
- CD244
```
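The sample IDs must match the per-sample directory names in the raw reads directory; you can cross-check them like this:

```bash
ls /mypath/IBD_tutorial_raw/metaG
# CD136  CD138  CD140  CD142  CD146  CD237  CD238  CD240  CD242  CD244
```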
Once the configuration file metaG/QC_1.yaml has been copied to the working directory and properly filled out, `QC_1.smk` can be launched for metaG:

```
[user@server mypath]$ mkdir /mypath/IBD_tutorial/metaG
[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ cp $MINTO_DIR/tutorial/configuration/metaG/QC_1.yaml .
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/QC_1.smk --configfile QC_1.yaml
```
For metaT, `QC_1.smk` can be launched after copying the configuration file metaT/QC_1.yaml and filling it out properly:

```
[user@server mypath]$ mkdir /mypath/IBD_tutorial/metaT
[user@server mypath]$ cd /mypath/IBD_tutorial/metaT
[user@server metaT]$ cp $MINTO_DIR/tutorial/configuration/metaT/QC_1.yaml .
[user@server metaT]$ conda activate MIntO
(MIntO) [user@server metaT]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/QC_1.smk --configfile QC_1.yaml
```
Once `QC_1.smk` finishes running, MIntO generates the directory `/mypath/IBD_tutorial/metaG/1-trimmed` and a partially filled-out `/mypath/IBD_tutorial/metaG/QC_2.yaml` configuration file (or the corresponding files in `/mypath/IBD_tutorial/metaT`) to run the next step for each omics. Here you can see what the `QC_2.yaml` file looks like.

Importantly, `QC_1.smk` estimates the value of the `READ_minlen` field in `QC_2.yaml` and fills it in for you. This is the read-length cutoff that will retain `perc_remaining_reads` percent of the reads (by default, 95% of reads are retained).
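You can quickly inspect the estimated value in the generated file:

```bash
grep READ_minlen /mypath/IBD_tutorial/metaG/QC_2.yaml
# READ_minlen: 134
```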
The second part of the quality control removes reads that are shorter than `READ_minlen`, removes host-derived sequences, and then performs read-based taxonomic profiling.
Take a look at the complete `QC_2.yaml` file here.

For both metaG and metaT, this section is identical to that in `QC_1.yaml`. It is also autofilled by `QC_1.smk` for you.
This section sets the `READ_minlen` parameter for length trimming using the `seqkit` package. It is also autofilled by `QC_1.smk` for you; normally, you need not edit it. For the tutorial, the cutoff is 134.

```yaml
#########################
# Read length filtering
#########################
READ_minlen: 134
```
This section sets the parameters for filtering host reads. Here's an example:

```yaml
# Host genome filtering
PATH_host_genome: /mypath/IBD_tutorial
NAME_host_genome: build_hg18_subset.fna
BWA_index_host_memory: 100
BWA_host_threads: 8
BWA_host_memory: 40
```
Let us dissect this.

| Field | Auto-filled? | Description |
|---|---|---|
| PATH_host_genome | No | Location of the host genome bwa2 database. We highly recommend using a local disk for the host genome database; this will make the alignments much faster. If you are using a cluster, please make sure that the local disk is accessible at the same path across all execution servers. |
| NAME_host_genome | No | Name of the database (prefix used when building the index with `bwa-mem2 index`) |
| BWA_index_host_memory | Yes | Estimated memory usage (in GB) by `bwa-mem2 index` |
| BWA_host_threads | Yes | Number of threads when aligning using `bwa-mem2 mem` |
| BWA_host_memory | Yes | Estimated memory usage by `bwa-mem2 mem` |
BWA program parameters (the `BWA_*` fields) are autofilled by `QC_1.smk` with default values suitable for filtering against the human genome. Normally, you need not edit these fields. However, the `*_host_genome` fields need to be filled in by the user.
There are two scenarios that can happen with host genome filtering.

- Scenario 1: `<PATH_host_genome>/BWA_index/<NAME_host_genome>.bwt.2bit.64` exists. Since the index exists, it will not be created. Alignment can proceed.
- Scenario 2: `<PATH_host_genome>/<NAME_host_genome>` exists. If the `.bwt.2bit.64` file (and others) are missing, then `bwa-mem2 index` will be used to create the database at `<PATH_host_genome>/BWA_index/<NAME_host_genome>.bwt.2bit.64`.
The samples for this tutorial come from human fecal metagenomes already filtered for host-derived sequences. Thus, there is no need to index the entire human genome (build hg38) and map the reads to it. For a fast test of how this step works, we will map the reads to a subset of the human genome (`$MINTO_DIR/tutorial/build_hg18_subset.fna`).

```
[user@server mypath]$ cd /mypath/IBD_tutorial
[user@server IBD_tutorial]$ cp $MINTO_DIR/tutorial/build_hg18_subset.fna /mypath/IBD_tutorial
```
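If you prefer Scenario 1 above, you could also pre-build the index yourself so the workflow skips indexing; a sketch (`-p` sets the index prefix in standard `bwa-mem2 index` usage):

```bash
# Optional: pre-build the host genome index at the location MIntO expects
mkdir -p /mypath/IBD_tutorial/BWA_index
bwa-mem2 index -p /mypath/IBD_tutorial/BWA_index/build_hg18_subset.fna \
    /mypath/IBD_tutorial/build_hg18_subset.fna
```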
For `omics: metaT`, ribosomal RNA sequences will be removed from the metaT reads using the sortmeRNA software. For this, the user has to specify the path to the ribosomal RNA database. Here are the fields:

```yaml
#########################
# Ribosomal RNA depletion
#########################
sortmeRNA_threads: 8
sortmeRNA_memory: 10
sortmeRNA_db: /server/apps/MIntO/data/rRNA_databases
sortmeRNA_db_idx: /server/apps/MIntO/data/rRNA_databases/idx
```
If the ribosomal RNA database has been downloaded using `dependencies.smk`, the database will be located at `$MINTO_DIR/data/rRNA_databases` and the database index at `$MINTO_DIR/data/rRNA_databases/idx`. These values are autofilled by `QC_1.smk`. If you downloaded them manually, please update these fields accordingly.
The user has to select the taxonomic profiling tool (MetaPhlAn or mOTUs3). By default, `QC_1.smk` autofills the profiler name as `motus_rel,metaphlan` (motus_rel: relative abundance from mOTUs3; metaphlan: relative abundance from MetaPhlAn), which is what this tutorial uses. A third option, `motus_raw`, reports estimated raw read counts for each species from mOTUs3.

While taxonomic profiling is normally performed for metaG, MIntO can also perform it for metaT data. If you do not need this for metaT, just remove the section from `QC_2.yaml`.
```yaml
##################################
# Assembly-free taxonomy profiling
##################################
TAXA_threads: 8
TAXA_memory: 10
taxa_profile: motus_rel,metaphlan
metaphlan_version: 4.0.6
motus_version: 3.0.3
```
For metaG samples, dissimilarity can be computed on k-mer abundances using `sourmash`. The cutoff value is used to create clusters of highly similar samples for co-abundance binning; these clusters will have the prefix SCL. The cutoff can be set to `0` to turn this feature off.

```yaml
#########################
# K-mer based comparison
#########################
SOURMASH_min_abund: 2
SOURMASH_max_abund: 1000
SOURMASH_cutoff: 0.40
```
At the end of the QC_2 step, alpha- and beta-diversity plots will be generated. Visualization can be enhanced using metadata variables, e.g. coloring points by some feature. Further `yaml` files for other steps will also be created using the main analysis variable. These need to be defined in `QC_2.yaml`.

The main variable/factor of interest in the study will be used to define co-assemblies and to color the points in visualizations. It is required and must also be defined in the metadata file. Other variables are optional. The second factor will be used to shape the points. The column headers or variable names in the metadata file denoted by the `METADATA` field can be passed via `QC_2.yaml` for this purpose.
For example, in a case-control cohort, `MAIN_factor` could be `disease_status`, and `PLOT_factor2` could be `sample_site`. For time-series data, visualizing metrics over time can be useful.
| Field | Required | Auto-filled? | Description |
|---|---|---|---|
| MAIN_factor | Yes | No | the main factor in the metadata file to differentiate in visualization (using color) |
| PLOT_factor2 | No | No | the second factor in the metadata file to differentiate in visualization (using shape) |
| PLOT_time | No | No | name of the factor in the metadata file denoting time (e.g. hour, day) |
In this tutorial, we will use subject (variable `participant_ID2`) as `MAIN_factor` and week (variable `week_n`) as `PLOT_time`.

```yaml
#####################
# Analysis parameters
#####################
MAIN_factor: participant_ID2
PLOT_factor2:
PLOT_time: week_n
```
For both metaG and metaT, this section is identical to that in `QC_1.yaml`. It is also autofilled by `QC_1.smk` for you. Normally, you need not edit these fields at this step.
Once the configuration file has been filled out for metaG, `QC_2.smk` can be launched:

```
[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/QC_2.smk --configfile QC_2.yaml
```
For metaT:

```
[user@server mypath]$ cd /mypath/IBD_tutorial/metaT
[user@server metaT]$ conda activate MIntO
(MIntO) [user@server metaT]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/QC_2.smk --configfile QC_2.yaml
```
After the second quality control (`QC_2.smk`), the configuration file `assembly.yaml` will be generated ONLY in the `/mypath/IBD_tutorial/metaG` directory.

This step assembles reads into contigs, so that we can reconstruct metagenome-assembled genomes (MAGs). This step is only required in the genome-based assembly-dependent mode.
The `assembly.yaml` file is shared between this step and the next step (`assembly.smk` and `binning_preparation.smk`).
In `assembly.yaml` there are different fields to set up (have a look here) for the following assemblers:
- MetaSPAdes for (i) individual assembly of short reads and (ii) hybrid assembly of short and long reads
- MEGAHIT for co-assembly of short reads
- MetaFlye for individual assembly of long reads

Let us have a look.
```yaml
# MetaSPAdes settings
#
METASPADES_qoffset: auto
METASPADES_threads: 16
METASPADES_memory: 10
METASPADES_hybrid_max_k: 99
METASPADES_illumina_max_k: 99

# MEGAHIT settings
#
# Note on MEGAHIT_memory:
#   MEGAHIT's memory requirement scales with sample count and sample size.
#   Default MEGAHIT_memory below is 10 Gigabytes per sample (plenty for metaG samples of 10M reads).
#   MEGAHIT_memory parameter represents per-sample memory in Gigabytes
#   during the first attempt. If it is not enough and MEGAHIT fails, then
#   each successive attempt will increase per-sample memory by 6 Gigabytes
#   until it succeeds or reaches max_attempts from snakemake.
#   Please make sure that there is enough RAM on the server.
# Custom k-list for assembly: k must be odd, in the range 15-255, increment <= 28, e.g. 21,29,39,59,79,99,119,141
MEGAHIT_memory: 10
MEGAHIT_threads: 32
MEGAHIT_presets:
  - meta-sensitive
  - meta-large
#MEGAHIT_custom:
#  -
```
The metaSPAdes max k sizes are set for the typical short-read length of 150 bp. For shorter read lengths, metaSPAdes automatically adjusts them lower.

The MEGAHIT preset `meta-sensitive` is for assembling lower-complexity metagenomes at higher granularity, while `meta-large` is for assembling more complex metagenomes while capturing less strain variance. The user can also supply a list of k-mers via the `MEGAHIT_custom` parameter after uncommenting those lines, as sketched below.
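For example, using the custom k-list from the comment above (illustrative values; the list format is an assumption based on the commented-out template), you would uncomment and fill in:

```yaml
MEGAHIT_custom:
  - 21,29,39,59,79,99,119,141
```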
In this tutorial, only the first two assembly types (MetaSPAdes individual assembly and MEGAHIT co-assembly) will be run, because we only have short reads.

MEGAHIT is used to run the co-assembly of all reads (Full) and the co-assembly per participant (CD1 and CD2). Per-participant co-assembly is organized by grouping samples via `participant_ID2`, which is our `MAIN_factor`.

By default, MIntO generates these coassembly definitions, but sets `enable_COASSEMBLY` to `no`.
```yaml
###############################
# COASSEMBLY section:
###############################
# If enable_COASSEMBLY is "yes", MEGAHIT coassembly will be performed using the following definitions.
# Each coassembly is named in the LHS, and corresponding illumina sample(s) are in RHS (delimited by '+').
# One coassembly will be performed for each line.
# E.g. 'Subject1: I3+I4' will result in 1 coassembly: 'Subject1' using I3 and I4 data.
# Memory per coassembly is calculated to be 10G per sample in the coassembly.
# Please make sure that there is enough RAM on the server.
#
enable_COASSEMBLY: yes
COASSEMBLY:
  Full: CD136+CD138+CD140+CD142+CD146+CD237+CD238+CD240+CD242+CD244
  CD1: CD136+CD138+CD140+CD142+CD146
  CD2: CD237+CD238+CD240+CD242+CD244
```
Please note the `enable_COASSEMBLY` field. For large datasets with hundreds of samples, enabling coassembly of all samples may not be a good idea: it will take too much time and too many resources. You can turn co-assembly on by changing `enable_COASSEMBLY` to `yes`.

If you have time-series data with 1000+ samples where per-participant coassemblies are manageable (not more than 20-30 samples each), then you can just remove the `Full` coassembly of all samples and leave the rest, as sketched below.
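For example, a per-participant-only configuration (using this tutorial's sample names) would look like:

```yaml
enable_COASSEMBLY: yes
COASSEMBLY:
  CD1: CD136+CD138+CD140+CD142+CD146
  CD2: CD237+CD238+CD240+CD242+CD244
```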
The complete `assembly.yaml` file for this tutorial is here.

You can launch `assembly.smk` using:

```
[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/assembly.smk --configfile assembly.yaml
```
This step is run after the previous assembly step finishes successfully. It maps reads from each sample to all assemblies from the previous step, to prepare a contig depth file for MAG reconstruction. This step is only required in the genome-based assembly-dependent mode.

The `assembly.yaml` file is shared between the previous step and this step (`assembly.smk` and `binning_preparation.smk`).

Here's the part relevant for binning preparation:
```yaml
###############################
# Binning preparation settings
###############################

# Whether to use contigs or scaffolds from SPAdes
SPADES_CONTIGS_OR_SCAFFOLDS: contigs

# minimum contig/scaffold fasta length
MIN_FASTA_LENGTH: 2500

# assembly batch size for mapping reads to combined contig sets
CONTIG_MAPPING_BATCH_SIZE: 100

# Should we exclude any assembly type during MAG generation?
# E.g., if you make MAGs from metaT, individual sample assemblies
# will be quite fragmented. If you have several co-assemblies, then
# ignoring 'illumina_single' might improve your MAG quality. This can
# be achieved using:
# ---
# EXCLUDE_ASSEMBLY_TYPES:
#   - illumina_single
# ---
#
# EXCLUDE_ASSEMBLY_TYPES:

# Contig-depth: bwa-mem2 settings
# Used when mapping reads back to contigs
#
BWA_threads: 10
BWA_memory: 25
BWA_index_memory: 50

# Contig-depth: samtools sort settings
# Used when sorting bam files
# memory listed below is PER-THREAD.
SAMTOOLS_sort_threads: 2
SAMTOOLS_sort_perthread_memgb: 10
```
Users can choose whether to use contigs or scaffolds from `metaSPAdes`. They can also choose the minimum length of a scaffold/contig to be included in binning.

Generating the contig depth file is the most resource-consuming step in MAG reconstruction. The default resource parameters generated in the `assembly.yaml` file work well for this tutorial. However, you will need to adjust these settings for real metagenome datasets.
Given `N` samples and their assemblies, we map each sample against a combined database of all `N` assemblies. The resulting `bam` output is first filtered for alignments over 95% sequence identity, then sorted by coordinates, and finally summarized into a contig depth file.

Imagine a cohort with 200 such samples. Even at the 95% threshold, reads from a given species in sample `X` will map to all contigs belonging to the same species across multiple samples. For prevalent and abundant species such as Bacteroides vulgatus, each read might map to >100 contigs representing different strains from different individuals. All this information is very useful during the MAG reconstruction process. This might sound innocuous, but a typical 9 Gb human fecal metagenome with 30 million paired-end reads produces assemblies with 150,000 scaffolds and 100-200 Mb of sequence. We need to perform 200 mapping jobs, each mapping 30 million paired-end reads against 30,000,000 contigs with 20-40 Gb total length. For this step, `bwa-mem2 mem` is run with the `-a` option, which lists all hits over a threshold (its default behavior otherwise is to randomly select among best hits). With multiple hits per read, the resulting `bam` output will be quite large, and sorting it in memory efficiently using `samtools sort` requires lots of memory.
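To make this concrete, here is an illustrative sketch of one such mapping job (not MIntO's exact internal command: the tools, the `-a` option, the 95% identity filter, and the piping through `msamtools` come from the description in this tutorial, while the exact flags and file names are assumptions):

```bash
# Illustrative only: map one sample against a combined contig set,
# keep all hits above threshold (-a), filter alignments at >=95%
# identity with msamtools, then coordinate-sort the BAM output.
bwa-mem2 mem -a -t 10 combined_contigs.fna sample.1.fq.gz sample.2.fq.gz \
  | msamtools filter -b -p 95 - \
  | samtools sort -@ 2 -m 10G -o sample.sorted.bam -
```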
Based on extensive experience with large cohorts (e.g., 200 samples each with 30 million paired-end reads), we have optimized the following parameters for `bwa-mem2` and `samtools sort`.

```yaml
# Contig-depth: bwa-mem2 settings
# Used when mapping reads back to contigs
#
BWA_threads: 24
BWA_memory: 70
BWA_index_memory: 300

# Contig-depth: samtools sort settings
# Used when sorting bam files
# memory listed below is PER-THREAD.
SAMTOOLS_sort_threads: 3
SAMTOOLS_sort_perthread_memgb: 70
```
The above parameters are optimized for our cluster, where each server has 2 TB RAM and 192 CPUs. When using a batch size of 50 assemblies, `bwa-mem2` needs ~70 GB RAM to map a sample to a batch. Its output is piped to `msamtools` for filtering, which does not use much RAM. The filtered output is piped to `samtools sort` with 3 CPUs, each using 70 GB RAM. This means that mapping one sample to one batch needs 280 GB RAM in total, which allows us to run 7 jobs on a server (occupying 1960 GB RAM, with some room as buffer). In order not to leave CPUs idle, we then share the 192 CPUs across these 7 jobs, which gives us 27 CPUs per job. Reserving 3 CPUs for `samtools sort`, we let `bwa-mem2` have 24 CPUs. With this setup, each job takes 1-2 hours on average. In the above example of 200 samples, each sample is mapped to 4 batches of 50 assemblies each, so there are 800 mapping tasks. At a throughput of 7 jobs every 1-2 hours, it would take up to 7 days to complete.
Users should optimize these parameters for their own scenario. If you do not have a lot of RAM, then consider reducing the number of threads in `SAMTOOLS_sort_threads`. It will take longer, as sorting is the bottleneck in this case.
Once the `assembly.yaml` file is optimized for your setup, you can launch `binning_preparation.smk` as follows:

```
[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/binning_preparation.smk --configfile assembly.yaml
```
When `binning_preparation.smk` has finished, the configuration file `mags_generation.yaml` will be available at `/mypath/IBD_tutorial/metaG`.
This step generates MAGs using `AVAMB`. The MAGs will be checked for quality with `CheckM2` and clustered with `CoverM` in order to obtain a subset of representative genomes of the community. Here, we hope to retrieve near-complete MAGs for the 5 species we included.
Several options for MAG generation are encoded in the config file `mags_generation.yaml`.

First, general parameters for MAGs:

```yaml
# COMMON PARAMETERS
#
MIN_FASTA_LENGTH: 2500
MIN_MAG_LENGTH: 500000
```

This tells the workflow to use sequences longer than 2500 bp for binning and to only continue with MAGs over 500 kb in the next steps.
Then, we set parameters for AVAMB:

```yaml
# VAMB settings
#
BINNERS:
  - aaey
  - aaez
  - vae384
VAMB_THREADS: 20
VAMB_memory: 40

# Use GPU in VAMB:
# could be yes or no
VAMB_GPU: no
```
We list the different binning approaches under `BINNERS`. The values above tell the workflow to run AVAMB in both AAE mode and VAE mode. For `aaey` and `aaez`, AAE mode will be run with default parameters, and bins from the aaey and aaez latent spaces will be used. For `vae384`, VAE mode will be run with `-l 24 -n 384 384`. The workflow recognizes `aaey`, `aaez` and `vae<n>`. For `vae<n>`, the vamb parameters will be `-l <m> -n <n> <n>`, where `m = int(n/16)`; for example, `vae512` would run VAE mode with `-l 32 -n 512 512`.
Then, we set parameters for identifying the best non-redundant high-quality MAGs, using CheckM2 and CoverM:

```yaml
# CHECKM settings
#
CHECKM_COMPLETENESS: 90  # higher than this
CHECKM_CONTAMINATION: 5  # lower than this
CHECKM_BATCH_SIZE: 50    # Process MAGs with this batch size

# COVERM settings
#
COVERM_THREADS: 8
COVERM_memory: 5

# SCORING THE BEST GENOMES settings
#
# this could be checkm or genome
SCORE_METHOD: checkm
```
The workflow will check MAGs for completeness in batches of `CHECKM_BATCH_SIZE`. Processing each MAG individually takes quite a long time in our experience. If you have thousands of MAGs, increasing `CHECKM_BATCH_SIZE` is a good idea.

We use CoverM to get non-redundant MAGs using the `fastANI` algorithm, which estimates pairwise average nucleotide identity (ANI) to cluster MAGs. If you have a large number of MAGs, consider increasing `COVERM_THREADS`.
We cluster MAGs at 99% ANI and choose the best MAG within each cluster as the representative. We use `SCORE_METHOD` to score MAGs. By default, we use the MAG contig/scaffold characteristics (e.g., genome size, contig N50, longest sequence) and genome quality characteristics (completeness, contamination). The score used is:

```
seq_score    = log10(longest_contig_size/genome_size) + log10(N50/L50)
qual_score   = completeness - (2 * contamination)
score_checkm = (0.1 * qual_score) + seq_score
```
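As a worked example with hypothetical numbers: a MAG with a genome size of 3 Mb, longest contig of 300 kb, N50 of 100 kb, L50 of 10, 95% completeness, and 1% contamination would get `seq_score = log10(0.1) + log10(10000) = -1 + 4 = 3`, `qual_score = 95 - 2 = 93`, and thus `score_checkm = 9.3 + 3 = 12.3`.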
The complete `mags_generation.yaml` file for this tutorial is here. You can launch `mags_generation.smk` using:

```
[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/mags_generation.smk --configfile mags_generation.yaml
```
MIntO can predict genes in MAGs/genomes using Prokka and annotate protein-coding genes using different software and databases: eggNOG, KEGG and dbCAN. The `gene_annotation.smk` script can be launched if genomes are used as the reference (ONLY in genome-based mode: retrieved MAGs or reference genomes). In this tutorial, we will annotate both MAGs and reference genomes.

The configuration file mapping.yaml will be generated after running `QC_2.smk`. STEP5 can be skipped if the reads are mapped to reference genomes or to a gene catalog (without retrieving the MAGs). This configuration file can be found in both the `/mypath/IBD_tutorial/metaG` and `/mypath/IBD_tutorial/metaT` directories, and is used by both `gene_annotation.smk` and `gene_abundance.smk`.
If you have generated MAGs, MIntO knows where to look for them. You just need to set `map_reference` to `MAG` and tell MIntO whether you want to use MAGs from `metaG` or `metaT`. In genome-based assembly-dependent mode, using the MAGs retrieved from `metaG`, our configuration looks like this:

```yaml
######################
# Annotation settings
######################

# Set MIntO mode
# Where should we map reads to? MAG, refgenome, catalog
MINTO_MODE: MAG

# Which omics for MAGs?
MAG_omics: metaG

# path to gene catalog fasta file
PATH_reference:

# file name of gene catalog fasta file (MIntO will generate bwa index with same name)
NAME_reference:
```
If you are using your own set of reference genomes (genome-based assembly-free mode), you should place them in a directory as FASTA files (`.fna`). For this tutorial, we will use reference genomes distributed with MIntO. These correspond to the 5 species included in the tutorial dataset.

```
[user@server mypath]$ tar xfz $MINTO_DIR/tutorial/genomes.tar.gz
[user@server mypath]$ ls genomes/
B_fragilis.fna  B_uniformis.fna  F_prausnitzii.fna  P_distasonis.fna  R_bromii.fna
```
We can then tell MIntO to find these reference genomes in `/mypath/genomes`. In genome-based assembly-free mode, the `mapping.yaml` will look like this:

```yaml
######################
# Annotation settings
######################

# Set MIntO mode
# Where should we map reads to? MAG, refgenome, catalog
MINTO_MODE: refgenome

# Which omics for MAGs?
MAG_omics:

# path to gene catalog fasta file
PATH_reference: /mypath/genomes

# file name of gene catalog fasta file (MIntO will generate bwa index with same name)
NAME_reference:
```
When running the gene-catalog-based assembly-free mode (e.g. using the IGC), the `mapping.yaml` will look like this:

```yaml
# Set MIntO mode
# Where should we map reads to? MAG, refgenome, catalog
MINTO_MODE: catalog

PATH_reference: /mypath/IGC_database
NAME_reference: IGC.faa.gz
```
In addition, the user can choose which software to use for annotating protein-coding genes: `eggNOG`, `kofam` (KEGG) and/or `dbCAN`. The complete `mapping.yaml` file for this tutorial is here. For this tutorial, we will use `eggNOG-mapper` and `dbCAN` to annotate the genomes.

```yaml
# List of software used to perform genome function annotation:
# - dbCAN
# - kofam
# - eggNOG
ANNOTATION:
  - dbCAN
  - eggNOG
```
Users can choose to taxonomically annotate MAGs using `PhyloPhlAn` and/or `GTDBtk`:

```yaml
# MAG taxonomy settings
#
RUN_TAXONOMY: yes
TAXONOMY_NAME: phylophlan,gtdb  # Currently, phylophlan or gtdb or combination
TAXONOMY_CPUS: 8
TAXONOMY_memory: 5

# Taxonomy database versions
#
PHYLOPHLAN_TAXONOMY_VERSION: SGB.Jun23
GTDB_TAXONOMY_VERSION: r220
```

The above parameters will run PhyloPhlAn with the `SGB.Jun23` taxonomy and/or GTDBtk with its database `r220`.
`gene_annotation.smk` can be launched as:

```
[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/gene_annotation.smk --configfile mapping.yaml
```
When running `gene_annotation.smk`, a new directory called `/mypath/IBD_tutorial/DB` (database) will be generated in the working directory, which will contain the results of the genome annotation.

Both `kofam` and `eggNOG` provide KEGG annotations, but their results might not be identical. The union of KOs from both tools is in the column `merged.KEGG_KO` in `4-annotations/combined_annotations.tsv`.
`gene_abundance.smk` uses the `gene_annotation.smk` outputs to map the reads to the annotated genomes and genes (`/mypath/IBD_tutorial/DB`) and calculates the gene (`omics: metaG`) and transcript (`omics: metaT`) abundances. It shares `mapping.yaml` to set its own parameters as well. The relevant ones are:
```yaml
#########################
# Gene abundance settings
#########################

# BWA Alignment
BWA_threads: 10
BWA_memory: 45

# Alignment filtering
# -------------------
# msamtools_filter_length: Default 50 works well for ILLUMINA 2x150bp.
#   For shorter sequencing, e.g. 2x75bp, please reduce accordingly.
# alignment_identity: Default 97 works well for MAGs that are dereped at 99%.
#   For refgenomes, modify it according to the derep level of your refgenomes.
#   For gene catalog, modify it to match the derep level of the gene catalog (usually 95%).
# MIN_mapped_reads: Default 2 is designed for low-to-medium sequencing depth.
#   Increase according to depth (e.g., set to 10 when depth is above 5Gb).
msamtools_filter_length: 50
alignment_identity: 97
MIN_mapped_reads: 2

# Normalization approach
# Could be TPM, MG or comma-delimited combinations
abundance_normalization: TPM,MG
```
The first group sets parameters for `bwa-mem2`. You can change them according to your computing infrastructure.

The second group sets parameters for `msamtools`, which filters the alignments. The default settings will only consider reads mapping to the sequence database (MAGs, reference genomes, or reference gene catalogs) at 97% identity over at least 50 bp. Additionally, each gene will only be considered a hit if at least 2 reads map to it.

The third group sets parameters for abundance normalization. `TPM` stands for "transcripts per million", where total abundances sum to 1e6. `MG` stands for "marker genes", where the abundance of each gene is normalized using the median abundance of 10 universal bacterial marker genes (COG0012, COG0016, COG0018, COG0172, COG0215, COG0495, COG0525, COG0533, COG0541, and COG0552).
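For reference, the standard TPM calculation (the usual definition; MIntO's exact implementation may differ in details) divides each gene's read count by its length and scales the values so they sum to one million:

```
TPM_g = 1e6 * (reads_g / length_g) / sum_over_all_genes_j(reads_j / length_j)
```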
`gene_abundance.smk` should be launched once for each omic:

```
[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/gene_abundance.smk --configfile mapping.yaml
```

If you have `metaT`:

```
[user@server mypath]$ cd /mypath/IBD_tutorial/metaT
[user@server metaT]$ conda activate MIntO
(MIntO) [user@server metaT]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/gene_abundance.smk --configfile mapping.yaml
```
Take a look at the complete `mapping.yaml` file for metaG here and for metaT here.

Once `gene_abundance.smk` has finished running, the configuration file `data_integration.yaml` will be available at `/mypath/IBD_tutorial/`.
Finally, we have arrived at the last part of MIntO... the most important step, where the data from `metaG` and `metaT` are integrated together.

The `data_integration.yaml` file is generated at `/mypath/IBD_tutorial/`. The complete `data_integration.yaml` file for this tutorial is here.
In `data_integration.yaml`, `omics` has to be `metaG_metaT` in the `# General parameters` section in order to integrate the metaG and metaT results! If you don't have both metaG and metaT, you can still run the script with `omics` set to the appropriate type, to generate normalized abundance tables and plots. Read more about the required parameters here.
You can run `data_integration.smk` like this:

```
[user@server mypath]$ cd /mypath/IBD_tutorial
[user@server IBD_tutorial]$ conda activate MIntO
(MIntO) [user@server IBD_tutorial]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/data_integration.smk --configfile data_integration.yaml
```

Note: This step has to be run only once!
All the relevant outputs for the user will be generated in the folder `/mypath/IBD_tutorial/outputs`:
- assembly-free and assembly-based taxonomic profiles in `6-taxa_profile`
- gene profiles, including the gene IDs (generated by Prokka in assembly-dependent mode, or sequence IDs in assembly-free mode) and normalized gene abundance, transcript or expression values
- function profiles per database, including the function IDs, function descriptions and normalized function abundance, transcript or expression counts. For easier downstream analysis of these data, phyloseq objects are generated for the taxonomic, gene and function profiles.
In addition, MIntO also generates plots that are useful for getting a preliminary idea of the community analysed and for helping the user with downstream analysis:
- relative abundance of the top 15 genera from the assembly-free taxonomic profile
- PCoA plots using the taxonomy
- PCoA plots using gene profiles
- PCoA plots using function profiles from the different databases
- a summary of genes and features per function database that are expressed in the data
Note: After running MIntO, remember to delete the temporary directory on the local disk (`/scratch/johndoe/IBD_tutorial`).
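For instance, assuming the shadow prefix used in this tutorial:

```bash
rm -rf /scratch/johndoe/IBD_tutorial
```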