
MIntO Tutorial


1. Before starting

1.1. MIntO installation

We assume that:

MIntO has been installed at /server/apps/MIntO and the pipeline dependencies have been downloaded using dependencies.smk (take a look at installation&dependencies).

We will denote this installation directory as $MINTO_DIR in this tutorial.

Our working directory for the tutorial will be /mypath. Raw data will be located in /mypath/IBD_tutorial_raw, and data processing will happen in /mypath/IBD_tutorial.

1.2. Dataset

For this tutorial, we are going to use a subset of 10 gut microbiome samples with matched Illumina metagenomic (metaG) and metatranscriptomic (metaT) data to illustrate the use of MIntO.

  • metaG samples comprise 7.7 million inserts in total (between 628,436 and 834,494 inserts per sample).
  • metaT samples comprise 11.5 million inserts in total (between 466,724 and 1,602,400 inserts per sample).

The subset of 10 gut microbiome samples is available on Zenodo (https://zenodo.org/record/8320216):

  • IBD_tutorial_raw_v2.0.0.tar.gz comprises the raw reads to run this tutorial

1.3. Expected resources and runtime

We benchmarked MIntO using a 64-bit Linux server with 2x AMD EPYC 7742 64-Core processors and 2TB of memory.

MIntO v2 has been heavily optimized for speed, to enable analyzing large datasets. Here's a comparison of runtimes between MIntO v1 and MIntO v2, when restricting Snakemake to 700GB of memory and 96 CPUs:

| Step | MIntO v1 | MIntO v2 (metaG) | MIntO v2 (metaT) |
| --- | --- | --- | --- |
| QC_1.smk | 4 min | 1 min | 1 min |
| QC_2.smk | 20 min | 5 min | 8 min |
| assembly.smk | 30 min | 25 min | 10 min |
| binning_preparation.smk | 7 min | 3 min | NA |
| mags_generation.smk | 180 min | 13 min | NA |
| gene_annotation.smk | 15 min | 11 min | NA |
| gene_abundance.smk | 7 min | 6 min | 3 min |
| integration.smk | 5 min | 1 min | NA |
| Total | 268 min | 65 min | 22 min |

MIntO v2 finished the analysis in just over 1 hour whereas MIntO v1 needed >4 hours.

We recommend a 64-bit Linux server with at least 8 CPUs and 64 GB RAM. More CPUs and memory will shorten the wall-clock runtime.

2. Download the data

Let us download the data to our working directory /mypath.

2.1. Get data from Zenodo

The following commands will prepare the data for you.

cd /mypath
wget https://zenodo.org/record/8320216/files/IBD_tutorial_raw_v2.0.0.tar.gz
tar xfz IBD_tutorial_raw_v2.0.0.tar.gz

The output looks like this:

[user@server /]$ cd /mypath
[user@server mypath]$ wget https://zenodo.org/record/8320216/files/IBD_tutorial_raw_v2.0.0.tar.gz
--2023-09-05 23:14:13--  https://zenodo.org/record/8320216/files/IBD_tutorial_raw_v2.0.0.tar.gz
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2175851966 (2.0G) [application/octet-stream]
Saving to: ‘IBD_tutorial_raw_v2.0.0.tar.gz’

IBD_tutorial_raw_v2.0.0.tar.gz  100%[===============================================>]   2.03G  68.9MB/s    in 4m 15s

2023-09-05 23:18:29 (8.13 MB/s) - ‘IBD_tutorial_raw_v2.0.0.tar.gz’ saved [2175851966/2175851966]

[user@server mypath]$ tar xfz IBD_tutorial_raw_v2.0.0.tar.gz
[user@server mypath]$ cd IBD_tutorial_raw
[user@server IBD_tutorial_raw]$ ls
metaG  metaT
[user@server IBD_tutorial_raw]$ ls *
metaG:
CD136  CD138  CD140  CD142  CD146  CD237  CD238  CD240  CD242  CD244

metaT:
CD136  CD138  CD140  CD142  CD146  CD237  CD238  CD240  CD242  CD244
[user@server IBD_tutorial_raw]$ ls metaG/*/*
metaG/CD136/run_aa.1.fq.gz  metaG/CD138/run_ab_1.fq.gz  metaG/CD142/run1.1.fq.gz   metaG/CD240/CD240.1.fq.gz
metaG/CD136/run_aa.2.fq.gz  metaG/CD138/run_ab_2.fq.gz  metaG/CD142/run1.2.fq.gz   metaG/CD240/CD240.2.fq.gz
metaG/CD136/run_ab.1.fq.gz  metaG/CD138/run_ac_1.fq.gz  metaG/CD146/CD146.1.fq.gz  metaG/CD242/CD242.1.fq.gz
metaG/CD136/run_ab.2.fq.gz  metaG/CD138/run_ac_2.fq.gz  metaG/CD146/CD146.2.fq.gz  metaG/CD242/CD242.2.fq.gz
metaG/CD136/run_ac.1.fq.gz  metaG/CD138/run_ad_1.fq.gz  metaG/CD237/CD237.1.fq.gz  metaG/CD244/CD244.1.fq.gz
metaG/CD136/run_ac.2.fq.gz  metaG/CD138/run_ad_2.fq.gz  metaG/CD237/CD237.2.fq.gz  metaG/CD244/CD244.2.fq.gz
metaG/CD138/run_aa_1.fq.gz  metaG/CD140/CD140.1.fq.gz   metaG/CD238/CD238.1.fq.gz
metaG/CD138/run_aa_2.fq.gz  metaG/CD140/CD140.2.fq.gz   metaG/CD238/CD238.2.fq.gz
[user@server IBD_tutorial_raw]$ ls metaT/*/*
metaT/CD136/CD136.1.fq.gz  metaT/CD140/CD140.2.fq.gz  metaT/CD237/CD237.1.fq.gz  metaT/CD240/CD240.2.fq.gz
metaT/CD136/CD136.2.fq.gz  metaT/CD142/CD142.1.fq.gz  metaT/CD237/CD237.2.fq.gz  metaT/CD242/CD242.1.fq.gz
metaT/CD138/CD138.1.fq.gz  metaT/CD142/CD142.2.fq.gz  metaT/CD238/CD238.1.fq.gz  metaT/CD242/CD242.2.fq.gz
metaT/CD138/CD138.2.fq.gz  metaT/CD146/CD146.1.fq.gz  metaT/CD238/CD238.2.fq.gz  metaT/CD244/CD244.1.fq.gz
metaT/CD140/CD140.1.fq.gz  metaT/CD146/CD146.2.fq.gz  metaT/CD240/CD240.1.fq.gz  metaT/CD244/CD244.2.fq.gz

2.2. Data description

The files listed in Section 2.1 are a subset of 10 samples from the Inflammatory Bowel Disease Multi’omics Database (IBDMDB), corresponding to two participants with Crohn’s disease (CD1 and CD2).

The files are organized into sub-directories: each sample has its own directory, sometimes containing multiple runs of the same sequencing library. Paired files have the same name prefix and the extensions .1.fq.gz and .2.fq.gz for forward and reverse reads, respectively.
With the sample identifiers (IDs) listed in the configuration file, MIntO can collect and process input organized this way. For other input options, see input&output.

2.2.1. metaG samples from CD1: CD136, CD138, CD140, CD142 and CD146

metaG/CD136/run_aa.1.fq.gz
metaG/CD136/run_aa.2.fq.gz
metaG/CD136/run_ab.1.fq.gz
metaG/CD136/run_ab.2.fq.gz
metaG/CD136/run_ac.1.fq.gz
metaG/CD136/run_ac.2.fq.gz
metaG/CD138/run_aa_1.fq.gz
metaG/CD138/run_aa_2.fq.gz
metaG/CD138/run_ab_1.fq.gz
metaG/CD138/run_ab_2.fq.gz
metaG/CD138/run_ac_1.fq.gz
metaG/CD138/run_ac_2.fq.gz
metaG/CD138/run_ad_1.fq.gz
metaG/CD138/run_ad_2.fq.gz
metaG/CD140/CD140.1.fq.gz
metaG/CD140/CD140.2.fq.gz
metaG/CD142/run1.1.fq.gz
metaG/CD142/run1.2.fq.gz
metaG/CD146/CD146.1.fq.gz
metaG/CD146/CD146.2.fq.gz

Original files from CD1 were derived from just one sequencing run each. We split them into multiple runs to enable testing MIntO's ability to handle multiple runs. We also renamed CD138 files as *_1.fq.gz and *_2.fq.gz to test MIntO's ability to handle different file-naming conventions.

2.2.2. metaG samples from CD2: CD237, CD238, CD240, CD242 and CD244.

metaG/CD237/CD237.1.fq.gz
metaG/CD237/CD237.2.fq.gz
metaG/CD238/CD238.1.fq.gz
metaG/CD238/CD238.2.fq.gz
metaG/CD240/CD240.1.fq.gz
metaG/CD240/CD240.2.fq.gz
metaG/CD242/CD242.1.fq.gz
metaG/CD242/CD242.2.fq.gz
metaG/CD244/CD244.1.fq.gz
metaG/CD244/CD244.2.fq.gz

2.2.3. Respective metaT for both participants:

metaT/CD136/CD136.1.fq.gz
metaT/CD136/CD136.2.fq.gz
metaT/CD138/CD138.1.fq.gz
metaT/CD138/CD138.2.fq.gz
metaT/CD140/CD140.1.fq.gz
metaT/CD140/CD140.2.fq.gz
metaT/CD142/CD142.1.fq.gz
metaT/CD142/CD142.2.fq.gz
metaT/CD146/CD146.1.fq.gz
metaT/CD146/CD146.2.fq.gz

metaT/CD237/CD237.1.fq.gz
metaT/CD237/CD237.2.fq.gz
metaT/CD238/CD238.1.fq.gz
metaT/CD238/CD238.2.fq.gz
metaT/CD240/CD240.1.fq.gz
metaT/CD240/CD240.2.fq.gz
metaT/CD242/CD242.1.fq.gz
metaT/CD242/CD242.2.fq.gz
metaT/CD244/CD244.1.fq.gz
metaT/CD244/CD244.2.fq.gz

Summary:

MIntO expects input as gzipped fastq files. Paired files of the same run should have matching name prefixes. The default extensions are .1.fq.gz and .2.fq.gz for the forward and reverse reads, respectively.

MIntO can collect each sample from its own directory (named by sample ID), but other input options are also available.

MIntO matches metaG and metaT data by sample ID, so the same sample must use the same ID in both omics in order to generate gene and function expression profiles.

2.3. Copy the metadata

Let us copy metadata from the MIntO repository.

[user@server mypath]$ mkdir /mypath/IBD_tutorial
[user@server mypath]$ cp $MINTO_DIR/tutorial/metadata/tutorial_metadata.txt /mypath/IBD_tutorial

2.4. Our objective

The samples above have been filtered from the originals and only include reads matching 5 genomes:

  • Faecalibacterium prausnitzii
  • Parabacteroides distasonis
  • Bacteroides fragilis
  • Ruminococcus bromii
  • Bacteroides uniformis

Here, we are going to retrieve these genomes as MAGs from metaG using the genome-based assembly-dependent mode, and then integrate metaG and metaT for matching samples.

Let's go!

3. Analysis with MIntO v2.2

Tips on running Snakemake:

In this tutorial, we provide commands to perform the analysis on a standalone server without a queuing system. If you have slurm on your cluster, you can replace the --cores option with --cluster "sbatch -J {name} --mem={resources.mem}G -c {threads} -e slurm-%x.e%A -o slurm-%x.o%A" --local-cores 8 in all Snakemake commands to let slurm manage your jobs. This will create output and error files for each Snakemake rule run via slurm as slurm-<rulename>.o<JobID> and slurm-<rulename>.e<JobID>, respectively.

For launching MIntO with other cluster queuing systems, see the Snakemake documentation.

You can add --dry-run -p to the snakemake command line for a dry run that lists what would be done.
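
For example, to preview what the QC_1 step (Section 3.1) would do without actually running it (using the $SNAKE_PARAMS variable defined in Section 3.0 below):

(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --dry-run -p --snakefile $MINTO_DIR/smk/QC_1.smk --configfile QC_1.yaml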

3.0. Set up Snakemake command arguments

For the runtime benchmark of this tutorial, we used --use-conda --restart-times 1 --keep-going --latency-wait 60 --jobs 20 --cores 96 --resources mem=700 --conda-prefix $MINTO_DIR/conda_env --shadow-prefix /scratch/johndoe as arguments to Snakemake. These parameters are explained below.

We use the following Snakemake arguments:

--use-conda --restart-times 1 --keep-going --latency-wait 60

These options tell Snakemake to

  1. use conda environments (MIntO uses them a lot!),
  2. try re-running a failed rule a second time (just in case!),
  3. continue running the workflow even if some rules fail (imagine going home for the night and coming back the next day to find that the workflow quit 10 minutes after you left!),
  4. wait for 60 seconds before giving up on an output file being produced (this could be due to NFS delays).

For the tutorial, we also use:

  • --jobs 4 to ask Snakemake to run at most 4 jobs at a time,
  • --cores 16 to ask Snakemake to use at most 16 CPUs at any time (this limit applies across the entire workflow at any given time),
  • --resources mem=120 to cap the total memory (in GB) requested by concurrently running jobs at 120 GB,
  • --conda-prefix to specify where to install the conda environments.

We also use Snakemake's shadow directive, which is an elegant way to run I/O-intensive tasks on the local disk of an execution server (see the Snakemake shadow rule documentation). For this to work, the user should provide a temporary location on a local disk with enough space via the --shadow-prefix option. Many compute clusters provide a fast disk at /scratch. Snakemake will use a unique directory for each rule. The advantage of the shadow directive is that you do not need to clean up after you are done: the entire directory is deleted once the rule completes.

For ease of use, we will create a variable $SNAKE_PARAMS with these options and use it in the Snakemake commands.

[user@server mypath]$ SNAKE_PARAMS="--use-conda --restart-times 1 --keep-going --latency-wait 60 --jobs 4 --cores 16 --resources mem=120 --conda-prefix $MINTO_DIR/conda_env --shadow-prefix /scratch/johndoe"
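
If you use slurm as described in the tips above, an equivalent invocation could look like this sketch (drop the --cores option from $SNAKE_PARAMS, and adjust the sbatch arguments to your cluster):

[user@server mypath]$ SNAKE_PARAMS="--use-conda --restart-times 1 --keep-going --latency-wait 60 --jobs 4 --local-cores 8 --resources mem=120 --conda-prefix $MINTO_DIR/conda_env --shadow-prefix /scratch/johndoe"
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --cluster "sbatch -J {name} --mem={resources.mem}G -c {threads} -e slurm-%x.e%A -o slurm-%x.o%A" --snakefile $MINTO_DIR/smk/QC_1.smk --configfile QC_1.yaml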

3.1. STEP1: Quality Control - Part 1 (QC_1)

QC_1 stands for Quality Control Part 1, which first removes sequencing adapters and low-quality bases.

QC_1.smk requires the configuration file QC_1.yaml, which has to be filled out by the user. You can find an example configuration file at $MINTO_DIR/configuration/QC_1.yaml and the one used to run this tutorial here!

We will take a look at the # General settings and # ILLUMINA sections; find out more about the other parameters included in QC_1.yaml here.

"General settings" section for metaG and metaT

Here's the #General settings section for metaG analysis.

######################
# General settings
######################
PROJECT: IBD_tutorial
working_dir: /mypath/IBD_tutorial
omics: metaG
minto_dir: /server/apps/MIntO
METADATA: /mypath/IBD_tutorial/tutorial_metadata.txt
raw_reads_dir: /mypath/IBD_tutorial_raw/metaG

And here's the #General settings section for metaT analysis.

######################
# General settings
######################
PROJECT: IBD_tutorial
working_dir: /mypath/IBD_tutorial
omics: metaT
minto_dir: /server/apps/MIntO
METADATA: /mypath/IBD_tutorial/tutorial_metadata.txt
raw_reads_dir: /mypath/IBD_tutorial_raw/metaT

The different variables or fields are described below.

| Field | Description |
| --- | --- |
| PROJECT | Name of the project. Use the same project name for a project with both metaG and metaT. |
| working_dir | Path to the project working directory, where MIntO will generate the outputs. Use the same location for a project with both metaG and metaT. |
| omics | metaG or metaT |
| minto_dir | Path to the directory where you have installed MIntO; in this case, our $MINTO_DIR |
| METADATA | Path to the file with sample metadata |
| raw_reads_dir | Path to the directory with metaG or metaT raw reads |

Note:

Don't forget to replace /mypath/ with your path!

Use the same working_dir for both metaG and metaT from the same study so that they can be integrated into gene expression profiles.

"ILLUMINA" section for metaG and metaT

This is a list of sample identifiers in yaml format. Each sample ID is given on its own line, prefixed by -.

ILLUMINA:
- CD136
- CD138
- CD140
- CD142
- CD146
- CD237
- CD238
- CD240
- CD242
- CD244

Launching QC_1 jobs

Once the configuration file metaG/QC_1.yaml has been copied to the working directory and properly filled out, QC_1.smk can be launched for metaG:

[user@server mypath]$ mkdir /mypath/IBD_tutorial/metaG
[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ cp $MINTO_DIR/tutorial/configuration/metaG/QC_1.yaml .
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/QC_1.smk --configfile QC_1.yaml

For metaT, the QC_1.smk can be launched after copying the configuration file metaT/QC_1.yaml and filling it out properly:

[user@server mypath]$ mkdir /mypath/IBD_tutorial/metaT
[user@server mypath]$ cd /mypath/IBD_tutorial/metaT
[user@server metaT]$ cp $MINTO_DIR/tutorial/configuration/metaT/QC_1.yaml .
[user@server metaT]$ conda activate MIntO
(MIntO) [user@server metaT]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/QC_1.smk --configfile QC_1.yaml

When QC_1 finishes

Once QC_1.smk finishes running, MIntO generates the directory /mypath/IBD_tutorial/metaG/1-trimmed and a partially filled-out configuration file /mypath/IBD_tutorial/metaG/QC_2.yaml (or the corresponding files in /mypath/IBD_tutorial/metaT) for running the next step for each omics. Here you can see how the QC_2.yaml file looks. Importantly, QC_1.smk estimates the value of the READ_minlen field in QC_2.yaml and fills it in for you. This is the read-length cutoff that will retain perc_remaining_reads percent of reads (by default, 95% of reads are retained).

3.2. STEP2: Quality Control - Part 2 (QC_2)

The second part of quality control removes reads that are shorter than READ_minlen as well as host-derived sequences. Afterwards, read-based taxonomic profiling is performed.

Take a look at the complete QC_2.yaml file here.

"General settings" section

For both metaG and metaT, this section is identical to that in QC_1.yaml. This is also autofilled by QC_1.smk for you.

"Read length filtering" section

This section sets the READ_minlen parameter for length filtering using the seqkit package. It is autofilled by QC_1.smk for you; normally, you need not edit it. For the tutorial, the cutoff is 134.

#########################
# Read length filtering
#########################

READ_minlen: 134
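
In other words, reads shorter than the cutoff are dropped. A rough stand-alone equivalent for a single file, using seqkit's minimum-length filter, would be the sketch below (file names are hypothetical, and QC_2.smk runs the real command for you, which also keeps paired files in sync):

seqkit seq -m 134 CD136.trimmed.1.fq.gz -o CD136.filtered.1.fq.gz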

"Host genome filtering" section

This section sets the parameters for filtering host reads. Here's an example:

# Host genome filtering
PATH_host_genome: /mypath/IBD_tutorial
NAME_host_genome: build_hg18_subset.fna
BWA_index_host_memory: 100
BWA_host_threads: 8
BWA_host_memory: 40

Let us dissect this.

| Field | Auto-filled? | Description |
| --- | --- | --- |
| PATH_host_genome | No | Location of the host genome bwa-mem2 database. We highly recommend using a local disk for this database, which makes the alignments much faster. If you are using a cluster, please make sure that the local disk is accessible at the same path across all execution servers. |
| NAME_host_genome | No | Name of the database (prefix used when building the index with bwa-mem2 index) |
| BWA_index_host_memory | Yes | Estimated memory usage (in GB) by bwa-mem2 index |
| BWA_host_threads | Yes | Number of threads when aligning using bwa-mem2 mem |
| BWA_host_memory | Yes | Estimated memory usage (in GB) by bwa-mem2 mem |

BWA program parameters (the BWA_* fields) are autofilled by QC_1.smk with default values suited to filtering against the human genome. Normally, you need not edit these fields. However, the host_genome fields need to be filled in by the user. There are two scenarios for host genome filtering:

  • Scenario 1 - <PATH_host_genome>/BWA_index/<NAME_host_genome>.bwt.2bit.64 exists: since the index already exists, it will not be recreated, and alignment can proceed directly.
  • Scenario 2 - <PATH_host_genome>/<NAME_host_genome> exists: if the .bwt.2bit.64 file (and the other index files) are missing, bwa-mem2 index will be used to create the database at <PATH_host_genome>/BWA_index/<NAME_host_genome>.bwt.2bit.64.

The samples for this tutorial come from human fecal metagenomes from which host-derived sequences have already been removed. Thus, there is no need to index the entire human genome (build hg38) and map the reads to it. For a fast test of how this step works, we will map the reads to a subset of the human genome ($MINTO_DIR/tutorial/build_hg18_subset.fna).

[user@server mypath]$ cd /mypath/IBD_tutorial
[user@server IBD_tutorial]$ cp $MINTO_DIR/tutorial/build_hg18_subset.fna /mypath/IBD_tutorial
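
If you prefer to pre-build the host index yourself (Scenario 1 above), a sketch using bwa-mem2 directly would be (assuming bwa-mem2 is available on your PATH; otherwise MIntO builds the index for you under Scenario 2):

[user@server IBD_tutorial]$ mkdir -p BWA_index
[user@server IBD_tutorial]$ bwa-mem2 index -p BWA_index/build_hg18_subset.fna build_hg18_subset.fna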

"Ribosomal RNA depletion" section

For omics: metaT, ribosomal RNA sequences will be removed from the metaT reads using the sortmeRNA software. For this, the user has to specify the path to the ribosomal RNA database. Here are the fields:

#########################
# Ribosomal RNA depletion
#########################
sortmeRNA_threads: 8
sortmeRNA_memory: 10
sortmeRNA_db: /server/apps/MIntO/data/rRNA_databases
sortmeRNA_db_idx: /server/apps/MIntO/data/rRNA_databases/idx

If the ribosomal RNA database has been downloaded using dependencies.smk, the database will be located at $MINTO_DIR/data/rRNA_databases and the database index at $MINTO_DIR/data/rRNA_databases/idx. These values are autofilled by QC_1.smk. If you downloaded them manually, please update these fields accordingly.

"Assembly-free taxonomy profiling" section

The user has to select the taxonomic profiling tool (MetaPhlAn or mOTUs3). By default, QC_1.smk autofills the profiler names as motus_rel,metaphlan (motus_rel: relative abundances from mOTUs3; metaphlan: relative abundances from MetaPhlAn), which is what this tutorial uses. A third option, motus_raw, reports estimated raw read counts for each species from mOTUs3.

While taxonomy profiling is normally performed for metaG, MIntO can perform this also for metaT data. If you do not need this for metaT, just remove the section from QC_2.yaml.

##################################
# Assembly-free taxonomy profiling
##################################
TAXA_threads: 8
TAXA_memory: 10
taxa_profile: motus_rel,metaphlan
metaphlan_version: 4.0.6
motus_version: 3.0.3

"K-mer based sample comparison" section

For metaG samples, dissimilarity can be computed on k-mer abundances using sourmash. The cutoff value is used to create clusters of highly similar samples for co-abundance binning; these clusters are named with the prefix SCL. Set the cutoff to 0 to turn this feature off.

#########################
# K-mer based comparison
#########################

SOURMASH_min_abund: 2
SOURMASH_max_abund: 1000
SOURMASH_cutoff: 0.40

"Analysis parameters" section

At the end of the QC_2 step, alpha- and beta-diversity plots will be generated. Visualization can be enhanced using metadata variables, e.g., coloring points by some feature. In addition, the yaml files for later steps will be created using the main analysis variable. These variables need to be defined in QC_2.yaml. The main variable/factor of interest in the study will be used to define co-assemblies and to color the points in visualizations; it is required and must also be defined in the metadata file. The other variables are optional: the second factor will be used to shape the points, and for time-series data, a time variable makes it possible to visualize metrics over time. These are passed via QC_2.yaml as column headers (variable names) of the metadata file given in the METADATA field. For example, in a case-control cohort, MAIN_factor could be disease_status, and PLOT_factor2 could be sample_site.

| Field | Required | Auto-filled? | Description |
| --- | --- | --- | --- |
| MAIN_factor | Yes | No | The main factor in the metadata file to differentiate in visualization (using color) |
| PLOT_factor2 | No | No | The second factor in the metadata file to differentiate in visualization (using shape) |
| PLOT_time | No | No | Name of the factor in the metadata file denoting time (e.g. hour, day) |

In this tutorial, we will use subject (variable participant_ID2) as MAIN_factor and week (variable week_n) as PLOT_time.

#####################
# Analysis parameters
#####################

MAIN_factor: participant_ID2
PLOT_factor2:
PLOT_time: week_n

"ILLUMINA" section

For both metaG and metaT, this section is identical to that in QC_1.yaml. This is also autofilled by QC_1.smk for you. Normally, you need not edit these fields at this step.

Launching QC_2 jobs

Once the configuration file has been filled out for metaG, QC_2.smk can be launched:

[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/QC_2.smk --configfile QC_2.yaml

For metaT:

[user@server mypath]$ cd /mypath/IBD_tutorial/metaT
[user@server metaT]$ conda activate MIntO
(MIntO) [user@server metaT]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/QC_2.smk --configfile QC_2.yaml

After the second quality control (QC_2.smk), the configuration file assembly.yaml will be generated ONLY in the /mypath/IBD_tutorial/metaG directory.

3.3. STEP3: Assembly

This step assembles reads into contigs so that we can reconstruct metagenome-assembled genomes (MAGs). It is only required in the genome-based assembly-dependent mode.

Note

The assembly.yaml file is shared between this step and the next step (assembly.smk and binning_preparation.smk).

Assembler parameters

In assembly.yaml there are different fields to set up (have a look here) for the following assemblers:

  • MetaSPAdes for (i) individual assembly of short-reads and (ii) hybrid assembly of short-reads and long reads
  • MEGAHIT for co-assembly of short-reads
  • MetaFlye for individual assembly of long-reads

Let us have a look.

# MetaSPAdes settings
#
METASPADES_qoffset: auto
METASPADES_threads: 16
METASPADES_memory: 10
METASPADES_hybrid_max_k: 99
METASPADES_illumina_max_k: 99

# MEGAHIT settings
#
# Note on MEGAHIT_memory:
#     MEGAHIT's memory requirement scales with sample count and sample size.
#     Default MEGAHIT_memory below is 10 Gigabytes per sample (plenty for metaG samples of 10M reads).
#     MEGAHIT_memory parameter represents per-sample memory in Gigabytes
#     during the first attempt. If it is not enough and MEGAHIT fails, then
#     each successive attempt will increase per-sample memory by 6 Gigabytes
#     until it succeeds or reaches max_attempts from snakemake.
#     Please make sure that there is enough RAM on the server.
# Custom k-list for assembly: k must be odd, in the range 15-255, increment <= 28, e.g. 21,29,39,59,79,99,119,141
MEGAHIT_memory: 10
MEGAHIT_threads: 32
MEGAHIT_presets:
- meta-sensitive
- meta-large
#MEGAHIT_custom:
# -

The metaSPAdes max k sizes are set for the typical short-read length of 150bp. For shorter read lengths, metaSPAdes automatically adjusts them lower.
The MEGAHIT preset meta-sensitive is for assembling lower-complexity metagenomes with higher sensitivity, while meta-large is for assembling more complex metagenomes at the cost of capturing less strain variance. The user can also supply a custom list of k-mers via the MEGAHIT_custom parameter after uncommenting those lines.

In this tutorial, only the first two assemblers (MetaSPAdes for individual short-read assembly, and MEGAHIT for co-assembly) will be run, because there are only short reads.

Coassembly

MEGAHIT is used to run the co-assembly of all reads (Full) and the co-assembly per participant (CD1 and CD2). Per-participant co-assembly is organized based on grouping samples via participant_ID2, which is our MAIN_factor.

By default, MIntO generates these coassembly definitions, but sets enable_COASSEMBLY to no.

###############################
# COASSEMBLY section:
###############################

# If enable_COASSEMBLY is "yes", MEGAHIT coassembly will be performed using the following definitions.
# Each coassembly is named in the LHS, and corresponding illumina sample(s) are in RHS (delimited by '+').
# One coassembly will be performed for each line.
# E.g. 'Subject1: I3+I4' will result in 1 coassembly: 'Subject1' using I3 and I4 data.
# Memory per coassembly is calculated to be 10G per sample in the coassembly.
# Please make sure that there is enough RAM on the server.
#
enable_COASSEMBLY: yes
COASSEMBLY:
  Full: CD136+CD138+CD140+CD142+CD146+CD237+CD238+CD240+CD242+CD244
  CD1: CD136+CD138+CD140+CD142+CD146
  CD2: CD237+CD238+CD240+CD242+CD244

Please note the enable_COASSEMBLY field. For large datasets with hundreds of samples, enabling coassembly may not be a good idea: it will take too much time and resources. You can turn co-assembly on by changing enable_COASSEMBLY to yes.
If you have time-series data with 1000+ samples where per-participant coassemblies are manageable (not more than 20-30 samples each), you can simply remove the Full coassembly of all samples and keep the rest, as in the sketch below.
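
For example, keeping only the per-participant coassemblies from this tutorial would look like:

enable_COASSEMBLY: yes
COASSEMBLY:
  CD1: CD136+CD138+CD140+CD142+CD146
  CD2: CD237+CD238+CD240+CD242+CD244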

The complete assembly.yaml file for this tutorial is here.

Launching assembly jobs

You can launch assembly.smk using:

[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/assembly.smk --configfile assembly.yaml 

3.4. STEP4: Binning preparation (creating inputs for binners)

This step is run after the previous assembly step finishes successfully. It maps reads from each sample to all assemblies from the previous step to prepare a contig depth file for MAG reconstruction. This step is only required in the genome-based assembly-dependent mode.

The assembly.yaml file is shared between the previous step and this step (assembly.smk and binning_preparation.smk).

Binning preparation parameters

Here's the part relevant for binning preparation:

###############################
# Binning preparation settings
###############################

# Whether to use contigs or scaffolds from SPAdes
SPADES_CONTIGS_OR_SCAFFOLDS: contigs

# minimum contig/scaffold fasta length
MIN_FASTA_LENGTH: 2500

# assembly batch size for mapping reads to combined contig sets
CONTIG_MAPPING_BATCH_SIZE: 100

# Should we exclude any assembly type during MAG generation?
# E.g., if you make MAGs from metaT, individual sample assemblies
# will be quite fragmented. If you have several co-assemblies, then
# ignoring 'illumina_single' might improve your MAG quality. This can
# be achieved using:
# ---
# EXCLUDE_ASSEMBLY_TYPES:
#     - illumina_single
# ---
#
# EXCLUDE_ASSEMBLY_TYPES:

# Contig-depth: bwa-mem2 settings
# Used when mapping reads back to contigs
#
BWA_threads: 10
BWA_memory: 25
BWA_index_memory: 50

# Contig-depth: samtools sort settings
# Used when sorting bam files
# memory listed below is PER-THREAD.
SAMTOOLS_sort_threads: 2
SAMTOOLS_sort_perthread_memgb: 10

Users can choose whether to use contigs or scaffolds from metaSPAdes. They can also choose the minimum length of a scaffold/contig to be included in binning.

Important resource considerations

Generating the contig depth file is the most resource-consuming step in MAG reconstruction. The default resource parameters generated in the assembly.yaml file work well for this tutorial. However, you need to edit these settings for real metagenome datasets.

Given N samples and their assemblies, we map each sample against a combined database of all N assemblies. The resulting bam output is first filtered for alignments over 95% sequence identity, then sorted by coordinates, and finally summarized into a contig depth file.

Imagine a cohort with 200 such samples. Even at the 95% threshold, reads from a given species in sample X will map to all contigs belonging to the same species across multiple samples. For prevalent and abundant species such as Bacteroides vulgatus, each read might map to >100 contigs representing different strains from different individuals. All this information is very useful during the MAG reconstruction process. This might sound innocuous, but a typical 9Gb human fecal metagenome with 30 million paired-end reads produces assemblies with 150,000 scaffolds and 100-200Mb of sequence. We then need to perform 200 mapping jobs, each mapping 30 million paired-end reads against 30,000,000 contigs with 20-40Gb total length. For this step, bwa-mem2 mem is run with the -a option, which lists all hits over a threshold (its default behavior otherwise is to randomly select among best hits). With multiple hits per read, the resulting bam output will be quite large, and sorting it in memory efficiently using samtools sort requires lots of memory.

Based on a lot of experience with large cohorts (e.g., 200 samples each with 30 million paired-end reads), we have optimized the following parameters for bwa-mem2 and samtools sort.

# Contig-depth: bwa-mem2 settings
# Used when mapping reads back to contigs
#
BWA_threads: 24
BWA_memory: 70
BWA_index_memory: 300

# Contig-depth: samtools sort settings
# Used when sorting bam files
# memory listed below is PER-THREAD.
SAMTOOLS_sort_threads: 3
SAMTOOLS_sort_perthread_memgb: 70

The above parameters are optimized for our cluster where each server has 2TB RAM and 192 CPUs. When using a batch size of 50 assemblies, bwa-mem2 needs ~70GB RAM to map a sample to a batch. Its output is piped to msamtools for filtering but that does not use much RAM. The filtered output is piped to 'samtools sort' with 3 CPUs each using 70GB RAM. This means that each mapping of a sample to a batch in total needs 280GB RAM. This allows us to run 7 jobs on a server (occupying 1960GB RAM, with some room as buffer). In order to not leave idle CPUs, we then share the 192 CPUs across these 7 jobs, which gives us 27 CPUs per job. Reserving 3 CPUs for samtools sort, we let bwa-mem2 have 24 CPUs. With this set up, each job takes 1-2 hours on average. In the above example of 200 samples, each sample is mapped to 4 batches of 50 assemblies each, so there are 800 mapping tasks. With the throughput of 7 jobs every 1-2 hours, it would take 7 days to complete.

Users should optimize these parameters for their scenario. If you do not have a lot of RAM, consider reducing SAMTOOLS_sort_threads; sorting will take longer, as it is the bottleneck in this case.
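
For intuition, a single mapping task in this step conceptually corresponds to a pipeline like the following sketch. The file names, batch name, and exact flags are illustrative (binning_preparation.smk generates the real commands), but the structure matches the description above:

# Map one sample against one batch of assemblies (-a reports all hits above threshold),
# keep alignments at >=95% sequence identity, then coordinate-sort the BAM
# with 3 threads, each using up to 70GB.
bwa-mem2 mem -a -t 24 BWA_index/batch_01 sample.1.fq.gz sample.2.fq.gz \
    | msamtools filter -S -b -p 95 - \
    | samtools sort -@ 3 -m 70G -o sample.batch_01.sorted.bam -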

Launching contig-depth jobs

Once the assembly.yaml file is optimized for your setup, you can launch binning_preparation.smk as follows:

[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/binning_preparation.smk --configfile assembly.yaml

When binning_preparation.smk has finished, the configuration file mags_generation.yaml will be available at /mypath/IBD_tutorial/metaG.

3.5. STEP5: Generating MAGs

This step generates MAGs using AVAMB. The MAGs are quality-checked with CheckM2 and clustered with CoverM in order to obtain a subset of representative genomes of the community.

Here, we hope to retrieve near-complete MAGs for the 5 species we included.

Several options for MAG generation are encoded in the config file mags_generation.yaml.

Choosing MAG characteristics

First, general parameters for MAGs:

# COMMON PARAMETERS
#
MIN_FASTA_LENGTH: 2500
MIN_MAG_LENGTH: 500000

This tells the workflow to use sequences longer than 2500bp for binning and only continue with MAGs over 500kb in the next steps.

Options for AVAMB

Then, we set parameters for AVAMB:

# VAMB settings
#
BINNERS:
- aaey
- aaez
- vae384

VAMB_THREADS: 20
VAMB_memory: 40

# Use GPU in VAMB:
# could be yes or no
VAMB_GPU: no

We list the different binning approaches under BINNERS. The values above tell the workflow to run AVAMB in both AAE mode and VAE mode. For aaey and aaez, AAE mode will be run with default parameters, and bins from the aaey and aaez latent spaces will be used. For vae384, VAE mode will be run with '-l 24 -n 384 384'. The workflow recognizes aaey, aaez and vae<n>. For vae<n>, the vamb parameters will be '-l <m> -n <n> <n>', where m=int(n/16).
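
For example, vae512 would run VAE mode with '-l 32 -n 512 512', since int(512/16) = 32.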

Non-redundant high-quality MAGs

Then, we set parameters for identifying the best non-redundant high-quality MAGS, using CheckM2 and coverM:

# CHECKM settings
#
CHECKM_COMPLETENESS: 90  # higher than this
CHECKM_CONTAMINATION: 5  # lower than this
CHECKM_BATCH_SIZE: 50    # Process MAGs with this batch size

# COVERM settings
#
COVERM_THREADS: 8
COVERM_memory: 5

# SCORING THE BEST GENOMES settings
#
# this could be checkm or genome
SCORE_METHOD: checkm

The workflow will check MAGs for completeness in batches of CHECKM_BATCH_SIZE. In our experience, processing each MAG individually takes quite a long time. If you have thousands of MAGs, increasing CHECKM_BATCH_SIZE is a good idea.

We use CoverM to obtain non-redundant MAGs via the fastANI algorithm, which estimates pairwise average nucleotide identity (ANI) to cluster MAGs. If you have a large number of MAGs, consider increasing COVERM_THREADS.

We cluster MAGs at 99% ANI and choose the best MAG within each cluster as the representative. We use SCORE_METHOD to score MAGs. By default, we use the MAG contig/scaffold characteristics (e.g., genome size, contig N50, longest sequence) and genome quality characteristics (completeness, contamination). The score used is:

seq_score = log10(longest_contig_size/genome_size) + log10(N50/L50)
qual_score = completeness - (2 * contamination)
score_checkm = (0.1 * qual_score) + seq_score
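
For example, a MAG with 95% completeness and 2% contamination gets qual_score = 95 - (2 * 2) = 91, contributing 0.1 * 91 = 9.1 to score_checkm; the seq_score term then separates MAGs of similar quality by assembly contiguity.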

The complete mags_generation.yaml file for this tutorial is here.

Launching MAG construction

[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/mags_generation.smk --configfile mags_generation.yaml

3.6. STEP6: Annotating genomes

MIntO can predict genes in MAGs/genomes using Prokka and annotate protein-coding genes using different software and databases: eggNOG, KEGG and dbCAN. The gene_annotation.smk script can be launched only if genomes are used as reference (i.e., in genome-based mode, with retrieved MAGs or reference genomes). In this tutorial, we will annotate both MAGs and reference genomes.

The configuration file mapping.yaml is generated after running QC_2.smk. STEP5 (MAG generation) can be skipped if the reads are mapped to reference genomes or to a gene catalog (without retrieving MAGs).

This configuration file can be found in both the /mypath/IBD_tutorial/metaG and /mypath/IBD_tutorial/metaT directories and is used by gene_annotation.smk and gene_abundance.smk.

MAGs generated by MIntO

If you have generated MAGs, MIntO knows where to look for them. You just need to set MINTO_MODE to MAG and tell MIntO (via MAG_omics) whether you want to use MAGs from metaG or metaT. In genome-based assembly-dependent mode, using the MAGs retrieved from metaG, our configuration looks like this:

######################
# Annotation settings
######################

# Set MIntO mode
# Where should we map reads to? MAG, refgenome, catalog
MINTO_MODE: MAG

# Which omics for MAGs?
MAG_omics: metaG

# path to gene catalog fasta file
PATH_reference:

# file name of gene catalog fasta file (MIntO will generate bwa index with same name)
NAME_reference:

Own reference genomes

If you are using your own set of reference genomes (genome-based assembly-free mode), you should place them in a directory as FASTA files (.fna).

For this tutorial, we will use reference genomes distributed with MIntO. These correspond to the 5 species included in the tutorial dataset.

[user@server mypath]$ tar xfz $MINTO_DIR/tutorial/genomes.tar.gz
[user@server mypath]$ ls genomes/
B_fragilis.fna  B_uniformis.fna  F_prausnitzii.fna  P_distasonis.fna  R_bromii.fna

We can then tell MIntO to find these reference genomes in /mypath/genomes. In genome-based assembly-free mode, the mapping.yaml will look like this:

######################
# Annotation settings
######################

# Set MIntO mode
# Where should we map reads to? MAG, refgenome, catalog
MINTO_MODE: refgenome

# Which omics for MAGs?
MAG_omics:

# path to gene catalog fasta file
PATH_reference: /mypath/genomes

# file name of gene catalog fasta file (MIntO will generate bwa index with same name)
NAME_reference:

Reference gene catalog

When running the gene-catalog based assembly-free mode (e.g. using the IGC), the mapping.yaml will look like this:

# Set MIntO mode
# Where should we map reads to? MAG, refgenome, catalog
MINTO_MODE: catalog
PATH_reference: /mypath/IGC_database
NAME_reference: IGC.faa.gz

Functional annotation of protein coding genes

In addition, the user can choose which software to use for annotating protein-coding genes: eggNOG, kofam (KEGG) and/or dbCAN. The complete mapping.yaml file for this tutorial is here. For this tutorial, we will use eggNOG-mapper and dbCAN to annotate the genomes.

# List of software used to perform genome function annotation:
# - dbCAN
# - kofam
# - eggNOG
ANNOTATION:
 - dbCAN
 - eggNOG

Taxonomic annotation

Users can choose to taxonomically annotate MAGs using PhyloPhlAn and/or GTDBtk:

# MAG taxonomy settings
#
RUN_TAXONOMY: yes
TAXONOMY_NAME: phylophlan,gtdb  # Currently, phylophlan or gtdb or combination
TAXONOMY_CPUS: 8
TAXONOMY_memory: 5

# Taxonomy database versions
#
PHYLOPHLAN_TAXONOMY_VERSION: SGB.Jun23
GTDB_TAXONOMY_VERSION: r220

The above parameters will run PhyloPhlAn with SGB.Jun23 taxonomy and/or GTDBtk with its database r220.

Launching gene annotation

gene_annotation.smk can be launched as:

[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/gene_annotation.smk --configfile mapping.yaml

When running gene_annotation.smk, a new directory called /mypath/IBD_tutorial/DB (database) will be generated in the working directory; it will contain the results of the genome annotation.

Both kofam and eggNOG provide KEGG annotations, but their results might not be identical. The union of KOs from both tools is in the column merged.KEGG_KO in 4-annotations/combined_annotations.tsv.

3.7. STEP7: Gene abundance

gene_abundance.smk uses the outputs of gene_annotation.smk to map reads to the annotated genomes and genes (/mypath/IBD_tutorial/DB) and calculates gene (omics: metaG) and transcript (omics: metaT) abundances. It shares mapping.yaml, which sets its parameters as well. The relevant ones are:

#########################
# Gene abundance settings
#########################

# BWA Alignment
BWA_threads: 10
BWA_memory: 45

# Alignment filtering
# -------------------
# msamtools_filter_length: Default 50 works well for ILLUMINA 2x150bp.
#                          For shorter sequencing, e.g. 2x75bp, please reduce accordingly.
# alignment_identity: Default 97 works well for MAGs that are dereped at 99%.
#                     For refgenomes, modify it according to the derep level of your refgenomes.
#                     For gene catalog, modify it to match the derep level of the gene catalog (usually 95%).
# MIN_mapped_reads:   Default 2 is designed for low-to-medium sequencing depth.
#                     Increase according to depth (e.g., set to 10 when depth is above 5Gb).
msamtools_filter_length: 50
alignment_identity: 97
MIN_mapped_reads: 2

# Normalization approach
# Could be TPM, MG or comma-delimited combinations
abundance_normalization: TPM,MG

The first group sets parameters for bwa-mem2. You can change them according to your computing infrastructure.

The second group sets parameters for msamtools, which filters alignments. The default settings will only consider reads mapping to the sequence database (MAGs, reference genomes, or reference gene catalogs) at 97% identity over at least 50bp. Additionally, each gene will only be considered a hit if at least 2 reads map to it.

The third group sets parameters for abundance normalization. TPM stands for "transcripts per million", where total abundances will sum to 1e6. On the other hand, MG stands for "marker genes", where abundance of each gene is normalized using the median abundance of 10 universal bacterial marker genes (COG0012, COG0016, COG0018, COG0172, COG0215, COG0495, COG0525, COG0533, COG0541, and COG0552).
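
For example, under TPM normalization, a gene attracting 1% of the filtered inserts in a sample gets an abundance of 0.01 * 1e6 = 10,000, which makes values comparable across samples with different sequencing depths.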

Launching abundance estimation

gene_abundance.smk should be launched once for each omic:

[user@server mypath]$ cd /mypath/IBD_tutorial/metaG
[user@server metaG]$ conda activate MIntO
(MIntO) [user@server metaG]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/gene_abundance.smk --configfile mapping.yaml

If you have metaT:

[user@server mypath]$ cd /mypath/IBD_tutorial/metaT
[user@server metaT]$ conda activate MIntO
(MIntO) [user@server metaT]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/gene_abundance.smk --configfile mapping.yaml

Take a look at the complete mapping.yaml file for metaG here and metaT here.

Once gene_abundance.smk has finished running, the configuration file data_integration.yaml will be available at /mypath/IBD_tutorial/.

3.8. STEP8: Data integration

Finally, we have arrived at the last part of MIntO... the most important step, where the data from metaG and metaT are integrated!

The data_integration.yaml is generated at /mypath/IBD_tutorial/.

The complete data_integration.yaml file for this tutorial is here.

In data_integration.yaml, omics has to be metaG_metaT in the # General parameters section in order to integrate the metaG and metaT results!
If you don't have both metaG and metaT, you can still run the script with omics set to the appropriate type to generate normalized abundance tables and plots. Read more about the required parameters here.
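
The relevant excerpt from data_integration.yaml would look like this (a minimal sketch showing only the field to check):

# General parameters
omics: metaG_metaT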

You can run data_integration.smk like this:

[user@server mypath]$ cd /mypath/IBD_tutorial
[user@server IBD_tutorial]$ conda activate MIntO
(MIntO) [user@server IBD_tutorial]$ snakemake $SNAKE_PARAMS --snakefile $MINTO_DIR/smk/data_integration.smk --configfile data_integration.yaml

Note: This step has to be run only once!

3.9. STEP9: Visualisation and outputs

All the relevant outputs for the user will be generated in the folder /mypath/IBD_tutorial/outputs:

  • assembly-free and assembly-based taxonomic profiles in 6-taxa_profile
  • gene profiles, including gene IDs (generated by Prokka in assembly-dependent mode, or sequence IDs in assembly-free mode) and normalized gene abundance, transcript abundance, or expression values
  • function profiles per database, including function IDs, function descriptions, and normalized function abundance, transcript, or expression values. For easier downstream analysis of these data, phyloseq objects are generated for the taxonomic, gene and function profiles.

In addition, MIntO also generates plots that provide a preliminary overview of the analyzed community and help the user with downstream analysis.
