Releases: arumugamlab/MIntO
MIntO 2.3.0 stable release
We have added a couple of new functionalities to MIntO (see Newly supported functionalities below), and made it leaner, meaner, faster and more efficient.
We also fixed a lot of bugs since `v2.2.0`. Outputs are still identical to `v2.2.0` within 6-decimal precision (except taxonomic profiles, which now include `Unknown` by default), so these bugs were not producing wrong results. But `v2.3.0` will give you a smoother experience.
Making a fresh installation?
Please see instructions in our Wiki page for installation
Upgrading from 2.0.0, 2.1.0 or 2.2.0?
Please do the following from where you cloned MIntO from github.
git pull
git checkout tags/2.3.0
After this, please rerun `dependencies.smk` using the same command you ran previously. This should finish without any complaints. If it does not finish successfully, please submit an issue report on github.
Newly supported functionalities
- Cluster-friendly read-mapping with bwa-mem2. If the nodes in the cluster are defined, MIntO will distribute bwa-index files to their local disks and use the local versions for mapping
  - See instructions in `MIntO/smk/include/bwa_index_wrapper.smk` to see how it works
  - See instructions in `<minto_dir>/site/cluster.py` and define variables as necessary
  - Remember to pass `resources.qsub_args` to batch submission via `Snakemake` arguments. E.g., for slurm, `--default-resources gpu=0 mem=4 "qsub_args=''" --cluster 'sbatch -J {name} --mem={resources.mem}G --gres=gpu:{resources.gpu} -c {threads} {resources.qsub_args}'`
  - Remember to delete the bwa-index files from the nodes when you are done by using `--config CLEAN_BWA_INDEX=True` arguments to `Snakemake`
  - See Snakemake argument FAQ for help with setting up `Snakemake` commandline
- Support of starting MIntO analysis half-way, if the input for that step is created properly (see FAQ for instructions)
- Automatic estimation of batch size based on available memory per task (`MAX_RAM_GB_PER_JOB`) for the binning preparation step (mapping each sample against all assemblies)
Software improvements
- Improved efficiency and speed
  - Gene profiling using `bam` and `bed` files is orders of magnitude faster with identical output due to switching `bedtools multicov` --> `samtools bedcov`
  - `coverm` behaves and stays within the threads it is provided, by limiting `fastANI` to use only one thread
  - dbCAN annotation is much faster using `hmmsearch` via `pyhmmer`
  - Distributing bwa-index files to local disks on clusters makes mapping significantly faster
- Smaller disk footprint
  - Not storing BAM files after mapping, thus decreasing the footprint of projects
  - Ignoring unnecessary columns from contig-depth files as input to binning
- Improved code maintainability
  - New config-parsing module makes much cleaner code
Changes
- Automatic estimation of memory requirement for several steps, thus removing several memory-related fields from `yaml` files
- Unclassified taxa from MetaPhlAn and mOTUs are reported at the top of file as 'Unknown'
- Functional annotation in batches to handle studies with 1000s of MAGs
Bug-fixes
- Gene abundance and expression PCA plots mislabelled the points. Please re-check any such plots you made with earlier versions
- Ensuring the `sample_alias` values are unique
Software and database version upgrades
- MetaPhlAn updated to v4.1.1
- mOTUs updated to v3.1.0
What's Changed
- Update gene_annotation.smk by @CJREID in #57
- Handle unassigned/unclassified taxa in metaphlan and motus by @microbiomix in #58
- Bugfix: eggnog db version by @jszarvas in #60
- Gene annotation in batches; Made it easier to skip QC and start MIntO with assembly or mapping to refgenome by @microbiomix in #61
- BUGFIX: Gene/function abundance/expression plots mislabels by @microbiomix in #62
- Fixed issues with 'MERGE_ILLUMINA_SAMPLES' directive; and other minor improvements by @microbiomix in #63
- Updated example yaml files; Improved search for runs per sample. by @microbiomix in #64
- Cluster-friendly implementation of bwa index to automatically distribute files to nodes by @microbiomix in #65
- Moved taxonomical annotation versions into gene_annotation by @jszarvas in #66
- Parallelized the bottleneck step of 'bedtools multicov' by @microbiomix in #67
- Using checkpoints to create assembly batches for depth calculation by @microbiomix in #68
- BWA index files are made as regular shadow rule and rsync'ed to nodes by @microbiomix in #69
- Replaced 'bedtools multicov' with 10X faster 'samtools bedcov' by @microbiomix in #70
- Saving space during binning - by gzipping files and ignoring useless columns by @microbiomix in #71
- Avoiding too many zcats and gzips by @microbiomix in #72
- Removed 'this.path' and 'include' dependencies by @arumugamlab in #73
- Speed-ups, additional input method, change in batches for binning and added raw read QC by @jszarvas in #74
- Cleanup of bwa index mirrors and automatic estimation of vamb memory usage by @microbiomix in #75
- Config parser module and other improvements by @microbiomix in #77
- Minor changes and bugfixes by @jszarvas in #78
New Contributors
- @CJREID made their first contribution in #57
- @arumugamlab made their first contribution in #73
Full Changelog: 2.2.0...2.3.0
MIntO 2.2.0 stable release
We have added a couple of new functionalities (see Newly supported functionalities below). This required that we move some config variables from one `yaml` file to another. Considering this a not-so-small change in behavior, we have upgraded MIntO to `v2.2.0`.
We also fixed a lot of bugs since `v2.1.0`. Outputs are still identical to `v2.1.0` within 6-decimal precision, so these bugs were not producing wrong results. But `v2.2.0` will give you a smoother experience.
Making a fresh installation?
Please see instructions in our Wiki page for installation
Upgrading from 2.0.0 or 2.1.0?
Please do the following from where you cloned MIntO from github.
git pull
git checkout tags/2.2.0
After this, please rerun `dependencies.smk` using the same command you ran previously. This should finish without any complaints. If it does not finish successfully, please submit an issue report on github.
Newly supported functionalities
- Taxonomic profile outputs from mapping reads to genomes in `MAG` and `refgenome` modes.
  - A standardized tabular output listing genomes and their taxonomic annotations together with their quantitative profiles. Users can process them to generate e.g., species profiles by grouping genomes from the same species.
- Co-assembly of samples using a newly defined variable `COAS_factor`. If it is absent, MIntO still defaults to the older behavior of coassembling samples using `MAIN_factor`.
- Capable of handling sample names with underscores and other special characters in them. But please try to avoid extremely special characters (e.g., single or double quotes, spaces, percentage signs) in sample names.
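The species-level grouping mentioned above can be sketched in a few lines of Python. This is only an illustration, not MIntO code: the genome IDs, species labels, sample names and abundance values are invented, and the real output table may use different column names and layout.

```python
from collections import defaultdict

# Hypothetical genome-level profile: genome ID -> (species label, abundance per sample).
# All IDs, labels and values are invented for illustration.
genome_profile = {
    "MAG_001": ("Bacteroides uniformis", {"S1": 0.20, "S2": 0.05}),
    "MAG_002": ("Bacteroides uniformis", {"S1": 0.10, "S2": 0.15}),
    "MAG_003": ("Faecalibacterium prausnitzii", {"S1": 0.30, "S2": 0.40}),
}


def species_profile(profile):
    """Sum genome abundances per species for every sample."""
    totals = defaultdict(lambda: defaultdict(float))
    for species, abundances in profile.values():
        for sample, value in abundances.items():
            totals[species][sample] += value
    # Convert nested defaultdicts into plain dicts for easier inspection.
    return {species: dict(samples) for species, samples in totals.items()}


species = species_profile(genome_profile)
for name, samples in sorted(species.items()):
    print(name, {s: round(v, 6) for s, v in sorted(samples.items())})
```

Genomes sharing a species label collapse into one row, so the two hypothetical Bacteroides uniformis MAGs contribute a single species-level abundance per sample.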
Changes/bug-fixes
- Moving taxonomic annotation
  - Adding taxonomic annotations to the genome profiles required that we move taxonomic annotation of genomes from `mags_generation.smk` to `gene_annotation.smk`
  - The above change also meant that configuration parameters for taxonomy moved from one `yaml` file to another.
- Removed `BINSPLIT_CHAR` from configuration
  - In order to support handling of underscore characters in sample names, we had to take full control of how sample names and contig names will be separated in fasta headers. Therefore, we removed the ability of users to configure `BINSPLIT_CHAR`. Not that someone would need to - we never needed to tinker with it. Hopefully, you wouldn't even notice this change.
- Simplified MAG/refgenome profile filenames
  - They are now more intuitively named `gene_abundances`, `genome_abundances`, and `contig_abundances`. This also avoids having `genome` as a suffix somewhere later in the name.
What's Changed
- Bugfix number of lines in concatting individual annotation files by @jszarvas in #53
- Optional COAS_factor, more flexible sample names and tax. profile plotting by @jszarvas in #54
- Taxonomic profiles from read-mapping in MAG/refgenome modes; other bug-fixes and improvements by @microbiomix in #55
- Upgraded test script to use v2.2.0 when released by @microbiomix in #56
Full Changelog: 2.1.0...2.2.0
MIntO 2.1.0 stable release
MIntO has gone through some more major improvements to handle large studies with 1000+ shotgun metagenomes. While version `2.0.0` focused on reducing runtime and memory for all the steps up to MAG generation, version `2.1.0` dramatically reduces resource requirements for the gene/functional profiling steps, without changing the outputs (within 6-decimal precision), except when we fixed bugs (see Changes/bug-fixes below). Profiling 100s of samples against MAGs/genomes with millions of genes is now possible, and efficient too, whereas analyzing them with `2.0.0` would run out of memory or reach `data.frame`'s in-built limits.
NOTE:
While MIntO can handle large studies, processing a gene profile table with 1000 samples and 15M genes will require lots of memory. In our hands, analysing such a study demanded 250GB memory at peak usage, but finished within 15 minutes.
There are some minor changes to the fields in configuration files, but most are still backwards-compatible with configuration files from version `2.0.0`.
Making a fresh installation?
Please see instructions in our Wiki page for installation
Upgrading from 2.0.0?
Please do the following from where you cloned MIntO from github.
git pull
git checkout tags/2.1.0
After this, please rerun `dependencies.smk` using the same command you ran previously. This should finish without any complaints. If it does not finish successfully, please submit an issue report on github.
Software improvements
- More user-friendly
  - Installation is further improved.
  - Example script has further improvements, made available as `run_IBD_test.sh`. Do check it out!
- Improved efficiency and speed
  - Migrated from `data.frame` to `data.table` in R scripts.
  - Large-scale `pandas`-based table-merging using python is replaced with faster `data.table`-based merging using R.
  - eggnog-mapper annotation now uses in-memory databases via `--dbmem` option.
  - Writing phyloseq objects out using the much faster `qsave()` rather than `saveRDS()`.
- Smaller disk footprint
  - Many useless intermediate files are not stored anymore.
  - Several unnecessary columns from GFF and BED files are removed.
  - Sample-level gene profile files used to keep all GFF columns in front of the actual abundance information. Now only the gene ID is kept.
  - Multi-sample profiles used an intermediate file with sample id in the column header, which was then remapped to a final file with the `sample_alias` in column header. For large studies, these files are several GBs each. The intermediate file is not used anymore.
  - MAG IDs have been dropped from the gene IDs. Gene IDs looked like `<MAG>|<GENE>` before, but since `<GENE>` included the unique locus tag from prokka, we decided to use that as a proxy for MAG IDs during calculations. If users need to know where a gene comes from, they can look at the file mapping locus tags to MAG IDs at: `DB/MAG/1-prokka/locus_id_list.txt` or `DB/refgenome/1-prokka/locus_id_list.txt`
  - Tutorial output folder reduced from 269 MB to 252 MB (6% reduction).
  - In a study with 5500 MAGs and 12.7M genes, single-sample profile files went from 525 MB (uncompressed, with `0` counts, 12.6M lines) to 19 MB (gzipped, without `0` counts, 3.4M lines). Projected savings for 100 samples will be 50 GB.
Changes/bug-fixes
- Functional database naming scheme
  - We admit: the naming in functional annotation was confusing. KEGG annotations from `eggnog-mapper` were called `KEGG_KO`, `KEGG_Module`, and `KEGG_Pathway`; while those from `kofamscan` were called `kofam_KO`, `kofam_Module`, and `kofam_Pathway`. This was not ideal. While this may not be a big deal for many, users would expect that `KEGG_Module` annotation will come from an official KEGG resource, that is, `kofamscan`.
  - We have now streamlined functional annotations using `<software>.<functional-category>` as the grammar for functional category names. Available functional categories are: `eggNOG.OGs`, `eggNOG.KEGG_KO`, `eggNOG.KEGG_Pathway`, `eggNOG.KEGG_Module`, `eggNOG.PFAMs`, `kofam.KEGG_KO`, `kofam.KEGG_Pathway`, `kofam.KEGG_Module`, `dbCAN.module`, `dbCAN.enzclass`, `dbCAN.subfamily`, `dbCAN.eCAMI_subfamily`, `dbCAN.eCAMI_submodule`, `dbCAN.EC`.
  - E.g., `KEGG_KO` annotations can be obtained from both `eggNOG-mapper` and `kofamscan`, which means that we have two versions: `eggNOG.KEGG_KO` and `kofam.KEGG_KO`. If you enable both `eggNOG` and `kofam` in functional annotation, MIntO will also make an additional category called `merged.KEGG_KO`, which merges the two `KEGG_KO` annotations.
- MIntO mode names
  - The three modes are now named: `MAG`, `refgenome` or `catalog`. Output directories are named after the modes.
  - They are specified using `MINTO_MODE` field in the yaml file.
  - `9-mapping-profiles` and `DB` directories will have subdirectories named by the MIntO mode (e.g., `9-mapping-profiles/MAG`, `DB/refgenome`).
  - `output/data_integration` will have subdirectories called `<MODE>-genes`, since the profiling is based on genes (e.g. `MAG-genes`).
  - Old names (e.g., `reference_genome`, `db_genes`) are still accepted for backwards-compatibility, but their corresponding output directories will be `refgenome` and `catalog`, respectively.
- Tabular files are strictly TSVs
  - All tabular output files are now tab-separated `.tsv` files. It used to be a mix of `.csv` and `.tsv` files, sometimes with the wrong file extension.
- `MIN_mapped_reads` bug fixed
  - `MIN_mapped_reads = N` in `mapping.yaml` file (input to `gene_abundance.smk`) was wrongly interpreted as `> N` in refgenome and MAG modes. This has now been fixed. This also means that outputs are not identical between `v2.0.0` and `v2.1.0+`, but the `v2.1.0+` results are more accurate.
- Gene coordinate bug fixed
  - `BED` files were inadvertently changed to have 1-based coordinates like in GFF files, leading to wrong output from `bedtools multicov`. This has now been fixed. This also means that outputs are not identical between `v2.0.0` and `v2.1.0+`, but the `v2.1.0+` results are more accurate.
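For readers unfamiliar with the coordinate issue behind the last fix: GFF intervals are 1-based with an inclusive end, while BED intervals are 0-based with an exclusive end. The sketch below shows the standard conversion between the two conventions; it is not taken from MIntO's code, and the contig name and coordinates are hypothetical.

```python
def gff_to_bed(seqid, gff_start, gff_end):
    """Convert a 1-based, end-inclusive GFF interval to a 0-based, half-open BED interval."""
    if gff_start < 1 or gff_end < gff_start:
        raise ValueError("invalid GFF interval")
    # Only the start shifts by one; the inclusive GFF end equals the exclusive BED end.
    return seqid, gff_start - 1, gff_end


def bed_length(bed_start, bed_end):
    """Length of a BED interval: the end is exclusive, so no +1 is needed."""
    return bed_end - bed_start


# Hypothetical gene covering bases 101..400 (300 bp) of a hypothetical contig.
seqid, start, end = gff_to_bed("contig_1", 101, 400)
print(seqid, start, end)       # contig_1 100 400
print(bed_length(start, end))  # 300
```

Writing GFF-style starts into a BED file shifts every interval start by one base, which is enough to change which reads are counted as overlapping a gene.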
Software and database version upgrades
- dbCAN updated to v4.1.4, still maintaining HMMdb v12 (based on CAZyDB 7/26/2023). See info here.
- MetaPhlAn remains in v4.0.6 but its database has been updated to 'mpa_vJun23_CHOCOPhlAnSGB_202403'.
- PhyloPhlAn updated to v3.1.1 and SGB database to Jun23, so both PhyloPhlAn-annotated MAG-based profiles and MetaPhlAn profiles will map to the same SGB features.
- GTDB-Tk has been updated to v2.4.0 and GTDB release has been updated to r220.
Newly supported functionalities
- Versioning of software and database components in directory `output/versions`
  - All software/package versions are registered. This is useful when writing up methods section in manuscript.
  - All functional/taxonomic annotation software versions and annotation file timestamps are also registered.
- Bulk input of fastq files using a tsv file that maps fastq files to sample ids.
Improved gene/functional profiling
- Gene profile output will not list genes with `metaG==0` or `metaT==0` across all samples. This also includes `GE.tsv` where `metaG==0` leads to `Inf` and `metaT==0` leads to `0`.
- Function expression profiling (derived from ratio of functions in metaT data and metaG data) will now report features where some (but not all) samples have 0 abundance in metaG.
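The filtering rule above can be illustrated with a small sketch. It assumes that gene expression is the per-sample ratio metaT/metaG, as the `Inf` and `0` cases suggest; the gene names and values are invented, and the handling of a sample where both values are zero (NaN here) is an assumption not stated in these notes.

```python
import math

# Hypothetical abundances per gene across three samples (invented values).
metaG = {"gene_a": [2.0, 4.0, 0.0], "gene_b": [0.0, 0.0, 0.0], "gene_c": [1.0, 2.0, 3.0]}
metaT = {"gene_a": [1.0, 2.0, 3.0], "gene_b": [5.0, 0.0, 1.0], "gene_c": [0.0, 0.0, 0.0]}


def ratio(t, g):
    """metaT / metaG for one sample."""
    if g == 0:
        # Inf when there is expression without abundance; NaN when both are zero (assumption).
        return math.inf if t > 0 else math.nan
    return t / g


def gene_expression(metaT, metaG):
    """Build the expression table, skipping genes that are zero everywhere in metaG or metaT."""
    table = {}
    for gene, g_vals in metaG.items():
        t_vals = metaT[gene]
        if all(v == 0 for v in g_vals) or all(v == 0 for v in t_vals):
            continue  # would be all-Inf/NaN or all-0, so it carries no information
        table[gene] = [ratio(t, g) for t, g in zip(t_vals, g_vals)]
    return table


ge = gene_expression(metaT, metaG)
print(ge)
```

Here only `gene_a` survives: `gene_b` has no metaG signal in any sample and `gene_c` has no metaT signal in any sample. The third sample of `gene_a` is still reported as `Inf` because the gene is present in the other samples.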
What's Changed
- Update to MAG dereplication with CoverM v0.7 by @jszarvas in #17
- FracMinHash clustering by @jszarvas in #18
- ensure smash_plot_output() is not None by @cpauvert in #20
- GTDB support, bwa2 index memory, and kofamscan-speedup by @microbiomix in #22
- Bugfix umap min. neighbors by @jszarvas in #23
- Gene table normalization made faster and more efficient; plus some bugfixes and improvement by @microbiomix in #24
- Updated GTDB db location and some more improvements by @microbiomix in #25
- MAG taxonomy bug by @microbiomix in #27
- Update to dbCAN 4.1 and switch to individual processing of MAGs by @jszarvas in #26
- Function profiling enabled for large number of MAGs by @microbiomix in #28
- Bugfixes and speed-ups for gene_annotation and mags_generation by @jszarvas in #29
- New minor features; updated yaml file examples; bug fixes by @microbiomix in #30
- Function profiling enabled for large number of MAGs by @microbiomix in #31
- Renamed functional categories to be more structured and logical by @microbiomix in #32
- Minor improvements by @microbiomix in #33
- Update collate script and bugfixes by @jszarvas in #34
- Bringing back explicit arguments to annotation collation script by @microbiomix in #35
- Bringing back lost kofamscan runtime fix (d1fe76a) by @microbiomix in #36
- Several improvements in annotation and profiling - outputs remain unchanged by @microbiomix in #37
- Gene/functional profiling - Saving disk space and runtime by further optimization by @mi...
MIntO 2.0.0 stable release
MIntO has gone through major improvements. We have gone straight to version `2.0.0` as the directory structures and configuration field names have changed dramatically. This means that old configuration files are not compatible with version `2.0.0` onwards.
Software improvements
- More user-friendly:
  - Installation is even more simplified.
  - Example script to run the entire tutorial, which we have tested extensively, made available as `run_IBD_test.sh`. Do check it out!
  - Example TruSeq and MGIEasy sequencing adapters included in distribution.
- Improved efficiency and speed:
  - Use of `shadow` directives lets `Snakemake` work on local disk as much as possible. This means `--shadow-prefix` is a mandatory argument to `Snakemake` now.
  - Flow of processes (rules) optimized so that there is no heavy I/O between rules using NFS.
  - Increased use of pipes rather than files.
  - Bottleneck steps that summarized all samples at the same time (e.g. `bedtools multicov`, and `coverm contig`) have been replaced with parallel runs with single samples and `pandas`-based table-merging.
  - Bottleneck steps that serially profiled multiple functional databases inside one process (e.g. functional quantification) have been replaced with parallel runs with individual databases.
- Code optimization:
  - Embraced `Snakemake` style input-output connection via wildcards in nearly all rules rather than bulk outputs.
  - Removed several obsolete scripts and files.
  - Unified multiple mode-specific normalization scripts into a global script.
  - Unified multiple mode-specific functional quantification scripts into a global script.
  - Descriptions of eggNOG, KEGG, dbCAN functional databases are automatically downloaded and set up when running `dependencies.smk`. This lets the user get matching databases and descriptions at the time of installation, e.g. KEGG orthology definitions and descriptions.
Software version upgrades
- SPAdes updated to 3.15.5.
- MetaPhlAn updated to v4.0.6.
- mOTUs updated to v3.0.3.
- Genome binner updated to AVAMB v4.1.3.
- Genome completeness check updated to CheckM2 v1.0.1.
- dbCAN updated to dbCAN3 HMMdb v12 (based on CAZyDB 7/26/2023). See info here.
Newly supported functionalities
- New script `QC_0.smk` to enable reanalysis of publicly available datasets.
- Taxonomic profiling, assembly and MAG generation now enabled for `metaT`.
- Taxonomy profiles are tagged with profiler version, so that upgrading a profiler does not overwrite existing results.
- External reference genomes can be given as just `.fna` files and MIntO will perform genome annotation.
- Replicate samples can be merged post QC into a single sample using `MERGE_ILLUMINA_SAMPLES` section in the YAML file.
- It is now possible to provide custom kmer-sizes for MEGAHIT coassembly.
- MAGs are made using AVAMB with a combination of VAE and AAE, as recommended by AVAMB developers.
- VAMB GPU mode can be enabled if the server installing via `dependencies.smk` has CUDA drivers installed. (Automatically generated `mags.yaml` still disables GPU mode to avoid issues. User has to manually enable it before running the MAG generation step.)
Improved functional quantification
MG normalization (only available in refgenome or MAG mode)
- We have updated how to quantify functional categories under marker gene (MG) normalization. Quantification is a little bit more involved now.
- Weighted sum of genes belonging to a given functional unit (e.g. K12345), weighted by the relative abundance of the MAG/refgenome it belongs to.
- Weights for metaG/metaT gene profiles are derived from relative abundance from metaG/metaT space, respectively.
- The way to interpret this: If each species in the community carries the functional unit and expresses it at the same level as marker genes, it will be 1.0.
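One way to read the weighted-sum description above is sketched below. This is an interpretation for illustration only, not MIntO's actual implementation: all IDs and values are hypothetical, and it assumes that gene abundances are already MG-normalized (1.0 means "at the level of the marker genes") and that genome relative abundances sum to 1.

```python
# Hypothetical inputs (all values invented for illustration).
gene_abundance = {"g1": 1.0, "g2": 1.0, "g3": 1.0}            # MG-normalized gene abundances
gene_to_genome = {"g1": "MAG_A", "g2": "MAG_B", "g3": "MAG_C"}
genome_rel_abundance = {"MAG_A": 0.5, "MAG_B": 0.3, "MAG_C": 0.2}
function_to_genes = {"K12345": ["g1", "g2", "g3"], "K99999": ["g1"]}


def weighted_function_abundance(function_id, genes_by_function, abundance, genome_of, weights):
    """Sum gene abundances for one function, each weighted by its genome's relative abundance."""
    total = 0.0
    for gene in genes_by_function[function_id]:
        total += weights[genome_of[gene]] * abundance[gene]
    return total


everywhere = weighted_function_abundance(
    "K12345", function_to_genes, gene_abundance, gene_to_genome, genome_rel_abundance
)
one_genome = weighted_function_abundance(
    "K99999", function_to_genes, gene_abundance, gene_to_genome, genome_rel_abundance
)
print(round(everywhere, 6), round(one_genome, 6))
```

Under these assumptions, a function carried by every genome at marker-gene level adds up to 1.0, matching the interpretation given above, while a function carried only by the hypothetical `MAG_A` adds up to that genome's relative abundance.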
TPM normalization mode
- We have updated how to quantify functional categories under TPM normalization.
- Quantification is straightforward: sum up all genes mapped to a given functional unit (e.g. K12345).
- This also means that the sum of functional profiles is not bounded by 1 million anymore. Since each gene can map to multiple KEGG KOs, for example, the total might exceed 1 million. We have decided not to renormalize to 1 million, as this will unfairly alter abundances when more promiscuous genes are present.
- The way to interpret this: Given a million transcripts, this is the number of transcripts representing this function. This is more comparable across samples.
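The TPM-mode summation described above, and why the total can exceed 1 million, can be shown with a toy example. The gene IDs, KO assignments and TPM values are hypothetical; this is a sketch of the idea rather than MIntO's code.

```python
# Hypothetical gene-level TPM values that sum to exactly 1 million.
gene_tpm = {"g1": 600000.0, "g2": 300000.0, "g3": 100000.0}

# Hypothetical annotations: g1 maps to two KOs (a "promiscuous" gene).
gene_to_functions = {"g1": ["K00001", "K00002"], "g2": ["K00001"], "g3": ["K00003"]}


def function_tpm(tpm_by_gene, functions_by_gene):
    """Sum the TPM of every gene mapped to each functional unit."""
    totals = {}
    for gene, tpm in tpm_by_gene.items():
        for func in functions_by_gene.get(gene, []):
            totals[func] = totals.get(func, 0.0) + tpm
    return totals


profile = function_tpm(gene_tpm, gene_to_functions)
print(profile)
print(sum(profile.values()))
```

Because `g1` contributes its full TPM to both of its KOs, the functional profile sums to 1.6 million even though the gene profile sums to 1 million.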
What's Changed via PRs
- Removed pip dependencies for easier installation by @askerdb in #1
- Update to remove pip dependencies by @askerdb in #2
- Updated to remove pip dependencies by @askerdb in #3
- Switch Flye to use conda by @jszarvas in #10
- Made light rules local by @jszarvas in #11
- Alternative qc and trimming for external data (QC_0) & custom megahit settings by @jszarvas in #12
- Simple slurm cluster congfiguration by @jszarvas in #14
- Bugfix feature custom megahit by @jszarvas in #13
New Contributors
Full Changelog: 1.0.1...2.0.0
MIntO 2.0.0 first beta version
MIntO has gone through major improvements. We have gone straight to version `2.0.0` as the directory structures and configuration field names have changed dramatically. This means that old configuration files are not compatible with version `2.0.0` onwards.
Improvements:
- Much faster due to several optimizations under the hood.
- More user-friendly:
- Installation is even more simplified.
- Example script to run the entire tutorial, which we have tested extensively.
Software version upgrades:
- MetaPhlAn updated to v4.0.6.
- mOTUs updated to v3.0.3.
- Genome binner updated to AVAMB v4.1.1.
- Genome completeness check updated to CheckM2 v1.0.1.
Newly supported features:
- Taxonomic profiling, assembly and MAG generation now enabled for `metaT`.
- Taxonomy profiles are tagged with profiler version, so that upgrading a profiler does not overwrite existing results.
- External reference genomes can be given as just `.fna` files and MIntO will perform genome annotation.
- Replicate samples can be merged for post analysis.
- MAGs are made using AVAMB with a combination of VAE and AAE, as recommended by AVAMB developers.
- VAMB GPU mode can be enabled if the server installing via `dependencies.smk` has CUDA drivers installed. (Automatically generated `mags.yaml` still disables GPU mode to avoid issues. User has to manually enable it before running the MAG generation step.)
MIntO 1.0.1
Minor update that enables setting a minimum number of mapped reads to a gene, prior to the normalization step, for the gene to be considered “expressed”:
- `MIN_mapped_reads`: Filter the aligned reads by establishing the minimum number of mapped reads to a gene. This parameter is included in `mapping.yaml`, when running `gene_abundance.smk`.
`msamtools` has been updated to v1.1.0.
Additionally, `Flye` has been updated to v2.9.
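As a rough illustration of what a minimum-read threshold does, the sketch below applies a hypothetical `MIN_mapped_reads` value to invented read counts. Reading the threshold as "at least N reads" follows from the word "minimum"; setting sub-threshold genes to zero rather than dropping them is an assumption made here for illustration.

```python
MIN_mapped_reads = 2  # hypothetical threshold value

# Hypothetical read counts per gene in one sample.
read_counts = {"gene_a": 0, "gene_b": 1, "gene_c": 2, "gene_d": 7}


def apply_min_reads(counts, min_reads):
    """Keep counts that reach the minimum; set the rest to zero (assumed behavior)."""
    return {gene: (n if n >= min_reads else 0) for gene, n in counts.items()}


filtered = apply_min_reads(read_counts, MIN_mapped_reads)
print(filtered)  # {'gene_a': 0, 'gene_b': 0, 'gene_c': 2, 'gene_d': 7}
```

Note that `gene_c`, with exactly 2 reads, passes the `>=` comparison; a strict `>` comparison would have discarded it.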
Full Changelog: 1.0.0...1.0.1
MIntO 1.0.0
First publicly available release.