Release MIntO 2.1.0 stable release · arumugamlab/MIntO

MIntO has undergone further major improvements to handle large studies with 1000+ shotgun metagenomes. While version 2.0.0 focused on reducing runtime and memory for all steps up to MAG generation, version 2.1.0 dramatically reduces resource requirements for the gene/functional profiling steps without changing the outputs (within 6-decimal precision), except where we fixed bugs (see Changes/bug-fixes below). Profiling 100s of samples against MAGs/genomes with millions of genes is now possible and efficient, whereas analyzing them with 2.0.0 would run out of memory or hit data.frame's built-in limits.

NOTE:

While MIntO can handle large studies, processing a gene profile table with 1000 samples and 15M genes still requires a lot of memory. In our hands, analysing such a study demanded 250 GB of memory at peak usage but finished within 15 minutes.

There are some minor changes to the fields in configuration files, but most are still backwards-compatible with configuration files from version 2.0.0.

Making a fresh installation?

Please see the installation instructions on our Wiki page.

Upgrading from 2.0.0?

Please run the following in the directory where you cloned MIntO from GitHub.

git pull
git checkout tags/2.1.0

After this, please rerun dependencies.smk using the same command you ran previously. This should finish without any complaints. If it does not finish successfully, please submit an issue report on GitHub.

Software improvements

More user-friendly
- Installation is further improved.
- Example script has further improvements, made available as run_IBD_test.sh. Do check it out!
Improved efficiency and speed
- Migrated from data.frame to data.table in R scripts.
- Large-scale pandas-based table-merging using python is replaced with faster data.table-based merging using R.
- eggnog-mapper annotation now uses in-memory databases via --dbmem option.
- Writing phyloseq objects out using the much faster qsave() rather than saveRDS().
Smaller disk footprint
- Many useless intermediate files are not stored anymore.
- Several unnecessary columns from GFF and BED files are removed.
- Sample-level gene profile files used to keep all GFF columns in front of the actual abundance information. Now only the gene ID is kept.
- Multi-sample profiles used an intermediate file with the sample id in the column header, which was then remapped to a final file with the sample_alias in the column header. For large studies, these files are several GBs each. The intermediate file is no longer used.
- MAG IDs have been dropped from the gene IDs. Gene IDs looked like <MAG>|<GENE> before, but since <GENE> included the unique locus tag from prokka, we decided to use that as a proxy for MAG IDs during calculations. If users need to know where a gene comes from, they can look at the file mapping locus tags to MAG IDs at: DB/MAG/1-prokka/locus_id_list.txt or DB/refgenome/1-prokka/locus_id_list.txt
- Tutorial output folder reduced from 269 MB to 252 MB (6% reduction).
- In a study with 5500 MAGs and 12.7M genes, single-sample profile files went from 525 MB (uncompressed, with 0 counts, 12.6M lines) to 19 MB (gzipped, without 0 counts, 3.4M lines). Projected savings for 100 samples are about 50 GB.
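
For users who need to trace a gene back to its MAG after this change, the lookup can be sketched as below. This is only an illustration: it assumes that locus_id_list.txt is a two-column tab-separated file (MAG ID, then locus tag) and that gene IDs follow prokka's `<locus_tag>_<number>` pattern. The column order and the function names are assumptions, so check them against your own files.

```python
import csv
import io

def load_locus_map(handle):
    """Read locus tag -> MAG ID pairs.

    ASSUMED layout (not confirmed by the release notes):
    MAG ID <tab> locus tag, one pair per line.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            mag_id, locus_tag = row[0], row[1]
            mapping[locus_tag] = mag_id
    return mapping

def gene_to_mag(gene_id, locus_map):
    """Strip the trailing _<number> from a gene ID and look up its MAG."""
    locus_tag = gene_id.rsplit("_", 1)[0]
    return locus_map.get(locus_tag)

# Toy data standing in for DB/MAG/1-prokka/locus_id_list.txt
example = io.StringIO("MAG001\tABCDEFGH\nMAG002\tIJKLMNOP\n")
locus_map = load_locus_map(example)
print(gene_to_mag("ABCDEFGH_00042", locus_map))
```

With a real file, replace the `io.StringIO` object with `open("DB/MAG/1-prokka/locus_id_list.txt")`.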

Changes/bug-fixes

Functional database naming scheme
- We admit: the naming in functional annotation was confusing. KEGG annotations from eggnog-mapper were called KEGG_KO, KEGG_Module, and KEGG_Pathway, while those from kofamscan were called kofam_KO, kofam_Module, and kofam_Pathway. This was not ideal. It may not matter to many users, but one would expect a KEGG_Module annotation to come from an official KEGG resource, which is kofamscan and not eggnog-mapper.
- We have now streamlined functional annotations using <software>.<functional-category> as the grammar for functional category names. Available functional categories are: eggNOG.OGs, eggNOG.KEGG_KO, eggNOG.KEGG_Pathway, eggNOG.KEGG_Module, eggNOG.PFAMs, kofam.KEGG_KO, kofam.KEGG_Pathway, kofam.KEGG_Module, dbCAN.module, dbCAN.enzclass, dbCAN.subfamily, dbCAN.eCAMI_subfamily, dbCAN.eCAMI_submodule, dbCAN.EC.
- E.g., KEGG_KO annotations can be obtained from both eggNOG-mapper and kofamscan, which means that we have two versions: eggNOG.KEGG_KO and kofam.KEGG_KO. If you enable both eggNOG and kofam in functional annotation, MIntO will also make an additional category called merged.KEGG_KO, which merges the two KEGG_KO annotations.
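
The naming grammar above is easy to handle in downstream scripts. The sketch below splits a category name into its software and category parts, and shows one possible reading of "merged": a per-gene union of the two KEGG_KO annotation sets. The union semantics, the function names, and the toy annotations are assumptions for illustration, not a description of how MIntO builds merged.KEGG_KO.

```python
def parse_category(name):
    """Split '<software>.<functional-category>' at the first dot."""
    software, category = name.split(".", 1)
    return software, category

def merge_annotations(eggnog, kofam):
    """Combine per-gene annotations from two sources.

    ASSUMPTION: 'merged' is taken to mean a set union per gene.
    """
    merged = {}
    for gene in set(eggnog) | set(kofam):
        terms = set(eggnog.get(gene, [])) | set(kofam.get(gene, []))
        merged[gene] = sorted(terms)
    return merged

print(parse_category("dbCAN.eCAMI_subfamily"))

# Toy annotations: gene ID -> list of KO terms
eggnog_ko = {"gene1": ["K00001"], "gene2": ["K00002"]}
kofam_ko = {"gene1": ["K00001", "K00003"], "gene3": ["K00004"]}
merged_ko = merge_annotations(eggnog_ko, kofam_ko)
print(merged_ko["gene1"])
```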
MIntO mode names
- The three modes are now named: MAG, refgenome or catalog. Output directories are named after the modes.
- They are specified using MINTO_MODE field in the yaml file.
- 9-mapping-profiles and DB directories will have subdirectories named by the MIntO mode (e.g., 9-mapping-profiles/MAG, DB/refgenome).
- output/data_integration will have subdirectories called <MODE>-genes, since the profiles are based on genes (e.g., MAG-genes).
- Old names (e.g., reference_genome, db_genes) are still accepted for backwards-compatibility, but their corresponding output directories will be refgenome and catalog, respectively.
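
The relationship between mode names and output directories can be sketched as follows. The two legacy names are the ones listed above; any other legacy names, the helper names, and the dictionary keys are hypothetical, and this is not MIntO's own code.

```python
# Legacy-to-current names taken from the release notes; other legacy
# names may exist but are not listed there.
LEGACY_MODE_NAMES = {"reference_genome": "refgenome", "db_genes": "catalog"}
VALID_MODES = {"MAG", "refgenome", "catalog"}

def normalize_mode(mode):
    """Map a legacy MINTO_MODE value to its current name and validate it."""
    mode = LEGACY_MODE_NAMES.get(mode, mode)
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown MINTO_MODE: {mode}")
    return mode

def output_dirs(mode):
    """Build the mode-specific directory names described in the notes."""
    mode = normalize_mode(mode)
    return {
        "profiles": f"9-mapping-profiles/{mode}",
        "db": f"DB/{mode}",
        "integration": f"output/data_integration/{mode}-genes",
    }

print(output_dirs("reference_genome"))
```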
Tabular files are strictly TSVs
- All tabular output files are now tab-separated .tsv files. It used to be a mix of .csv and .tsv files, sometimes with the wrong file extension.
MIN_mapped_reads bug fixed
- MIN_mapped_reads = N in mapping.yaml file (input to gene_abundance.smk) was wrongly interpreted as > N in refgenome and MAG modes. This has now been fixed. This also means that outputs are not identical between v2.0.0 and v2.1.0+, but the v2.1.0+ results are more accurate.
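
The effect of this off-by-one threshold can be illustrated with a toy filter. The sketch assumes that MIN_mapped_reads = N is meant as "at least N mapped reads"; the function names and read counts are made up, and this is not MIntO's code.

```python
def passes_old(count, n):
    """Behaviour described as the bug: strictly more than N reads."""
    return count > n

def passes_fixed(count, n):
    """ASSUMED intended behaviour: at least N reads."""
    return count >= n

# Toy mapped-read counts per feature
counts = {"featA": 1, "featB": 2, "featC": 3}
n = 2

kept_old = [f for f, c in counts.items() if passes_old(c, n)]
kept_fixed = [f for f, c in counts.items() if passes_fixed(c, n)]
print(kept_old, kept_fixed)
```

Features with exactly N mapped reads are the only ones affected, which is why results differ between v2.0.0 and v2.1.0+ only at the threshold.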
Gene coordinate bug fixed
- BED files were inadvertently changed to have 1-based coordinates like in GFF files, leading to wrong output from bedtools multicov. This has now been fixed. This also means that outputs are not identical between v2.0.0 and v2.1.0+, but the v2.1.0+ results are more accurate.
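
For reference, the two coordinate conventions differ as follows: GFF intervals are 1-based and inclusive at both ends, while BED intervals have a 0-based start and an exclusive end. A minimal conversion sketch (the function name is hypothetical and this is not MIntO's code):

```python
def gff_to_bed(start, end):
    """Convert a GFF interval (1-based, inclusive) to BED (0-based, end-exclusive)."""
    if start < 1 or end < start:
        raise ValueError(f"Invalid GFF interval: {start}-{end}")
    return start - 1, end

gff_start, gff_end = 101, 400
bed_start, bed_end = gff_to_bed(gff_start, gff_end)

# The feature length is the same under both conventions.
gff_length = gff_end - gff_start + 1
bed_length = bed_end - bed_start
print(bed_start, bed_end, gff_length, bed_length)
```

Writing the GFF start unchanged into a BED file shifts the interval start by one base, which changes which reads bedtools multicov counts as overlapping.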

Software and database version upgrades

dbCAN updated to v4.1.4, still maintaining HMMdb v12 (based on CAZyDB 7/26/2023). See info here.
MetaPhlAn remains at v4.0.6 but its database has been updated to 'mpa_vJun23_CHOCOPhlAnSGB_202403'.
PhyloPhlAn updated to v3.1.1 and SGB database to Jun23, so both PhyloPhlAn-annotated MAG-based profiles and MetaPhlAn profiles will map to the same SGB features.
GTDB-Tk has been updated to v2.4.0 and GTDB release has been updated to r220.

Newly supported functionalities

Versioning of software and database components in directory output/versions
- All software/package versions are registered. This is useful when writing up the methods section of a manuscript.
- All functional/taxonomic annotation software versions and annotation file timestamps are also registered.
Bulk input of fastq files using a tsv file that maps fastq files to sample ids.

Improved gene/functional profiling

Gene profile output will not list genes with metaG==0 or metaT==0 across all samples. This also includes GE.tsv where metaG==0 leads to Inf and metaT==0 leads to 0.
Function expression profiling (derived from ratio of functions in metaT data and metaG data) will now report features where some (but not all) samples have 0 abundance in metaG.
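
The row filtering described for gene profiles can be sketched as below: a gene is kept only if it has a non-zero value in at least one sample. The table layout (gene ID mapped to a list of per-sample values) and the function name are assumptions for illustration; MIntO's own implementation uses data.table in R.

```python
def drop_all_zero_genes(profile):
    """Keep only genes with a non-zero value in at least one sample.

    `profile` maps gene ID -> list of per-sample abundances
    (a hypothetical layout chosen for this sketch).
    """
    return {
        gene: values
        for gene, values in profile.items()
        if any(v != 0 for v in values)
    }

# Toy metaG profile with three samples
metaG = {
    "gene1": [0, 0, 0],   # zero everywhere: dropped
    "gene2": [5, 0, 2],   # zero in some samples only: kept
    "gene3": [0, 1, 0],
}
filtered = drop_all_zero_genes(metaG)
print(sorted(filtered))
```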

What's Changed

Update to MAG dereplication with CoverM v0.7 by @jszarvas in #17
FracMinHash clustering by @jszarvas in #18
ensure smash_plot_output() is not None by @cpauvert in #20
GTDB support, bwa2 index memory, and kofamscan-speedup by @microbiomix in #22
Bugfix umap min. neighbors by @jszarvas in #23
Gene table normalization made faster and more efficient; plus some bugfixes and improvement by @microbiomix in #24
Updated GTDB db location and some more improvements by @microbiomix in #25
MAG taxonomy bug by @microbiomix in #27
Update to dbCAN 4.1 and switch to individual processing of MAGs by @jszarvas in #26
Function profiling enabled for large number of MAGs by @microbiomix in #28
Bugfixes and speed-ups for gene_annotation and mags_generation by @jszarvas in #29
New minor features; updated yaml file examples; bug fixes by @microbiomix in #30
Function profiling enabled for large number of MAGs by @microbiomix in #31
Renamed functional categories to be more structured and logical by @microbiomix in #32
Minor improvements by @microbiomix in #33
Update collate script and bugfixes by @jszarvas in #34
Bringing back explicit arguments to annotation collation script by @microbiomix in #35
Bringing back lost kofamscan runtime fix (d1fe76a) by @microbiomix in #36
Several improvements in annotation and profiling - outputs remain unchanged by @microbiomix in #37
Gene/functional profiling - Saving disk space and runtime by further optimization by @microbiomix in #38
Removed unnecessary 'mag_omics' from 'data_integration.yaml' by @microbiomix in #39
Efficient reimplementation of generating & merging single-sample profile files to save disk space, memory and time. Plus other efficiency improvements. by @microbiomix in #40
Software and database versions by @jszarvas in #41
Streamlining/simplifying directory structure, using GNU parallel within rules for further speedup, smaller output file footprints by @microbiomix in #42
Bugfix motus version print after moving the db by @jszarvas in #43
TSVs only for tabular output; BED coordinate bugfix; more file compression by @microbiomix in #44
Feature: bulk input in one folder by @jszarvas in #45
Upgraded metaphlan, phylophlan and GTDB versions/databases. Other minor improvements by @microbiomix in #46
Added shadow directive to gtdb download by @jszarvas in #47
Removed unnecessary config variables from yaml files by @microbiomix in #48
Removed redundant rules; Moved marker gene identification from gene_abundance to gene_annotation. by @microbiomix in #49
Reordered rules and added comments for easier comprehension by @microbiomix in #50
Safely handling sample names as strings; Other bug fixes by @microbiomix in #51

New Contributors

@cpauvert made their first contribution in #20
@microbiomix made their first contribution in #22

Full Changelog: 2.0.0...2.1.0
