Releases · arumugamlab/MIntO

19 Dec 15:56

2.3.0

1c3748e

Latest

We have added a couple of new functionalities to MIntO (see Newly supported functionalities below) and made it leaner, meaner, faster and more efficient.

We also fixed a lot of bugs since v2.2.0. Outputs are still identical to v2.2.0 within 6-decimal precision (except taxonomic profiles, which now include Unknown by default), so these bugs were not producing wrong results. But v2.3.0 will give you a smoother experience.

Making a fresh installation?

Please see instructions in our Wiki page for installation

Upgrading from 2.0.0, 2.1.0 or 2.2.0?

Please run the following in the directory where you cloned MIntO from GitHub.

git pull
git checkout tags/2.3.0

After this, please rerun dependencies.smk using the same command you ran previously. This should finish without any complaints. If it does not finish successfully, please submit an issue report on GitHub.

Newly supported functionalities

Cluster-friendly read-mapping with bwa-mem2. If the cluster nodes are defined, MIntO will distribute the bwa-index files to each node's local disk and use the local copies for mapping.
- See instructions in MIntO/smk/include/bwa_index_wrapper.smk to see how it works
- See instructions in <minto_dir>/site/cluster.py and define variables as necessary
- Remember to pass resources.qsub_args to batch submission via Snakemake arguments. E.g., for slurm, --default-resources gpu=0 mem=4 "qsub_args=''" --cluster 'sbatch -J {name} --mem={resources.mem}G --gres=gpu:{resources.gpu} -c {threads} {resources.qsub_args}'
- Remember to delete the bwa-index files from the nodes when you are done by using --config CLEAN_BWA_INDEX=True arguments to Snakemake
- See Snakemake argument FAQ for help with setting up Snakemake commandline
Support for starting a MIntO analysis half-way, provided the input for that step has been created properly (see FAQ for instructions)
Automatic estimation of batch size based on available memory per task (MAX_RAM_GB_PER_JOB) for the binning preparation step (mapping each sample against all assemblies)

Software improvements

Improved efficiency and speed
- Gene profiling using bam and bed files is orders of magnitude faster, with identical output, after switching from bedtools multicov to samtools bedcov
- coverm behaves and stays within the threads it is provided, by limiting fastANI to use only one thread
- dbCAN annotation is much faster using hmmsearch via pyhmmer
- Distributing bwa-index files to local disks on clusters makes mapping significantly faster
Smaller disk footprint
- Not storing BAM files after mapping, thus decreasing the footprint of projects
- Ignoring unnecessary columns from contig-depth files as input to binning
Improved code maintainability
- New config-parsing module makes much cleaner code

Changes

Automatic estimation of memory requirement for several steps, thus removing several memory-related fields from yaml files
Unclassified taxa from MetaPhlAn and mOTUs are reported at the top of file as 'Unknown'
Functional annotation in batches to handle studies with 1000s of MAGs

Bug-fixes

Gene abundance and expression PCA plots labelled the points wrong. If you have used these plots before, please regenerate and recheck them
Ensuring the sample_alias values are unique
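
The uniqueness requirement on sample_alias can be checked before starting a run. The following is a minimal sketch, not MIntO's own check: the sample-id to sample_alias mapping and the function name find_duplicate_aliases are hypothetical and only illustrate the idea.

```python
from collections import Counter


def find_duplicate_aliases(sample_alias_by_id):
    """Return the sorted sample_alias values that are used by more than one sample."""
    counts = Counter(sample_alias_by_id.values())
    return sorted(alias for alias, n in counts.items() if n > 1)


# Hypothetical sample-id -> sample_alias mapping (not MIntO's real metadata format).
metadata = {"S1": "patient_A", "S2": "patient_B", "S3": "patient_A"}

duplicates = find_duplicate_aliases(metadata)
if duplicates:
    print("Non-unique sample_alias values:", ", ".join(duplicates))
```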

Software and database version upgrades

MetaPhlAn updated to v4.1.1
mOTUs updated to v3.1.0

What's Changed

Update gene_annotation.smk by @CJREID in #57
Handle unassigned/unclassified taxa in metaphlan and motus by @microbiomix in #58
Bugfix: eggnog db version by @jszarvas in #60
Gene annotation in batches; Made it easier to skip QC and start MIntO with assembly or mapping to refgenome by @microbiomix in #61
BUGFIX: Gene/function abundance/expression plots mislabels by @microbiomix in #62
Fixed issues with 'MERGE_ILLUMINA_SAMPLES' directive; and other minor improvements by @microbiomix in #63
Updated example yaml files; Improved search for runs per sample. by @microbiomix in #64
Cluster-friendly implementation of bwa index to automatically distribute files to nodes by @microbiomix in #65
Moved taxonomical annotation versions into gene_annotation by @jszarvas in #66
Parallelized the bottleneck step of 'bedtools multicov' by @microbiomix in #67
Using checkpoints to create assembly batches for depth calculation by @microbiomix in #68
BWA index files are made as regular shadow rule and rsync'ed to nodes by @microbiomix in #69
Replaced 'bedtools multicov' with 10X faster 'samtools bedcov' by @microbiomix in #70
Saving space during binning - by gzipping files and ignoring useless columns by @microbiomix in #71
Avoiding too many zcats and gzips by @microbiomix in #72
Removed 'this.path' and 'include' dependencies by @arumugamlab in #73
Speed-ups, additional input method, change in batches for binning and added raw read QC by @jszarvas in #74
Cleanup of bwa index mirrors and automatic estimation of vamb memory usage by @microbiomix in #75
Config parser module and other improvements by @microbiomix in #77
Minor changes and bugfixes by @jszarvas in #78

New Contributors

@CJREID made their first contribution in #57
@arumugamlab made their first contribution in #73

Full Changelog: 2.2.0...2.3.0

Contributors

CJREID, arumugamlab, and 2 other contributors

13 Aug 16:33

microbiomix

2.2.0

3dad101

MIntO 2.2.0 stable release

We have added a couple of new functionalities (see Newly supported functionalities below). This required moving some config variables from one yaml file to another. Considering this a not-so-small change in behavior, we have upgraded MIntO to v2.2.0.

We also fixed a lot of bugs since v2.1.0. Outputs are still identical to v2.1.0 within 6-decimal precision, so these bugs were not producing wrong results. But v2.2.0 will give you a smoother experience.

Making a fresh installation?

Please see instructions in our Wiki page for installation

Upgrading from 2.0.0 or 2.1.0?

Please run the following in the directory where you cloned MIntO from GitHub.

git pull
git checkout tags/2.2.0

Newly supported functionalities

Taxonomic profile outputs from mapping reads to genomes in MAG and refgenome modes.
- A standardized tabular output listing genomes and their taxonomic annotations together with their quantitative profiles. Users can process them to generate e.g., species profiles by grouping genomes from the same species.
Co-assembly of samples using a newly defined variable COAS_factor. If it is absent, still defaults to older behavior of coassembling samples using MAIN_factor.
Capable of handling sample names containing underscores and other special characters. But please try to avoid the more troublesome special characters (e.g., single or double quotes, spaces, percent signs) in sample names.
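
As an illustration of the species-level grouping mentioned for the taxonomic profile outputs, here is a minimal sketch that sums genome abundances per species. The column names (genome, species) and the example rows are assumptions made for this sketch; check the header of the actual MIntO output before adapting it.

```python
import csv
import io
from collections import defaultdict


def species_profile(tsv_text, id_col="genome", species_col="species"):
    """Sum per-genome abundances into per-species abundances for every sample column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    sample_cols = [c for c in reader.fieldnames if c not in (id_col, species_col)]
    totals = defaultdict(lambda: dict.fromkeys(sample_cols, 0.0))
    for row in reader:
        for sample in sample_cols:
            totals[row[species_col]][sample] += float(row[sample])
    return {species: dict(values) for species, values in totals.items()}


# Hypothetical table layout: genome ID, species annotation, then one column per sample.
example = "\n".join([
    "genome\tspecies\tS1\tS2",
    "MAG1\tBacteroides uniformis\t0.25\t0.5",
    "MAG2\tBacteroides uniformis\t0.25\t0.0",
    "MAG3\tPrevotella copri\t0.5\t0.5",
])

for species, values in species_profile(example).items():
    print(species, values)
```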

Changes/bug-fixes

Moving taxonomic annotation
- Adding taxonomic annotations to the genome profiles required that we move taxonomic annotation of genomes from mags_generation.smk to gene_annotation.smk
- The above change also meant that configuration parameters for taxonomy moved from one yaml file to another.
Removed BINSPLIT_CHAR from configuration
- In order to support handling of underscore characters in sample names, we had to take full control of how sample names and contig names are separated in fasta headers. Therefore, we removed the ability of users to configure BINSPLIT_CHAR. Not that someone would need to - we never needed to tinker with it. Hopefully, you won't even notice this change.
Simplified MAG/refgenome profile filenames
- They are now more intuitively gene_abundances, genome_abundances, and contig_abundances. This also avoids having genome as a suffix somewhere later in the name.

What's Changed

Bugfix number of lines in concatting individual annotation files by @jszarvas in #53
Optional COAS_factor, more flexible sample names and tax. profile plotting by @jszarvas in #54
Taxonomic profiles from read-mapping in MAG/refgenome modes; other bug-fixes and improvements by @microbiomix in #55
Upgraded test script to use v2.2.0 when released by @microbiomix in #56

Full Changelog: 2.1.0...2.2.0

Contributors

microbiomix and jszarvas

05 Jul 12:36

microbiomix

2.1.0

4177a7a

MIntO 2.1.0 stable release

MIntO has gone through some more major improvements to handle large studies with 1000+ shotgun metagenomes. While version 2.0.0 focused on reducing runtime and memory for all the steps up to MAG generation, version 2.1.0 dramatically reduces resource requirements for the gene/functional profiling steps, without changing the outputs (within 6-decimal precision), except where we fixed bugs (see Changes/bug-fixes below). Profiling 100s of samples against MAGs/genomes with millions of genes is now possible, and efficiently so, whereas analyzing them with 2.0.0 would run out of memory or reach data.frame's built-in limits.

NOTE:

While MIntO can handle large studies, processing a gene profile table with 1000 samples and 15M genes will require lots of memory. In our hands, analysing such a study demanded 250 GB of memory at peak usage, but finished within 15 minutes.

There are some minor changes to the fields in configuration files, but most are still backwards-compatible with configuration files from version 2.0.0.

Making a fresh installation?

Please see instructions in our Wiki page for installation

Upgrading from 2.0.0?

Please run the following in the directory where you cloned MIntO from GitHub.

git pull
git checkout tags/2.1.0

Software improvements

More user-friendly
- Installation is further improved.
- Example script has further improvements, made available as run_IBD_test.sh. Do check it out!
Improved efficiency and speed
- Migrated from data.frame to data.table in R scripts.
- Large-scale pandas-based table-merging using python is replaced with faster data.table-based merging using R.
- eggnog-mapper annotation now uses in-memory databases via --dbmem option.
- Writing phyloseq objects out using the much faster qsave() rather than saveRDS().
Smaller disk footprint
- Many useless intermediate files are not stored anymore.
- Several unnecessary columns from GFF and BED files are removed.
- Sample-level gene profile files used to keep all GFF columns in front of the actual abundance information. Now only the gene ID is kept.
- Multi-sample profiles used an intermediate file with the sample id in the column header, which was then remapped to a final file with the sample_alias in the column header. For large studies, these files are several GBs each. This intermediate file is no longer used.
- MAG IDs have been dropped from the gene IDs. Gene IDs looked like <MAG>|<GENE> before, but since <GENE> included the unique locus tag from prokka, we decided to use that as a proxy for MAG IDs during calculations. If users need to know where a gene comes from, they can look at the file mapping locus tags to MAG IDs at: DB/MAG/1-prokka/locus_id_list.txt or DB/refgenome/1-prokka/locus_id_list.txt
- Tutorial output folder reduced from 269 MB to 252 MB (6% reduction).
- In a study with 5500 MAGs and 12.7M genes, single-sample profile files went from 525 MB (uncompressed, with 0 counts, 12.6M lines) to 19 MB (gzipped, without 0 counts, 3.4M lines). Projected savings for 100 samples will be about 50 GB.

Changes/bug-fixes

Functional database naming scheme
- We admit: the naming in functional annotation was confusing. KEGG annotations from eggnog-mapper were called KEGG_KO, KEGG_Module, and KEGG_Pathway; while those from kofamscan were called kofam_KO, kofam_Module, and kofam_Pathway. This was not ideal. While this may not be a big deal for many, users would expect the KEGG_Module annotation to come from an official KEGG resource, i.e., kofamscan.
- We have now streamlined functional annotations using <software>.<functional-category> as the grammar for functional category names. Available functional categories are: eggNOG.OGs, eggNOG.KEGG_KO, eggNOG.KEGG_Pathway, eggNOG.KEGG_Module, eggNOG.PFAMs, kofam.KEGG_KO, kofam.KEGG_Pathway, kofam.KEGG_Module, dbCAN.module, dbCAN.enzclass, dbCAN.subfamily, dbCAN.eCAMI_subfamily, dbCAN.eCAMI_submodule, dbCAN.EC.
- E.g., KEGG_KO annotations can be obtained from both eggNOG-mapper and kofamscan, which means that we have two versions: eggNOG.KEGG_KO and kofam.KEGG_KO. If you enable both eggNOG and kofam in functional annotation, MIntO will also make an additional category called merged.KEGG_KO, which merges the two KEGG_KO annotations.
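
The <software>.<functional-category> grammar is easy to handle in downstream scripts. As a minimal sketch (the helper name split_category is hypothetical and not part of MIntO), category names can be split at the first dot to see which sources provide a given functional category:

```python
def split_category(name):
    """Split '<software>.<functional-category>' at the first dot."""
    software, sep, category = name.partition(".")
    if not sep or not software or not category:
        raise ValueError(f"not in <software>.<functional-category> form: {name!r}")
    return software, category


# A few category names taken from the list above, plus the merged KEGG_KO category.
categories = ["eggNOG.KEGG_KO", "kofam.KEGG_KO", "dbCAN.EC", "merged.KEGG_KO"]

sources_by_category = {}
for name in categories:
    software, category = split_category(name)
    sources_by_category.setdefault(category, []).append(software)

print(sources_by_category)
```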
MIntO mode names
- The three modes are now named: MAG, refgenome or catalog. Output directories are named after the modes.
- They are specified using MINTO_MODE field in the yaml file.
- 9-mapping-profiles and DB directories will have subdirectories named by the MIntO mode (e.g., 9-mapping-profiles/MAG, DB/refgenome).
- output/data_integration will have subdirectories called <MODE>-genes, since the profiling is based on genes (e.g. MAG-genes).
- Old names (e.g., reference_genome, db_genes) are still accepted for backwards-compatibility, but their corresponding output directories will be refgenome and catalog, respectively.
Tabular files are strictly TSVs
- All tabular output files are now tab-separated .tsv files. It used to be a mix of .csv and .tsv files, sometimes with the wrong file extension.
MIN_mapped_reads bug fixed
- MIN_mapped_reads = N in mapping.yaml file (input to gene_abundance.smk) was wrongly interpreted as > N in refgenome and MAG modes. This has now been fixed. This also means that outputs are not identical between v2.0.0 and v2.1.0+, but the v2.1.0+ results are more accurate.
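
To make the difference concrete, the sketch below contrasts an inclusive threshold with the strict "> N" behaviour described for the bug. It assumes that the intended meaning of MIN_mapped_reads = N is "at least N mapped reads"; the function names and example counts are hypothetical and this is not MIntO's actual filtering code.

```python
def passes_min_mapped_reads(read_count, min_mapped_reads):
    """Inclusive threshold (assumed intended behaviour): exactly N reads is enough."""
    return read_count >= min_mapped_reads


def passes_strict(read_count, min_mapped_reads):
    """Strict threshold, i.e. the '> N' interpretation described for the bug."""
    return read_count > min_mapped_reads


# Hypothetical per-gene mapped-read counts.
counts = {"geneA": 1, "geneB": 2, "geneC": 3}
N = 2

kept_inclusive = [g for g, c in counts.items() if passes_min_mapped_reads(c, N)]
kept_strict = [g for g, c in counts.items() if passes_strict(c, N)]

print("inclusive (>= N):", kept_inclusive)
print("strict (> N):", kept_strict)
```

The two variants differ only for genes with exactly N mapped reads (geneB here).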
Gene coordinate bug fixed
- BED files were inadvertently changed to have 1-based coordinates like in GFF files, leading to wrong output from bedtools multicov. This has now been fixed. This also means that outputs are not identical between v2.0.0 and v2.1.0+, but the v2.1.0+ results are more accurate.
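
For reference, GFF uses 1-based, inclusive coordinates, while BED uses 0-based, half-open coordinates. A minimal conversion sketch follows; the function name is hypothetical and this is not MIntO's own conversion code.

```python
def gff_to_bed_interval(gff_start, gff_end):
    """Convert a 1-based inclusive GFF interval to a 0-based half-open BED interval."""
    if gff_start < 1 or gff_end < gff_start:
        raise ValueError(f"invalid GFF interval: {gff_start}-{gff_end}")
    # Only the start shifts by one; the end keeps its value because the BED end is exclusive.
    return gff_start - 1, gff_end


bed_start, bed_end = gff_to_bed_interval(1, 300)
print(bed_start, bed_end, "length:", bed_end - bed_start)
```

With BED coordinates the feature length is simply end minus start, which is a quick sanity check that a file has not kept GFF-style starts.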

Software and database version upgrades

dbCAN updated to v4.1.4, still maintaining HMMdb v12 (based on CAZyDB 7/26/2023). See info here.
MetaPhlAn remains in v4.0.6 but its database has been updated to 'mpa_vJun23_CHOCOPhlAnSGB_202403'.
PhyloPhlAn updated to v3.1.1 and SGB database to Jun23, so both PhyloPhlAn-annotated MAG-based profiles and MetaPhlAn profiles will map to the same SGB features.
GTDB-Tk has been updated to v2.4.0 and GTDB release has been updated to r220.

Newly supported functionalities

Versioning of software and database components in directory output/versions
- All software/package versions are registered. This is useful when writing up the methods section of a manuscript.
- All functional/taxonomic annotation software versions and annotation file timestamps are also registered.
Bulk input of fastq files using a tsv file that maps fastq files to sample ids.

Improved gene/functional profiling

Gene profile output will not list genes with metaG==0 or metaT==0 across all samples. This also includes GE.tsv where metaG==0 leads to Inf and metaT==0 leads to 0.
Function expression profiling (derived from ratio of functions in metaT data and metaG data) will now report features where some (but not all) samples have 0 abundance in metaG.
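
The gene-level filtering can be sketched as follows. This is one reading of the rule above, not MIntO's implementation (which is R-based according to these notes): a gene is dropped from a profile when it is zero in all samples, and a gene is kept for expression only if it survives in both the metaG and metaT profiles. The data and names are hypothetical.

```python
def drop_all_zero(profile):
    """Keep only genes with a non-zero value in at least one sample."""
    return {gene: values for gene, values in profile.items()
            if any(value != 0 for value in values)}


# Hypothetical gene x sample tables (two samples each).
metaG = {"g1": [5.0, 0.0], "g2": [0.0, 0.0], "g3": [2.0, 1.0]}
metaT = {"g1": [1.0, 3.0], "g2": [4.0, 0.0], "g3": [0.0, 0.0]}

kept_metaG = drop_all_zero(metaG)  # g2 is zero everywhere in metaG
kept_metaT = drop_all_zero(metaT)  # g3 is zero everywhere in metaT

# Assumed rule for the expression table: the gene must be present in both profiles.
ge_genes = sorted(set(kept_metaG) & set(kept_metaT))
print(ge_genes)
```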

What's Changed

Update to MAG dereplication with CoverM v0.7 by @jszarvas in #17
FracMinHash clustering by @jszarvas in #18
ensure smash_plot_output() is not None by @cpauvert in #20
GTDB support, bwa2 index memory, and kofamscan-speedup by @microbiomix in #22
Bugfix umap min. neighbors by @jszarvas in #23
Gene table normalization made faster and more efficient; plus some bugfixes and improvement by @microbiomix in #24
Updated GTDB db location and some more improvements by @microbiomix in #25
MAG taxonomy bug by @microbiomix in #27
Update to dbCAN 4.1 and switch to individual processing of MAGs by @jszarvas in #26
Function profiling enabled for large number of MAGs by @microbiomix in #28
Bugfixes and speed-ups for gene_annotation and mags_generation by @jszarvas in #29
New minor features; updated yaml file examples; bug fixes by @microbiomix in #30
Function profiling enabled for large number of MAGs by @microbiomix in #31
Renamed functional categories to be more structured and logical by @microbiomix in #32
Minor improvements by @microbiomix in #33
Update collate script and bugfixes by @jszarvas in #34
Bringing back explicit arguments to annotation collation script by @microbiomix in #35
Bringing back lost kofamscan runtime fix (d1fe76a) by @microbiomix in #36
Several improvements in annotation and profiling - outputs remain unchanged by @microbiomix in #37
Gene/functional profiling - Saving disk space and runtime by further optimization by @mi...

Contributors

cpauvert, microbiomix, and jszarvas

29 Jan 09:28

microbiomix

2.0.0

c5780ea

MIntO 2.0.0 stable release

MIntO has gone through major improvements. We have gone straight to version 2.0.0 as the directory structures and configuration field names have changed dramatically. This means that old configuration files are not compatible with version 2.0.0 onwards.

Software improvements

More user-friendly:
- Installation is even more simplified.
- Example script to run the entire tutorial, which we have tested extensively, made available as run_IBD_test.sh. Do check it out!
- Example TruSeq and MGIEasy sequencing adapters included in distribution.
Improved efficiency and speed:
- Use of shadow directives lets Snakemake work on local disk as much as possible. This means --shadow-prefix is a mandatory argument to Snakemake now.
- Flow of processes (rules) optimized so that there is no heavy I/O between rules using NFS.
- Increased use of pipes rather than files.
- Bottleneck steps that summarized all samples at the same time (e.g. bedtools multicov, and coverm contig) have been replaced with parallel runs with single samples and pandas-based table-merging.
- Bottleneck steps that serially profiled multiple functional databases inside one process (e.g. functional quantification) have been replaced with parallel runs with individual databases.
Code optimization:
- Embraced Snakemake style input-output connection via wildcards in nearly all rules rather than bulk outputs.
- Removed several obsolete scripts and files.
- Unified multiple mode-specific normalization scripts into a global script.
- Unified multiple mode-specific functional quantification scripts into a global script.
- Descriptions of eggNOG, KEGG, dbCAN functional databases are automatically downloaded and set up when running dependencies.smk. This lets the user get matching databases and descriptions at the time of installation, e.g. KEGG orthology definitions and descriptions.

Software version upgrades

SPAdes updated to 3.15.5.
MetaPhlAn updated to v4.0.6.
mOTUs updated to v3.0.3.
Genome binner updated to AVAMB v4.1.3.
Genome completeness check updated to CheckM2 v1.0.1.
dbCAN updated to dbCAN3 HMMdb v12 (based on CAZyDB 7/26/2023). See info here.

Newly supported functionalities

New script QC_0.smk to enable reanalysis of publicly available datasets.
Taxonomic profiling, assembly and MAG generation now enabled for metaT.
Taxonomy profiles are tagged with profiler version, so that upgrading a profiler does not overwrite existing results.
External reference genomes can be given as just .fna files and MIntO will perform genome annotation.
Replicate samples can be merged post QC into a single sample using MERGE_ILLUMINA_SAMPLES section in the YAML file.
It is now possible to provide custom kmer-sizes for MEGAHIT coassembly.
MAGs are made using AVAMB with a combination of VAE and AAE, as recommended by AVAMB developers.
VAMB GPU mode can be enabled if the server where dependencies.smk is run for installation has CUDA drivers installed. (Automatically generated mags.yaml still disables GPU mode to avoid issues. User has to manually enable it before running the MAG generation step.)

Improved functional quantification

MG normalization (only available in refgenome or MAG mode)

We have updated how to quantify functional categories under marker gene (MG) normalization. Quantification is a little bit more involved now.
Weighted sum of genes belonging to a given functional unit (e.g. K12345), weighted by the relative abundance of the MAG/refgenome it belongs to.
Weights for metaG/metaT gene profiles are derived from relative abundance from metaG/metaT space, respectively.
The way to interpret this: If each species in the community carries the functional unit and expresses it at the same level as marker genes, it will be 1.0.
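
The description above can be read as the following weighted sum. This is a minimal sketch under stated assumptions, not MIntO's implementation: it assumes gene values are already MG-normalized (1.0 means "same level as the marker genes of that genome") and that genome relative abundances sum to 1. All names and numbers are hypothetical.

```python
def function_abundance(gene_values, gene_to_genome, genome_relabund, gene_to_functions):
    """Sum gene values per functional unit, weighting each gene by the
    relative abundance of the genome it belongs to."""
    totals = {}
    for gene, value in gene_values.items():
        weight = genome_relabund[gene_to_genome[gene]]
        for func in gene_to_functions.get(gene, ()):
            totals[func] = totals.get(func, 0.0) + weight * value
    return totals


# Hypothetical community of three genomes, each carrying one K12345 gene
# present at the same level as its marker genes (MG-normalized value 1.0).
genome_relabund = {"A": 0.5, "B": 0.3, "C": 0.2}
gene_to_genome = {"a1": "A", "b1": "B", "c1": "C"}
gene_values = {"a1": 1.0, "b1": 1.0, "c1": 1.0}
gene_to_functions = {"a1": ["K12345"], "b1": ["K12345"], "c1": ["K12345"]}

result = function_abundance(gene_values, gene_to_genome, genome_relabund, gene_to_functions)
print(round(result["K12345"], 6))
```

Under these assumptions the example reproduces the interpretation given above: every species carries the function at marker-gene level, so the functional unit comes out at 1.0.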

TPM normalization mode

We have updated how to quantify functional categories under TPM normalization.
Quantification is straightforward: sum up all genes mapped to a given functional unit (e.g. K12345).
This also means that the sum of a functional profile is not bounded by 1 million anymore. Since each gene can map to multiple KEGG KOs, for example, the total might exceed one million. We have decided not to renormalize to 1 million, as this would unfairly alter abundances when more promiscuous genes are present.
The way to interpret this: Given a million transcripts, this is the number of transcripts representing this function. This is more comparable across samples.
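
A minimal sketch of this summation, showing why the total can exceed one million when a gene maps to more than one functional unit. The gene names, KO identifiers and TPM values are hypothetical, and this is not MIntO's own code.

```python
def sum_by_function(gene_tpm, gene_to_functions):
    """Sum TPM values of all genes mapped to each functional unit.
    A gene mapped to several units contributes its full value to each of them."""
    totals = {}
    for gene, tpm in gene_tpm.items():
        for func in gene_to_functions.get(gene, ()):
            totals[func] = totals.get(func, 0.0) + tpm
    return totals


# Hypothetical two-gene sample whose gene-level TPMs sum to exactly one million.
gene_tpm = {"g1": 600000.0, "g2": 400000.0}
# g1 is "promiscuous": it maps to two functional units.
gene_to_functions = {"g1": ["K00001", "K00002"], "g2": ["K00002"]}

profile = sum_by_function(gene_tpm, gene_to_functions)
print(profile)
print("total:", sum(profile.values()))
```

Here the functional profile sums to 1.6 million because g1 is counted under both units, which is the effect that is deliberately left unnormalized.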

What's Changed via PRs

Removed pip dependencies for easier installation by @askerdb in #1
Update to remove pip dependencies by @askerdb in #2
Updated to remove pip dependencies by @askerdb in #3
Switch Flye to use conda by @jszarvas in #10
Made light rules local by @jszarvas in #11
Alternative qc and trimming for external data (QC_0) & custom megahit settings by @jszarvas in #12
Simple slurm cluster congfiguration by @jszarvas in #14
Bugfix feature custom megahit by @jszarvas in #13

New Contributors

@askerdb made their first contribution in #1
@jszarvas made their first contribution in #10

Full Changelog: 1.0.1...2.0.0

Contributors

askerdb and jszarvas

07 Jun 16:30

arumugamlab

2.0.0-beta.1

4e3d8b6

MIntO 2.0.0 first beta version

Pre-release

Improvements:

Much faster due to several optimizations under the hood.
More user-friendly:
- Installation is even more simplified.
- Example script to run the entire tutorial, which we have tested extensively.

Software version upgrades:

MetaPhlAn updated to v4.0.6.
mOTUs updated to v3.0.3.
Genome binner updated to AVAMB v4.1.1.
Genome completeness check updated to CheckM2 v1.0.1.

Newly supported features:

Taxonomic profiling, assembly and MAG generation now enabled for metaT.
Taxonomy profiles are tagged with profiler version, so that upgrading a profiler does not overwrite existing results.
External reference genomes can be given as just .fna files and MIntO will perform genome annotation.
Replicate samples can be merged for post analysis.
MAGs are made using AVAMB with a combination of VAE and AAE, as recommended by AVAMB developers.
VAMB GPU mode can be enabled if the server where dependencies.smk is run for installation has CUDA drivers installed. (Automatically generated mags.yaml still disables GPU mode to avoid issues. User has to manually enable it before running the MAG generation step.)

20 Apr 08:30

carmen-saenz

1.0.1

75e10c8

MIntO 1.0.1

Minor update that makes it possible to set a minimum number of reads mapped to a gene, applied prior to the normalization step, for the gene to be considered “expressed”:

MIN_mapped_reads: Filter the aligned reads by establishing the minimum number of mapped reads to a gene. This parameter is included in mapping.yaml, when running gene_abundance.smk.
msamtools has been updated to v1.1.0

Additionally, Flye has been updated to v2.9.

Full Changelog: 1.0.0...1.0.1

18 Mar 18:59

carmen-saenz

1.0.0

b7858d0

MIntO 1.0.0

First publicly available release.
