Releases: soedinglab/MMseqs2

MMseqs2 Release 17-b804f

18 Jan 14:36

MMseqs2 Release 17 is mostly a bug fix release. Highlights include usability improvements in MMseqs2-GPU and a fix for a common crash in the prefilter that was affecting many clustering runs.

New Features and Enhancements

  • result2profile can print frequencies in TSV format (c2c3ad9)
  • add a new masking mode --mask-n-repeat (c2c3ad9)
  • Improve GPU clients in server mode to wait for databases to be loaded (e095774)
  • GPU server now takes CUDA_VISIBLE_DEVICES into account. (b804fbe)
  • Reduced glibc requirements for precompiled MMseqs2-GPU binaries to 2.17 (i.e. CentOS 7; db8ad2d)

Bug Fixes

  • Segmentation fault in easy-cluster starting in #916 (dc7f8ad)
  • GPU version generated corrupted sequence outputs #912 (e3b16fa)
  • Sequences starting with * could break Sequence mapping #927 (492297b)
  • Indexes without k-mer index are masked now (4766f92)
  • Invalid taxids check in majoritylca does not abort the whole process (8d17137)
  • Merged taxID larger than any taxid in nodes.dmp could corrupt memory #931 (fd37b37)

Developer Notes

  • Export NATIVE_ARCH in cmake (17cd5c0)

MMseqs2 Release 16-747c6

26 Nov 13:16

MMseqs2 Release 16 introduces support for GPU-accelerated searches [1]. Additionally, we fixed numerous bugs and relicensed MMseqs2 under the MIT license.

[1] Kallenborn F, Chacon A, Hundt C, Sirelkhatim H, Didi K, Dallago C, Mirdita M, Schmidt B, Steinegger M: GPU-accelerated homology search with MMseqs2. bioRxiv (2024).

Breaking Changes

  • Custom substitution matrices (--seed-sub-mat, --sub-mat) are not supported in this release. Only the built-in matrices will work. We will restore support in the next release. (93b2d94)

New Features and Enhancements

  • Added GPU support to MMseqs2, allowing for faster computations of sensitive alignments on CUDA-compatible hardware on the Turing generation or newer (a66ad0c, 81171a5, 1806c0c)
  • Added full-length six-frame translated search with --translation-mode 1 (#885)
  • Implement qframe and tframe output fields in convertalis (#615, #803, 417f22f)
  • Allows resuming of interrupted downloads in databases and createtaxdb (0b27c9d)
  • MMseqs2 taxonomy now always keeps at least the longest open reading frame within each input sequence after fragment elimination (#832, 5b4c816)
  • Added option to not compress outputs in tsv2exprofiledb (a146887)
  • filterdb has learned a new sort mode (--sort-entries 4 --weights file) to sort by priority (54f8983)
  • Updated tantan (3e53eee)

Bug Fixes

  • prefilter could use excessive memory and crash for highly redundant databases (950342d)
  • prefilter was not properly evaluating the last potential hit, increases sensitivity of k-mer prefilter slightly (06f7429)
  • result2msa works correctly with clustered databases (78ae2c5)
  • Fixed ppos output field calculation in convertalis (fb38b7d, 816c5c9)
  • Fixed wrong coverage being passed to realignment (6267ffb)
  • Fixed --taxon-list being broken in multi-threaded prefilter and ungappedprefilter (804bb2a)
  • Fixed segmentation fault in prefilter (#872, a64d60a, ef2ebe9)
  • Fixed inconsistent ordering issue in createclusearchdb (b59ad53)
  • Corrected backtrace in SAM output for nucleotide-protein alignments and show reverse complement sequence correctly (#845, 5f23f1f)

Developer Notes

  • Disabled nedmalloc due to an OpenMP crash in Cygwin (c498f51)
  • Breaking changes in how (sub)project command initialization works (1c08685, af2cc52)
  • Removed gzstream (111d893)
  • Breaking fix for parameter singleton in subprojects (5c6e32c)
  • Export MMSEQS_ARCH in CMakeCache for subprojects to use (48f13f9)

MMseqs2 Release 15-6f452

31 Oct 09:22

MMseqs2 Release 15 brings efficient single query searches with low memory overhead through the new ungapped-prefiltering mode (--prefilter-mode 1). We also improved our greedy clustering algorithm and added a large swath of smaller fixes and features. Thanks to all contributors for their vital contributions and fixes.

Breaking

  • Updated greedy cluster algorithm. The clustering picks better representatives to respect the sequence identity and coverage criteria. (2568829) Thanks @bbuchfink

New Features and Enhancements

  • Implement additional prefilter modes (standard double k-mer prefilter, ungapped prefilter, exhaustive searching) (5e119e9)
  • Added createclusearchdb and mkrepseqdb modules to build cluster-search databases. This was implemented for Foldseek; cluster-search in MMseqs2 will be implemented at a later point (9ae4458, 80f8b0b, 542f362, ad6dfc6, 91f2a6a, 8310cd6, 0019026, 76b7df1)
  • Implement target-side similar k-mer search mode for sequence-sequence prefiltering (71dd32e)
  • Rework ungappedprefilter to improve performance and expose additional parameters such as taxon filtering and db-load-mode to ungappedprefilter (8a89305, 800eb09, eb01b5b, 20d3afc)
  • Added gappedprefilter module for Smith-Waterman prefiltering, similar to ungappedprefilter (df77d9e)
  • Reworked pairaln for the ColabFold greedy taxonomy pairing mode (1514015)
  • Implemented experimental module for A3M filtering (167bbd1, 499bb73)
  • Implemented weighted clustering (bd080e6, b36070a, fd1837b) Thanks @AnnSeidel
  • Precomputed indices without k-mers can be created with --index-subset (314c1f0, 8fe3bf9)
  • Add result2neff module to extract Neff scores (4148e09) Thanks @neftlon
  • Add ppos format-output to convertalis for count of positive substitution scores (5edc79b) Thanks @Dohyun-s
  • Speed-up FASTA parsing in kseq.h with memchr (98406dd) Thanks @valentynbez @kloetzl

Bugfixes

  • Add min and max modes for result2stats (19dce03, 61e7734) Thanks @ClovisG
  • Fixed a segmentation fault in ca3m with the same database (f5f780a) Thanks @ClovisG
  • Fix crash when some input file sizes are an exact multiple of 4096 in convertalis and gff2db (712f288) Thanks @RuoshiZhang
  • Fixed issues for GTDB r214 database creation (4b52296) Thanks @apcamargo
  • Fix source number being limited to 16-bit (65k) (1d62fa0)
  • kseq now correctly handles input sequences larger than 2^31 bytes (07ca4a7)
  • Fixed unpackdb to work without a .lookup file and added support for writing compressed files (92d8cc3, 570e3ed)
  • createindex --check-compatible checks the k-mer threshold correctly now (bb0a1b3)
  • Fixed excessively long prefilter result lists leading to result truncation. This was primarily a Foldseek issue and shouldn't affect MMseqs2 (ed4c55f)
  • Corrected handling of multiline checks in createdb (6b93884)
  • Fix crash by disabling wrapped scoring when the target sequence is shorter than the query (8459b6b) Thanks @AnnSeidel
  • Fixed logic in reciprocal-best-hit by removing resAB_sort (3bcbdba) Thanks @StephanieSKim
  • Corrected handling of differently ordered parts of sequence databases in concatdbs (ea17d30)
  • Fix --single-step-clustering misspelled in cluster warning (fa6c093) Thanks @valentynbez

Build and Compatibility Updates

  • Addressed build and compatibility issues, including updates for newer compilers and architectures (e.g., Mac ARM64) (e26b9ad, 3e43617, b341b66, 932d32b) Thanks @A-N-Other
  • Added Mac ARM64 support in GitHub actions and updated from Ubuntu 18.04 to a newer image (1fea43d, 05132de)
  • Updated regression testing to fix errors in MPI test (2113766)

Developer

  • Introduced base: prefix to enable inheriting subprojects to find shadowed modules (i.e. Foldseek shadows createdb, but can use base:createdb to use MMseqs2's version) (90aa913)
  • Exported build architecture in CMake so subprojects can use it (fce06b1)

MMseqs2 Release 14-7e284

13 Oct 12:31
7e28409

This is a major release containing features implemented for ColabFold, Foldseek, MMseqs2 profile-profile (not published yet, and still in preview) and many bugfixes. Thanks a lot to the contributors who submitted bug fixes.

If you are using the Docker Hub based MMseqs2 containers, please switch to the new GitHub Container Registry based ones. The Docker Hub containers will not be maintained in the future.

Breaking

  • Profile databases created by previous MMseqs2 releases won't work anymore with this release. Please recreate them from previous search results or MSAs with result2profile or msa2profile.
  • Profile k-mer threshold parameters were fitted to the new pseudo-count parameters (--pca, --pcb). Previous --k-score parameters will have differing sensitivity. However, most users will have set -s instead, which was fitted to match as closely as possible.

Features

  • gff2db now should actually work correctly after refactoring (488df86, thanks @RuoshiZhang)
  • result2msa now supports reading from precomputed index
  • Add db2tar: Create a tar file from a database
  • Add parsable columnar tsv output to databases with --tsv
  • Add taxonomic filtering during prefilter with --taxon-list
  • Add --comp-bias-corr-scale to adjust the weight of the compositional bias correction
  • Add --mask-prob parameter to adjust tantan's masking threshold
  • Add context specific pseudo-counts to result2profile
  • Add iterative profile-profile search workflow (thanks @haydenji0731)
  • Add support for profile-profile scoring in striped Smith-Waterman algorithm (thanks @haydenji0731)
  • Add support for gap-open/gap-close costs to striped Smith-Waterman algorithm (thanks @hgsommer)
  • Add environment variable MMSEQS_IGNORE_INDEX to ignore an existing precomputed index
  • createsubdb and view can now return results from identifiers in .lookup with --id-mode 1
  • Change compressdb loop to omp static to keep order
  • Improvements to nucleotide alignments and scoring (thanks @AnnSeidel)

Features built for ColabFold now available in MMseqs2

  • Add pairaln: taxonomic pairing on sequences for MSA building (9a0df0d, 5e245d1, 3f8695e, 3e92abf, edb8223, e19df7c)
  • Add A3M support to result2msa (--msa-format-mode 5)
  • Add A3M support with alignment information to result2msa (--msa-format-mode 6)
  • result2profile allows --diff 0
  • Make taxonomy mapping mmap'able for (near) instant read-in
  • Add workflow to create expandable profile (profile-profile) db from TSVs tsv2exprofiledb
  • Enable result2profile/filterresult to read new expand alignment index
  • Add support to filter MSAs in buckets filterresult, result2profile
  • Add --filter-min-enable to enable filtering only above a minimum threshold of hits (c6d8ae0)
  • Expand can filter in each target cluster before expanding (75af0c8, 85ce847)

Bugfixes

  • summarizeresult was rejecting hits that match the coverage threshold exactly (#586, 67949d7)
  • Don’t use reserved filename characters in unpackdb (#467, c663497 thanks @cutecutecat)
  • Fix typo (violoations -> violations) (#526, 74c3aa6, thanks @Benjamin-Lee)
  • Fix potential endless loop in rescorediagonal
  • Fix prefilter/alignment with 0-size query input #433
  • Fix unpackdb parameter parsing issue
  • Make sure FILTER_RESULT variable is always correctly set for exhaustive search (d4a3354)
  • tar2db breaking with --tar-include/exclude (#561)
  • Wrong database name printed for variadic input when creating a tmp directory
  • extractorfs sometimes loading invalid start/stop codons on non-avx2 platforms
  • Don't mask consensus sequences in profiles
  • result2msa correctly prints X residues
  • Allocate CSProfile only if it's going to be used (d873697)
  • Taxonomy db paths are now correctly found if given a precomputed index (8ff26f2)
  • Encode more strings internally as base64 if special characters are used (16b5774, d155586)
  • Disable broken iterative profile searches in taxonomy (#432)
  • Fixed a possible segmentation fault in align (thanks @rchikhi)

MMseqs2 databases

Speedup

  • Rework of result2msa to avoid allocating a lot of memory
  • Improvement of speed for ungapped alignment in prefilter
  • TaxonomyExpression is faster with a single tax identifier (8ff7279)

MMseqs2 subprojects

  • MMseqs2-based subprojects can use databases too (5afd33c)
  • Add appenddbtoindex: augment a precomputed index with other databases in sub-projects
  • Allow subprojects to build their own precomputed indices (a506d67)
  • Add support for external k-mer thresholds for the prefilter (fea8d20)
  • Subprojects can define their own DbType validators

Developers

  • Added CirrusCI to test FreeBSD and old compilers (a2e2129, 904d0c6, a09a704, 4f1996a, 482dedc, 16830a5)
  • MMseqs2 Docker containers are now published in the GitHub Container Registry (eb203d3, 5185d3c, ba4e11f)
  • Our microtar fork can write tar files again (dcd180b)
  • Add URIs as allowed parameter inputs (3b9cf88)
  • Additional s390x fixes (linclust might work now)
  • Add support for new MultiParameter type
  • Bundled SIMDe was updated (thanks @mr-c)

MMseqs2 Release 13-45111

24 Feb 11:08

New Taxonomy Workflow (new feature and breaking change)

We introduce a new taxonomy workflow for assigning taxonomic labels to nucleotide sequences by searching against protein reference databases. For details see:

Mirdita M, Steinegger M, Breitwieser F, Söding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. bioRxiv, doi: 10.1101/2020.11.27.401018 (2020)

The nucleotide-to-protein taxonomic assignment is now much faster and is optimized towards annotation of contigs. If you use MMseqs2 taxonomy to assign taxonomic labels to short reads, consider using the --orf-filter 0 parameter to disable the new filter stage as it can reject too many short query sequences. MMseqs2 is still considerably faster with this parameter set.

As our nucleotide-to-nucleotide taxonomic assignment does not support the 2bLCA assignment mode for stable lowest-common-ancestor computation, we previously set MMseqs2 to perform LCA assignment by top-hit (--lca-mode 4) as default. Approximate (see manuscript) 2bLCA is now again the default and we automatically switch to top-hit if given nucleotide-to-nucleotide input.

Breaking changes

  • --slice-search is now called --exhaustive-search
  • Unify --compress --summarize --omit-consensus (in result2msa) to --msa-format-mode

Features

  • Add GTDB and CDD to databases downloader #410
  • Add nrtotaxmapping to create taxonomy mapping from NR
  • Add unpackdb to split a database into separate files #406
  • Add majoritylca module for majority voting based taxonomy from alignment results
  • Add cpdb and lndb
  • Taxonomy information is stored in binary format (a single db_taxonomy file, instead of db_{named,nodes,merged}.dmp,db_mapping) to speed up read-in. Old format is still supported.
  • --exhaustive-search is usable with ungapped alignments (--alignment-mode 4)
  • Allow sequence/result database input in taxonomyreport #401/#408
  • msa2profile/result can skip the first sequence with --skip-query
  • createtaxdb can create a taxdb by mapping through .source in addition to .lookup (--tax-mapping-mode 1)
  • splitsequence can create a sequence database with original headers
  • align can return short cluster format if only identifiers are required --alignment-output-mode
  • tar2db can be used multi-threaded if input allows (e.g. .tar containing .gz files)
  • Encode species names in taxonomy blocklist to make sure we don't block random nodes (e.g. in GTDB)
  • Split non-index parts over additional files in split index case to reduce peak memory use
  • proteinaln2nucl can now compute scores and e-values
  • createdb can create a sequence database from a database containing fasta files (e.g. created by tar2db)
  • Add MMSEQS_FORCE_MERGE environment variable to force generating fully merged databases
  • Improved many descriptions, warnings and error messages

Bugs fixed

  • Fix filterresult off by one issue removing wrong sequences
  • Fix addtaxonomy always crashing due to invalid check #355
  • Reduce numbers of calls to posix_memalign to fix lock contention on macOS
  • extractorfs doesn't flood warnings due to short sequences anymore
  • expand2profile --pca is correctly set to 0
  • msa2profile always copies .lookup/source files instead of symlinking
  • Clustering of clustering input would not work with set-cover or connected-component
  • Short circuit --cluster-reassign if nothing can be reassigned
  • Fix temporary files not getting removed in linclust/cluster with --remove-tmp-files
  • Fix kmermatcher setting user k-mer pattern in auto k-mer selection and breaking
  • Krona taxonomyreport was not working if no sequence was unclassified
  • Make Matcher::resultToBuffer buffer sizes consistent (could crash with very long backtraces, needs further refactoring)
  • Fix multiple locations where Util::checkAllocation could never be called as it would have crashed before
  • Whitespace containing parameters do not break workflows anymore (e.g. passing whitespaces to --sub-mat)
  • taxonomyreport and addtaxonomy parameter were not adjustable in easy-taxonomy
  • E-value parameters are now correctly parsed as doubles instead of floats #379
  • Add symlinks to splitdb #376
  • Increase maximum number of open files in DBReader
  • Include file size and modified date of inputs in temporary file hash calculation #372
  • --cov-mode 5 was not working #371
  • Database downloader deals correctly with redirects now
  • result2profile could crash if target database contained much longer sequences than query database
  • Stop symlinking header database (and other ancillary files) in filterresult

Developer

  • Add vector of predefined substitution matrices to add additional matrices in subprojects
  • Don't create false _has_{builtin,attribute} macros (see simd-everywhere/simde#691 (comment))
  • Add USE_SYSTEM_ZSTD cmake flag to use system provided zstd #411
  • Replace texlive with tectonic for faster/prettier userguide
  • Add more instructions to simd.h
  • Add initial fixes to get MMseqs2 working on s390x (work in progress)
  • Prebuilt macOS binary is now a Universal Mac Binary supporting SSE, AVX and Apple Silicon NEON
  • Build ARM64/PPC64LE binaries by cross-compiling
  • Add missing licenses and READMEs for vendored libraries #403
  • Update ALP to 1.98
  • Update xxhash to v0.8.0

MMseqs2 Release 12-113e3

01 Sep 11:22

Breaking changes

  • Remove --add-internal-id parameter from result2msa
  • filterdb --shuffle now shuffles randomly instead of deterministically
  • Taxonomy expressions in filtertax(seq)db interpret , as || now #320
  • convertalis pident output field now correctly reports percentage (0-100) sequence identity instead of fraction (0.00-1.00), use fident to print the fraction instead
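
Scripts written against earlier releases that treated pident as a fraction need their thresholds rescaled, or can switch to fident. The following is a minimal sketch rather than anything shipped with MMseqs2: the helper name fraction_to_percent is hypothetical, and the database names in the commented command are placeholders.

```shell
#!/bin/sh
# Hypothetical helper (not an MMseqs2 command): rescale an identity threshold
# written as a fraction (0.00-1.00) to the percentage scale (0-100) that
# pident reports from this release on.
fraction_to_percent() {
    awk -v f="$1" 'BEGIN { printf "%.1f\n", f * 100 }'
}

# A threshold of 0.9 on the old scale corresponds to 90.0 on the new one.
fraction_to_percent 0.9

# To keep reading fractions, request fident next to (or instead of) pident.
# queryDB, targetDB and resultDB are placeholder names:
#   mmseqs convertalis queryDB targetDB resultDB result.m8 \
#       --format-output "query,target,fident,pident"
```

Requesting both fields during a transition makes it easy to confirm which scale a downstream filter expects.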

Features

  • Support nucleotide clustering in cluster and easy-cluster
  • Support other architectures (SSE2/ARM64/POWER8/POWER9/etc) through SIMDe
  • Linclust is much faster on systems with a lot of CPU cores
  • Clustering update is faster, more stable and correctly deals with deleted sequences #272
  • Add easy workflow for reciprocal best hit searches easy-rbh
  • Add SILVA, Pfam-B, dbCAN2 to databases
  • databases produces taxonomy information for NR
  • Replace old greedy incremental clustering with new memory efficient version
  • Add result2dnamsa module to create MSAs of nucleotide sequences
  • Continued progress on profile-profile searching (result2pp, expandaln, expand2profile), stay tuned!
  • Add multi-parameter support to overwrite sequence type specific parameters: e.g. --gap-open "nucl:5,aa:11"
  • Add ORF information as output options to convertalis (qOrfStart/qOrfEnd, dbOrfStart, dbOrfEnd)
  • Speed up sorting using ips4o
  • Speed up masking through new version of tantan
  • Speed up multi-threaded writing of clustering results
  • Speed up reading of database indices and merging target split databases
  • Add memory tracking to account for index size when computing available memory (--split-memory-limit should be more reliable when searching/clustering billions of sequences).
  • Add --search-type 4 (translated/translated search) to createindex
  • Add convertalis --format-mode 3 HTML output based on MMseqs2 app (app.mmseqs.com)
  • Improve memory management in result2msa and result2profile modules
  • Add msa2result module to create an alignment result db from MSAs
  • Add filterresult to slim down result dbs with pairwise HHblits filtering #316
  • Add --kmers-per-sequence-scale to linsearch to extract a k-mer fraction instead of a fixed count
  • Add a random integer to --local-tmp path to avoid race conditions if multiple MMseqs2 runs happen on the same machine
  • Add --max-seqs to ungappedprefilter
  • Add --tax-lineage-mode 2 parameter to print numeric taxids
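
The multi-parameter syntax from the list above pairs a sequence type with a value, separated by commas. As a rough illustration of how to read such a string (this is not how MMseqs2 parses it internally, and the helper name multi_param_value is hypothetical), the value for one type can be looked up like this:

```shell
#!/bin/sh
# Hypothetical helper: look up the value for one sequence type in a
# multi-parameter string such as "nucl:5,aa:11".
multi_param_value() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F: -v key="$2" '$1 == key { print $2 }'
}

multi_param_value "nucl:5,aa:11" aa     # value used for amino acid input
multi_param_value "nucl:5,aa:11" nucl   # value used for nucleotide input

# Passing the string to MMseqs2 (queryDB, targetDB, resultDB and tmp are
# placeholder names). Quote it so the shell passes it as one argument:
#   mmseqs search queryDB targetDB resultDB tmp --gap-open "nucl:5,aa:11"
```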

Bugs fixed

  • rbh workflow was broken due to issues with filterdb
  • Fix -a in RBH search to show alignments
  • Fix PDB70 database creation in databases
  • Fix aria2c download support
  • Fix memory issues and MPI in kmermatcher
  • Fix memory issues in extractorfs when using AVX2
  • Fix --cluster-reassign to respect --cov-mode
  • Set-cover supports up to 2^32 sequences (previously crashed with more than 2^31)
  • Exit correctly if there is not enough disk space instead of crashing in the next module
  • Fix prefilter order instability when searching very redundant databases
  • Correctly parse keys from data files in filterdb --filter-file, this was causing instability in linsearch
  • Allow overwriting string parameters with empty strings
  • Fix ASAN issue in extractorf when using AVX2
  • Microtar would try to seek backwards constantly resulting in horrible gzip read performance
  • Avoid lookup writing to corrupt memory if an accession is too long
  • Fix various inconsistencies and usability issues in alignall:
    • --alignment-mode inconsistent with align module
    • --add-backtrace did not do anything
  • Fix restart of clusterings using reassignment cluster --cluster-reassign
  • Fix createdb did not correctly read gz/bzip files with --createdb-mode 1 #323

MMseqs2 Release 11-e1a1c

11 Feb 22:31

At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster. The new databases module helps to download and set up databases. We now have chat support at chat.mmseqs.com.

Known Issues

  • rbh crashes due to invalid sorting mode (#290)
  • Homebrew's macOS version does not use multiple cores (#289)
  • prefilter results can be unstable between different runs for extremely redundant databases (#277)
  • linclust/cluster can crash for very small input sets (#274)

Breaking Changes

  • kmermatcher --skip-n-repeat-kmer parameter was replaced with --ignore-multi-kmer
    Does not discard whole sequences anymore if a k-mer occurred too often, instead it skips the specific k-mers.
    Either mode is only used in Plass and not in Linclust
  • --lca-ranks from (easy-)taxonomy and lca has to be delimited with semicolons (;) instead of colons (:)
  • --dont-shuffle flag was renamed to --shuffle true/false
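
Existing scripts that pass colon-delimited ranks to --lca-ranks need the delimiter swapped. A minimal sketch, assuming a hypothetical helper named convert_lca_ranks and placeholder database names in the commented command:

```shell
#!/bin/sh
# Hypothetical helper: rewrite a colon-delimited rank list (old syntax)
# into the semicolon-delimited form expected from this release on.
convert_lca_ranks() {
    printf '%s\n' "$1" | tr ':' ';'
}

old_ranks="genus:family:order"
new_ranks=$(convert_lca_ranks "$old_ranks")
printf '%s\n' "$new_ranks"

# Quote the value so the shell does not treat ';' as a command separator.
# seqDB, targetDB, taxResult and tmp are placeholder names:
#   mmseqs taxonomy seqDB targetDB taxResult tmp --lca-ranks "$new_ranks"
```

The quoting matters more than before: an unquoted semicolon would end the mmseqs command early and the shell would try to run the remaining ranks as separate commands.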

Features

  • new databases workflow to list and download common databases.
    Supported databases:
  Name                	Type      	Taxonomy	Url
- UniRef100           	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniRef90            	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniRef50            	Aminoacid 	     yes	https://www.uniprot.org/help/uniref
- UniProtKB           	Aminoacid 	     yes	https://www.uniprot.org/help/uniprotkb
- UniProtKB/TrEMBL    	Aminoacid 	     yes	https://www.uniprot.org/help/uniprotkb
- UniProtKB/Swiss-Prot	Aminoacid 	     yes	https://uniprot.org
- NR                  	Aminoacid 	       -	https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- NT                  	Nucleotide	       -	https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA
- PDB                 	Aminoacid 	       -	https://www.rcsb.org
- PDB70               	Profile   	       -	https://github.com/soedinglab/hh-suite
- Pfam-A.full         	Profile   	       -	https://pfam.xfam.org
- Pfam-A.seed         	Profile   	       -	https://pfam.xfam.org
- eggNOG              	Profile   	       -	http://eggnog5.embl.de
- Resfinder           	Nucleotide	       -	https://cge.cbs.dtu.dk/services/ResFinder
- Kalamari            	Nucleotide	     yes	https://github.com/lskatz/Kalamari
  • (easy-)search --slice-search is now usable. Slice search finds all hits that fulfill the alignment criteria while using only as much disk space as defined by --disk-space-limit
  • createdb and the various easy- workflows learned to read query input from STDIN
  • taxonomyreport learned to display the summarized taxonomy result with Krona
  • new filtertaxseqdb module for filtering sequence DBs with taxonomy information according to provided taxa
  • --taxon-list parameter understands expressions. E.g. get all bacterial and human sequences --taxon-list "2||9606"
  • easy-search and convertalis can now output taxonomic information using --format-output
taxid      Taxonomic identifier
taxname    Taxon Name
taxlineage Taxonomic lineage
  • speed up in (easy-)cluster/linclust by improving k-mer extraction
  • MMseqs2 consistently creates .source and .lookup files to record which input file a sequence came from
    E.g.: mmseqs createdb input1.fa input2.fa seqDB each sequence in seqDB can tell if it came from input1.fa or input2.fa
  • createdb learned to index an existing (single-line-seq per entry) FASTA file without copying the FASTA content to a new database
  • align and rescorediagonal learned to align circular sequences
  • align exposes the z-drop parameter of its Banded Nucleotide alignment algorithm
  • reverseseq learned to reverse profiles
  • filterdb can filter rows with value within given percentage of first row
  • new aggregatetax module to assign a taxonomic label to contigs according to the fragments matched on the contig
  • Adjusting --max-seq-len is not required anymore, MMseqs2 automatically increases the length now.
  • MMseqs2 on Cygwin/Windows uses nedmalloc as its memory allocator now and does not massively slow down due to lock contention
  • new tar2db module to efficiently transform content of tar archives to MMseqs2 databases
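
The --taxon-list expression syntax from the list above combines taxonomic identifiers with || (logical OR). The sketch below builds such an expression from individual taxids. The helper name taxon_or is hypothetical and the database names in the commented command are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: join taxonomic identifiers with '||' (logical OR)
# to form a --taxon-list expression.
taxon_or() {
    expr_out=""
    for taxid in "$@"; do
        if [ -z "$expr_out" ]; then
            expr_out="$taxid"
        else
            expr_out="${expr_out}||${taxid}"
        fi
    done
    printf '%s\n' "$expr_out"
}

# 2 = Bacteria, 9606 = Homo sapiens (the example used in the notes above).
taxon_or 2 9606

# Quote the expression so the shell does not interpret '||' itself.
# seqTaxDB and filteredDB are placeholder names:
#   mmseqs filtertaxseqdb seqTaxDB filteredDB --taxon-list "$(taxon_or 2 9606)"
```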

Bug fixes

  • createindex would create corrupted indices for profile target databases
  • rbh workflow would create its result DB at an unexpected (wrong) location
  • (easy)-taxonomy --lca-mode 3 (Approx. LCA) was aligning invalid sequences in the second iteration and producing bad results
  • lca (and (easy)-taxonomy) add empty columns for unclassified sequences to be valid TSVs
  • kmermatcher uses xxhash for hashing now (faster)
  • kmermatcher avoids a crash when the machine does not have enough memory to process the data at once (affects linclust/cluster)
  • kmermatcher correctly deals with sequences longer than MAX_SHRT now
  • kmermatcher fixed various edge cases (e.g. alignment of 1-char sequences)
  • kmermatcher hash-shift would be ignored
  • offsetalignment could produce wrong results in the minus-strand
  • clust now correctly and consistently handles alignment DB input
  • clusthash better deals with nucleotide input now and several multi-threaded inefficiencies were resolved
  • (easy-)cluster --single-step-clustering could cluster unrelated sequences due to hash collisions
  • prefilter --diag-score 0 respects --min-ungapped-score
  • createseqfiledb could print empty sequence lines
  • taxonomyreport could crash if no sequence was unclassified
  • result2flat could crash with long sequence input
  • result2msa, result2profile, msa2profile backport filtering fix from HHblits
  • align could produce bad alignments if all sequence lengths in query DB were a lot shorter than in target DB
  • splitsequence fixed issues when combined with compression
  • result2profile fix Filter2 bug of HH-suite in MMseqs2
  • apply would crash due to reading wrong entry lengths
  • filterdb --filter-expression was not thread safe and could corrupt results
  • filterdb --extract-lines and --trim-to-one-column are compatible with each other

Developers

  • Internal representation of sequences changed from 4-byte per character to 1-byte per character
  • Compilation under AppleClang + libomp works now (see util/build_osx.sh)
  • Tools inheriting from MMseqs2 can now add their own citations
  • MMseqs2 on macOS compiles with the macOS 10.9 SDK (removed symlinkat call; relevant for bioconda)

MMseqs2 Release 10-6d92c

23 Aug 12:05

At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster.

Known Issues

  • High sensitivity searches (higher than -s 6) with precomputed indices should fail. Pass --db-load-mode 3 as a workaround to the MMseqs2 call.

Breaking Changes

  • Default taxonomy mode is assigning the same taxonomic label as the top hit. The previous "approximate 2bLCA" mode can be used with --lca-mode 3 or the non-approximated 2bLCA with --lca-mode 2
  • MMseqs2 will refuse to compile on compilers without OpenMP support (Use -DREQUIRE_OPENMP=0 to force a single-threaded no OpenMP build)
  • The confusingly named (and probably non-functional) --global-alignment parameter is gone
  • File names of the latest precompiled binaries changed. All archives contain a copy of the user guide and the MMseqs2 binary in the same subfolder (see further down for binaries of release 10-6d92c):
SIMD    Linux                      macOS                    Windows
SSE4.1  mmseqs-linux-sse41.tar.gz  mmseqs-osx-sse41.tar.gz  mmseqs-win64.zip
AVX2    mmseqs-linux-avx2.tar.gz   mmseqs-osx-avx2.tar.gz   -

Known Issues

  • MMseqs2 on Windows seems to not scale well on multiple threads
  • MMseqs2 on Windows can crash when built with AVX2 support (mostly on VMs)

Features

  • createindex can precompute split indices to improve runtime when searching against a database that is larger than the system memory. Precomputed databases also require less overhead RAM, since only the required parts are loaded
  • easy-search, easy-taxonomy, easy-linclust and easy-cluster workflows can take any number of query FASTA or FASTQ files
  • MMseqs2 validates database types. It will exit with an error message on wrong input, where it would previously crash
  • kmermatcher reports the diagonal with the most k-mer matches
  • kmermatcher scales the number of k-mers with sequence length (--kmer-per-seq-scale)
  • rescorediagonal got two new rescore modes, one for global alignment scoring and one for scoring a quasi global alignment fulfilling a local window criterion
  • Peak memory usage for reading in very large databases is greatly reduced. 128GB nodes should comfortably be able to deal with up to the maximum of 4.2 billion entries
  • Parameters taking byte values support syntax with a SI suffix (e.g., --split-memory-limit 64G)
  • Nucleotide substitution matrices should be user definable
  • Taxonomy report is compatible with Pavian. Thanks to Florian Breitwieser!
  • cluster workflow learned a reassignment mode --cluster-reassign. This mode corrects errors that occurred because of cascaded clustering
  • extractorfs can directly translate a nucleotide ORF to an amino acid sequence
  • result2stats can write TSV files
  • createsubdb supports softlinks instead of always hard copying the whole file to disk
  • reduced harddisk space usage for all cascaded clusterings
  • easy-taxonomy reports the top hit alignment as a separate output file with the suffix tophit_aln
  • createindex: the checks for whether an index needs to be recomputed were improved

Bug fixes

  • MMseqs2 did not compile on FreeBSD. Please let us know about free continuous integration options to make sure it will keep working in the future
  • proteinaln2nucl could return wrong coordinates
  • apply would deadlock when running with multiple threads
  • MPI searches are much more reliable. There were various issues around merging the separate results. The MPI split and merge logic is also integrated into the regression test suite
  • prefilter splits nucleotide searches if not enough memory is available
  • kmermatcher could corrupt memory
  • rescorediagonal could produce wrong sequence identities when aligning mixed-case sequences
  • macOS builds were not actually static (they still dynamically link libsystem, however)
  • lca module could corrupt memory and crash
  • createdb does not crash on systems with only 4GB of RAM anymore
  • AVX2 and SSE4.1 builds could produce slightly different results
  • summarizeresults does not crash on empty alignments results anymore
  • Fixed wrong tophit_report in easy-taxonomy
  • Precompiled Windows builds were broken
  • Precomputed indices of databases with very short sequences could truncate alignments if the query sequences were longer

Developers

  • Tools using MMseqs2 as a framework no longer need to export MMseqs2 modules again

  • MMseqs2 uses Azure Pipelines for all platforms to run our regression tests suite and provide precompiled binaries

  • MMseqs2 runs under ASan without any issues. We fixed various small memory leaks

  • The regression suite is directly linked through a submodule

    It can be used by running:

    git submodule update --init
    ./util/regression/run_regression.sh $PATH_TO_MMSEQS/mmseqs $TMP_DIR
    

MMseqs2 Release 9-d36de

04 May 03:26

At a glance: Improved taxonomy, colored user output, an improved computation progress bar, small speed-ups and many bug fixes

Features

  • Add support for Kraken style taxonomy reports. Thanks to Florian Breitwieser
  • New easy-taxonomy workflow
  • New progress bar to reduce output
  • Colored errors and warnings

Bugs

MMseqs2 Release 8-fac81

01 Apr 01:18
fac81fa

At a glance: Faster searches and clustering through improved IO and better seeding. More search modes like tblastx, reciprocal best hit and linsearch. New output format SAM. Support for compressed databases to reduce hard disk and memory requirements.

Known Issues

  • Iterative search only works up to 2 iterations

Breaking Changes

  • MMseqs2 now saves a lot on IO by not merging result datafiles
    There is still a single .index file, but the corresponding data files are split into multiple parts (as many as threads were used previously)
  • MMseqs2 now uses the VTML80 [1,2] substitution matrix to speed up the prefiltering (changeable by --seed-sub-mat). The final alignment is still computed with Blosum62 (still changeable by --sub-mat)
  • All databases now have a .dbtype file
  • MMseqs2 Docker image is now based on Debian instead of Alpine
  • Changed ORF header format to be more space efficient. The new format is now orignIdentifer startPos(-/+)len flag
  • prefilter returns ungapped-alignment scores instead of e-values
  • createindex: the file extension is now .idx instead of the previous .[s]k[6,7] format

Features

  • Support for tblastx-style nucl-nucl translated searches
    mmseqs search nuclDB1 nuclDb2 aln tmp --search-mode 2
  • Support for nucleotide searches
    mmseqs search nuclDB1 nuclDb2 aln tmp --search-mode 3
  • convertalis has learned to return SAM formatted output (preview)
  • Database can be compressed by applying zstd on each entry (--compressed 1)
    • Also added compress and decompress modules
  • rbh workflow for reciprocal best hit searches added
  • linclust can now cluster nucleotide sequences on both forward and reverse strand
  • Added linsearch, a lightning fast search for proteins and nucleotide sequences (preview; easy workflow variant easy-linsearch also added)
  • createlinindex computes an index for linsearch
  • taxonomy uses --orf-start-mode 1 to annotate more sequences
  • Added approx. 2bLCA to speed up computation. This is now the new default. The old mode can be turned on by --lca-mode 2
  • createdb recognizes sequences containing Uracil as DNA sequences
  • createdb is now faster through speeding up its shuffle operations
  • view module to view single entry in an MMseqs2 database
  • align module has learned --min-aln-len parameter to filter by minimal alignment length
  • Alignment modules (rescorediagonal, align) can align longer sequences now (not limited to 2^15 length)
  • Input sequences can now be soft-masked (lowercase masking) instead of only hard-masked (replacing with X) via --mask-lower-case. The masking only applies to the prefilter stages kmermatcher or prefilter and can be combined with --mask
  • filterdb has learned --filter-expression parameter and mode that allows filtering by simple mathematical expressions
  • alignbykmer can be used for nucleotide searches
  • MMseqs2 did-you-mean functionality gives better suggestions
  • MMseqs2 does not repeat the whole parameter list for each submodule call anymore
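
The difference between the soft masking and hard masking mentioned in the list above can be sketched in shell. The helper hard_mask is hypothetical and not part of MMseqs2. It only illustrates the two conventions on a FASTA stream (lowercase residues mark soft-masked regions, X marks hard-masked ones) and makes no claim about how MMseqs2 handles masking internally.

```shell
# Hypothetical helper, not part of MMseqs2: turn a soft-masked FASTA stream
# (masked residues in lowercase) into a hard-masked one (masked residues as X).
hard_mask() {
    # Header lines starting with ">" are left untouched;
    # on all other lines every lowercase letter becomes X
    sed '/^>/!s/[[:lower:]]/X/g'
}

# Example: the soft-masked region "ayi" is replaced by "XXX"
printf '>seq1 demo\nMKTayiAKQR\n' | hard_mask
```

Soft masking keeps the original residues readable, so stages that ignore the mask can still use them. Hard masking discards that information for every later step.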

Bugs

  • Default parameters of map workflow are now set correctly
  • Some modules were using the wrong coverage parameter
  • Sliced profile search was losing high E-value hits
  • Sliced profile search is now stable
  • Profile-Sequence alignment E-values were slightly too high
  • result2msa was crashing with profiles on the target side
  • result2msa should not crash with --allow-deletion anymore
  • Some parameters were never visible (with or without -h)
  • Various issues with MPI were resolved

Developers

  • Continuous integration enforces no compile warnings now
  • Continuous integration now tries to produce AArch64 builds with Docker and Qemu
  • We added a first draft of our developer guide to the wiki

References

[1] Müller T, Vingron M. Modeling amino acid replacement. J Comput Biol. 2000;7:761–76. doi: 10.1089/10665270050514918.

[2] Müller T, Spang R, Vingron M. Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method. Mol Biol Evol. 2002;19:8–13. doi: 10.1093/oxfordjournals.molbev.a003985