
MMseqs2 Release 10-6d92c

@martin-steinegger martin-steinegger released this 23 Aug 12:05
· 1239 commits to master since this release

At a glance: The MMseqs2 command line interface is cleaner and validates user input. Many MMseqs2 modules use less memory and run faster.

Known Issues

  • High sensitivity searches (higher than -s 6) with precomputed indices fail. As a workaround, pass --db-load-mode 3 to the MMseqs2 call.
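A minimal sketch of how the workaround might look in a search call. The database and directory names (queryDB, targetDB, resultDB, tmp) are placeholders, and -s 7.5 is only an example of a sensitivity above 6; neither comes from the release notes.

```shell
# Placeholder names (assumptions); replace with your own databases.
QUERY_DB="queryDB"
TARGET_DB="targetDB"   # assumed to have a precomputed index
RESULT_DB="resultDB"
TMP_DIR="tmp"

# -s 7.5 is an example sensitivity above 6.
# --db-load-mode 3 is the workaround named in the known issue.
set -- mmseqs search "$QUERY_DB" "$TARGET_DB" "$RESULT_DB" "$TMP_DIR" \
    -s 7.5 --db-load-mode 3

# Print the assembled command so it can be inspected before running.
cmd_line="$*"
echo "$cmd_line"

# Run it only if mmseqs is installed and the query database exists.
if command -v mmseqs >/dev/null 2>&1 && [ -e "$QUERY_DB" ]; then
    "$@"
fi
```

The guard at the end keeps the snippet harmless on machines without MMseqs2; drop it once the paths point at real databases.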

Breaking Changes

  • The default taxonomy mode now assigns the same taxonomic label as the top hit. The previous "approximate 2bLCA" mode is available with --lca-mode 3, and the non-approximated 2bLCA with --lca-mode 2
  • MMseqs2 will refuse to compile on compilers without OpenMP support (use -DREQUIRE_OPENMP=0 to force a single-threaded build without OpenMP)
  • The confusingly named (and probably non-functional) --global-alignment parameter is gone
  • File names of the latest precompiled binaries changed. All archives contain a copy of the user guide and the MMseqs2 binary in the same subfolder (see further down for binaries of release 10-6d92c):
    SIMD     Linux                       macOS                     Windows
    SSE4.1   mmseqs-linux-sse41.tar.gz   mmseqs-osx-sse41.tar.gz   mmseqs-win64.zip
    AVX2     mmseqs-linux-avx2.tar.gz    mmseqs-osx-avx2.tar.gz    -
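For scripted installs, the new archive names can be looked up from the platform and SIMD level. The sketch below is only an illustration: archive_name is a hypothetical helper, the os/simd keys are made up for this example, and the AVX2 check via /proc/cpuinfo is an assumption that only applies to Linux.

```shell
# Hypothetical helper: map an OS key and SIMD key to the archive name
# listed in the release notes. Returns non-zero for unlisted combinations.
archive_name() {
    case "$1:$2" in
        linux:sse41) echo "mmseqs-linux-sse41.tar.gz" ;;
        linux:avx2)  echo "mmseqs-linux-avx2.tar.gz" ;;
        osx:sse41)   echo "mmseqs-osx-sse41.tar.gz" ;;
        osx:avx2)    echo "mmseqs-osx-avx2.tar.gz" ;;
        win64:sse41) echo "mmseqs-win64.zip" ;;
        *)           echo "no precompiled archive for $1 $2" >&2
                     return 1 ;;
    esac
}

# Linux-only assumption: treat the CPU as AVX2-capable if the flag
# appears in /proc/cpuinfo, otherwise fall back to SSE4.1.
simd="sse41"
if [ -r /proc/cpuinfo ] && grep -q avx2 /proc/cpuinfo; then
    simd="avx2"
fi
archive_name linux "$simd"
```

There is no Windows AVX2 archive in the list, so the helper reports that combination as unavailable rather than guessing a name.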

Known Issues

  • MMseqs2 on Windows seems to not scale well on multiple threads
  • MMseqs2 on Windows can crash when built with AVX2 support (mostly on VMs)

Features

  • createindex can precompute split indices to improve runtime when searching against a database that is larger than the system memory. Precomputed databases also require less overhead RAM, since only the required parts are loaded
  • easy-search, easy-taxonomy, easy-linclust and easy-cluster workflows can take any number of query FASTA or FASTQ files
  • MMseqs2 validates database types. It will exit with an error message on wrong input, where it would previously crash
  • kmermatcher reports the diagonal with the most k-mer matches
  • kmermatcher scales the number of k-mers with sequence length (--kmer-per-seq-scale)
  • rescorediagonal got two new rescore modes, one for global alignment scoring and one for scoring a quasi-global alignment fulfilling a local window criterion
  • Peak memory usage for reading in very large databases is greatly reduced. 128GB nodes should comfortably be able to deal with up to the maximum of 4.2 billion entries
  • Parameters taking byte values support syntax with a SI suffix (e.g., --split-memory-limit 64G)
  • Nucleotide substitution matrices are user definable
  • Taxonomy report is compatible with Pavian. Thanks to Florian Breitwieser!
  • cluster workflow learned a reassignment mode --cluster-reassign. This mode corrects errors that occurred because of cascaded clustering
  • extractorfs can directly translate a nucleotide ORF to an amino acid sequence
  • result2stats can write TSV files
  • createsubdb supports softlinks instead of always hard copying the whole file to disk
  • Reduced hard disk space usage for all cascaded clusterings
  • easy-taxonomy reports the top hit alignment as a separate output file with the suffix tophit_aln
  • The checks createindex uses to decide whether an index needs to be recomputed were improved
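Two of the features above, multiple query files in the easy workflows and byte values with a suffix, might be combined as in the following sketch. All file and directory names are placeholders invented for this example.

```shell
# Placeholder names (assumptions); replace with your own files.
TARGET="targetDB"
RESULT="result.m8"
TMP_DIR="tmp"

# Several FASTA/FASTQ query files are listed before the target.
# --split-memory-limit takes a byte value with a suffix, here 64G.
set -- mmseqs easy-search sample1.fasta sample2.fasta reads.fastq \
    "$TARGET" "$RESULT" "$TMP_DIR" --split-memory-limit 64G

# Print the assembled command so it can be inspected before running.
cmd_line="$*"
echo "$cmd_line"

# Run it only if mmseqs is installed and the first query file exists.
if command -v mmseqs >/dev/null 2>&1 && [ -e sample1.fasta ]; then
    "$@"
fi
```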

Bug fixes

  • MMseqs2 did not compile on FreeBSD. Please let us know about free continuous integration options to make sure it will keep working in the future
  • proteinaln2nucl could return wrong coordinates
  • apply would deadlock when running with multiple threads
  • MPI searches are much more reliable. There were various issues around merging the separate results. The MPI split and merge logic is also integrated into the regression test suite
  • prefilter splits nucleotide searches if not enough memory is available
  • kmermatcher could corrupt memory
  • rescorediagonal could produce wrong sequence identities when aligning mixed-case sequences
  • macOS builds were not actually static (they still dynamically link libsystem, however)
  • lca module could corrupt memory and crash
  • createdb does not crash on systems with only 4GB of RAM anymore
  • AVX2 and SSE4.1 builds could produce slightly different results
  • summarizeresults does not crash on empty alignments results anymore
  • easy-taxonomy could write a wrong tophit_report
  • Precompiled Windows builds were broken
  • Precomputed indices of databases with very short sequences could truncate alignments if the query sequences were longer

Developers

  • Tools using MMseqs2 as a framework no longer need to export MMseqs2 modules again

  • MMseqs2 uses Azure Pipelines on all platforms to run the regression test suite and to provide precompiled binaries

  • MMseqs2 runs under ASan without any issues. We fixed various small memory leaks

  • The regression suite is directly linked through a submodule

    It can be used by running:

    git submodule update --init
    ./util/regression/run_regression.sh $PATH_TO_MMSEQS/mmseqs $TMP_DIR