Installation & Dependencies

jszarvas edited this page Jul 5, 2024 · 39 revisions

MIntO is designed to be easy to install if you follow these instructions.

1. Prerequisites

Some of the tasks in metagenomic or metatranscriptomic data analysis are memory- and CPU-heavy. We therefore recommend a powerful server or, even better, a high-performance compute cluster.

1.1. Hardware requirements

Our recommendation is to use a 64-bit Linux server with at least 16 CPUs and 64 GB RAM. More CPUs and memory will shorten the wall-clock run time.

We benchmarked MIntO using a 64-bit Linux server with 2x AMD EPYC 7742 64-Core processors and 2TB of memory (Saenz et al, 2022). When restricting Snakemake to use 700GB of memory and 96 CPUs, the running time for processing the tutorial dataset was 160 minutes.

For users of the Danish Computerome supercomputer, we also benchmarked MIntO on a single dedicated node (type: thinnode) that offers 40 CPUs and 188 GB memory.

1.2. Software requirements

Basic requirements

MIntO expects some commonly used Linux commands to be available on your server. Most contemporary Linux distributions have these installed. If yours is a particularly lean installation and is missing any of them, please contact your system administrator to have them installed. If you are using a queuing system on a cluster, these commands must be available on all execution nodes. These are:

  1. rsync
  2. wget
  3. curl
  4. gzip
  5. zcat
  6. tar
  7. sed
  8. awk
  9. mkfifo
  10. true
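
A quick way to check for these commands is a small shell loop. The sketch below assumes a POSIX shell; the function name missing_commands is ours and not part of MIntO. On a cluster, it is worth running the same check on an execution node as well.

```shell
# Print every command given as an argument that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# The command names are copied from the list above.
missing=$(missing_commands rsync wget curl gzip zcat tar sed awk mkfifo true)
if [ -z "$missing" ]; then
    echo "All required commands are available."
else
    echo "Missing commands:" $missing
fi
```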

Only special prerequisite

The only non-standard software prerequisite is conda. If conda is already installed on your server, you can proceed directly to MIntO Installation. Otherwise, Miniconda can be downloaded and installed following these instructions.

Please note:

  • We tested our pipeline on a 64-bit Linux system using Anaconda3.
  • Downloading and installing Miniconda requires at least 0.5 GB of free disk space.
  • After installing conda, close the console and open a new terminal for the changes to take effect.

2. MIntO Installation

MIntO uses Snakemake as the workflow software.

2.1. Install Snakemake

We recommend that you create a conda environment exclusively for MIntO that includes Snakemake. This way, you can change any configuration within that environment without affecting the rest of your system.

Please note:

  • We have pinned Snakemake to v7, as the interface changed dramatically in v8 and MIntO uses some v7-specific interfaces.
  • We have also pinned Python to v3.11, since there is a known issue with v3.12 breaking f-string expansion in Snakemake.

You can install Snakemake as follows:

$ conda create -c conda-forge -n MIntO python=3.11 mamba
$ conda activate MIntO
$ mamba install -c conda-forge -c bioconda snakemake=7

Check the official Snakemake v7.32.3 installation instructions if you need further details.
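
After installation, you may want to confirm that the activated environment really provides Snakemake v7. The following is a minimal sketch, assuming snakemake --version prints a plain version string such as 7.32.3; the helper name major_matches is ours.

```shell
# Succeed only if the major component of a version string equals the expected value.
major_matches() {
    # $1: version string such as "7.32.3"; $2: expected major version such as "7"
    [ "${1%%.*}" = "$2" ]
}

# Run this inside the activated MIntO environment.
if command -v snakemake >/dev/null 2>&1; then
    version=$(snakemake --version)
    if major_matches "$version" 7; then
        echo "Snakemake $version found."
    else
        echo "Unexpected Snakemake version: $version"
    fi
else
    echo "snakemake not found; is the MIntO environment activated?"
fi
```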

2.2. Choose MIntO installation location

You have to choose an installation location. Please make sure that there is enough space available in that location. With the dependencies installed and all conda environments created, MIntO occupied 84 GB on our server. For this tutorial, let us assume that you want MIntO to be installed in /server/apps/MIntO.
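
To check the free space before installing, something like the following could be used. This is a sketch only: INSTALL_PARENT is a hypothetical variable (for this tutorial it would be /server/apps), free_gb is our own helper, and the 84 GB threshold is simply the figure we observed on our server.

```shell
# Print the free space, in whole GB, of the filesystem that holds a directory.
free_gb() {
    # df -P gives one line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# INSTALL_PARENT is hypothetical; it falls back to the current directory.
target=${INSTALL_PARENT:-$PWD}
needed_gb=84   # size observed on our server

avail_gb=$(free_gb "$target")
if [ "${avail_gb:-0}" -ge "$needed_gb" ]; then
    echo "$target: ${avail_gb} GB free, enough for MIntO."
else
    echo "$target: only ${avail_gb:-0} GB free, ${needed_gb} GB recommended."
fi
```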

2.3. Get MIntO

Now, clone the MIntO repository into the installation location. Since you want MIntO to be installed in /server/apps/MIntO, you can do the following:

$ cd /server/apps
$ git clone https://github.com/arumugamlab/MIntO.git

This will clone the MIntO code into /server/apps/MIntO. Now switch to a thoroughly tested stable release, as the latest version on GitHub may be unstable or have undetected bugs.

$ cd MIntO
$ git checkout 2.1.0

Now you have MIntO stable release 2.1.0 installed on your server.

We will use $MINTO_DIR in the rest of the tutorial to denote this installation location. If you copy-paste commands in Section 3, please set the variable to your installation location. E.g.,

$ MINTO_DIR=/server/apps/MIntO

NOTE: Always use a stable release (currently the latest stable release is 2.1.0). The latest version on GitHub might have undetected bugs and may behave unexpectedly. If you want to use it, please be aware of the risks involved.

3. Download MIntO dependencies and databases

MIntO uses compartmentalized conda environments containing the software relevant for the individual steps in the workflow. You therefore need to install all the software dependencies and download the databases that they use. Luckily, we provide a script that does this automatically for you.

3.1. Edit the dependency-installation configuration file

MIntO will install all dependencies automatically using Snakemake, but it needs to know where MIntO itself was installed. This should be given in the minto_dir field in $MINTO_DIR/configuration/dependencies.yaml.

Check the repository-version of the dependencies.yaml file.

The user has to specify the number of threads and GB per thread to use when downloading the databases and indexing the rRNA database. Here is an example of a completed configuration file:

######################
# General settings
######################
minto_dir: /server/apps/MIntO

download_memory: 5
download_threads: 2
rRNA_index_memory: 5
rRNA_index_threads: 2

Note that we did not write $MINTO_DIR in the dependencies.yaml file above, but the actual location.
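
If you prefer not to edit the file by hand, the minto_dir line can be filled in from the shell, so that the expanded path rather than the literal text $MINTO_DIR ends up in the file. The helper below is a sketch under our own assumptions: set_minto_dir is not part of MIntO, and the installation path must not contain the characters | or &, which have special meaning in the sed expression.

```shell
# Rewrite the minto_dir line of a YAML file so that it holds a literal path.
set_minto_dir() {
    # $1: path to dependencies.yaml; $2: installation directory.
    tmp=$(mktemp) || return 1
    # Write to a temporary file first, then copy back to keep the file's permissions.
    sed "s|^minto_dir:.*|minto_dir: $2|" "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example call, once MINTO_DIR is set in your shell:
# set_minto_dir "$MINTO_DIR/configuration/dependencies.yaml" "$MINTO_DIR"
# grep '^minto_dir:' "$MINTO_DIR/configuration/dependencies.yaml"
```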

3.2. Install the dependencies

MIntO uses several dependencies and their databases, which can be downloaded using the Snakemake script dependencies.smk and the configuration file dependencies.yaml.
The download is expected to take several hours, depending on the network speed, so we recommend running it in a detachable terminal session (screen, tmux, etc.) or as an HPC job.

The Snakemake script can be launched like this (the arguments are explained in the sections below):

snakemake --use-conda --restart-times 1 --keep-going --latency-wait 60 \
--jobs 4 --cores 16 --conda-prefix $MINTO_DIR/conda_env --shadow-prefix $(pwd) \
--snakefile $MINTO_DIR/smk/dependencies.smk --configfile $MINTO_DIR/configuration/dependencies.yaml

3.2.1. Databases

The database versions are given in dependencies.smk, and by default they are going to be placed into $MINTO_DIR/data.

Check the repository-version of the dependencies.smk file.

3.2.1. Virtual environments with conda

MIntO extensively uses conda environments to manage different tasks that need different software packages.

  1. We specify --use-conda on the Snakemake command line to enable that.
  2. We also specify --conda-prefix to tell Snakemake where all the conda environments within MIntO should be installed.

This part of the arguments to snakemake looks like: --use-conda --conda-prefix $MINTO_DIR/conda_env

NOTE:

The location for --conda-prefix should not be confused with your system conda installation or environment location. This is just a folder where Snakemake will organize all environments on behalf of MIntO. We recommend that you do not use the system directories here. Our recommendation is to use --conda-prefix $MINTO_DIR/conda_env or a project subfolder.

3.2.2. Snakemake shadow directive

We will also use the shadow directive from Snakemake, which is a very elegant way to run I/O-intensive tasks on a local disk of the compute server. See the Snakemake shadow rule documentation. For this to work, you should provide a directory on the local disk with enough space using the --shadow-prefix option.
Many compute clusters provide a fast disk in /scratch for this purpose. Within the folder given by --shadow-prefix, Snakemake will create a unique temporary directory for each rule. The advantage of using the shadow directive is that you do not need to clean up after you are done: the entire temporary directory is deleted after the rule completes successfully.

NOTE:

MIntO is not going to run without a --shadow-prefix location. Use the current folder $(pwd) if a local disk is not accessible.
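
The fallback described in this note can be scripted. The sketch below assumes that your cluster offers a per-user directory at /scratch/$USER, which is only a guess about your site's layout; pick_shadow_prefix and SHADOW_PREFIX are names of our own choosing.

```shell
# Print the first argument if it is a writable directory, otherwise the current directory.
pick_shadow_prefix() {
    if [ -d "$1" ] && [ -w "$1" ]; then
        printf '%s\n' "$1"
    else
        pwd
    fi
}

# /scratch/$USER is an assumption about the cluster layout; adjust it for your site.
SHADOW_PREFIX=$(pick_shadow_prefix "/scratch/${USER:-unknown}")
echo "Using --shadow-prefix $SHADOW_PREFIX"
```

The resulting value can then be passed to Snakemake as --shadow-prefix "$SHADOW_PREFIX".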

3.2.3. Other Snakemake arguments

We will also use the following Snakemake arguments:

--restart-times 1 --keep-going --latency-wait 60

These options tell Snakemake to:

  1. try re-running a failed rule a second time (just in case!),
  2. continue running the workflow even if some rules fail (imagine going home for the night and coming back the next day to find that the workflow quit 10 minutes after you left),
  3. wait for 60 seconds before giving up on an output file being produced (this could be due to NFS delays).

For downloading dependencies, we will also use:

  • --jobs 4 to ask Snakemake to run at most 4 jobs at a time,
  • --cores 16 to ask Snakemake to use at most 16 CPUs at any time (this limit applies across the entire workflow).

3.2.4. The final Snakemake command line

Finally, we specify the Snakemake script using --snakefile and the configuration file using --configfile:

snakemake --use-conda --restart-times 1 --keep-going --latency-wait 60 \
--jobs 4 --cores 16 --conda-prefix $MINTO_DIR/conda_env --shadow-prefix /scratch/johndoe \
--snakefile $MINTO_DIR/smk/dependencies.smk --configfile $MINTO_DIR/configuration/dependencies.yaml

TIP:

You can add --dry-run -p to the snakemake command line for a dry run showing the commands that would be run, but without actually doing anything.
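
Since the same long command is typically run twice (first as a dry run, then for real), it can be convenient to wrap it in a shell function. This is a sketch, not something MIntO ships: run_dependencies is a hypothetical helper, and it reads MINTO_DIR, SHADOW_PREFIX and (optionally) SNAKEMAKE from the environment.

```shell
# Hypothetical helper: run the dependency workflow with the arguments used above,
# followed by any extra arguments supplied by the caller.
run_dependencies() {
    "${SNAKEMAKE:-snakemake}" --use-conda --restart-times 1 --keep-going --latency-wait 60 \
        --jobs 4 --cores 16 \
        --conda-prefix "$MINTO_DIR/conda_env" \
        --shadow-prefix "${SHADOW_PREFIX:-$(pwd)}" \
        --snakefile "$MINTO_DIR/smk/dependencies.smk" \
        --configfile "$MINTO_DIR/configuration/dependencies.yaml" \
        "$@"
}

# Preview the commands first, then run for real:
# run_dependencies --dry-run -p
# run_dependencies
```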

3.3. Snakemake and MIntO can work with some queuing systems

If you have a queuing system like Slurm, you can let Snakemake submit jobs to the cluster by specifying --cluster:

snakemake --use-conda --restart-times 1 --keep-going --latency-wait 60 \
--jobs 4 --cluster "sbatch -J {name} --mem={resources.mem}G -c {threads} -e slurm-%x.e%A -o slurm-%x.o%A" \
--conda-prefix $MINTO_DIR/conda_env --shadow-prefix /scratch/johndoe \
--snakefile $MINTO_DIR/smk/dependencies.smk --configfile $MINTO_DIR/configuration/dependencies.yaml

In the above example, we use the sbatch command from Slurm. For other queuing systems, please consult their documentation.

The parameters for cluster execution and all other Snakemake arguments can also be provided in a configuration file with the --profile argument, for example --profile $MINTO_DIR/configuration/simple_cluster/.
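
To illustrate what such a profile can look like, the sketch below writes a minimal Snakemake v7 profile whose keys mirror the long command-line options used above. It is our own example and not the content of MIntO's simple_cluster profile; write_profile and the target directory are hypothetical.

```shell
# Write a minimal Snakemake v7 profile: a directory containing config.yaml,
# where each key is the long form of a command-line option.
write_profile() {
    # $1: profile directory to create (hypothetical location)
    mkdir -p "$1" &&
    cat > "$1/config.yaml" <<'EOF'
cluster: "sbatch -J {name} --mem={resources.mem}G -c {threads} -e slurm-%x.e%A -o slurm-%x.o%A"
jobs: 4
use-conda: true
restart-times: 1
keep-going: true
latency-wait: 60
EOF
}

# Example use:
# write_profile "$HOME/minto_profile"
# snakemake --profile "$HOME/minto_profile" \
#     --conda-prefix "$MINTO_DIR/conda_env" --shadow-prefix /scratch/johndoe \
#     --snakefile "$MINTO_DIR/smk/dependencies.smk" \
#     --configfile "$MINTO_DIR/configuration/dependencies.yaml"
```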

3.4. Enabling the use of GPUs for AVAMB (avamb.yaml)

For generating metagenome-assembled genomes (MAGs), MIntO uses AVAMB (Nissen et al, Nature biotech 2021), which is among the best MAG-generation tools out there. AVAMB uses pytorch and will be significantly faster if you run it on a GPU. But this needs to be set up properly.

By default, MIntO uses a conda environment for the non-GPU version of AVAMB, thus it will run without GPU ('VAMB_GPU: no' in mags_generation.yaml configuration file). If you want to run AVAMB with GPU, you should:

  1. run dependencies.smk on a server that has GPU(s) and CUDA libraries installed, which will in turn install the GPU version of pytorch,
  2. change the setting to 'VAMB_GPU: yes' in mags_generation.yaml when running the MAG generation step.

It worked seamlessly in our case. However, you may have to experiment a little, as driver mismatches are a common obstacle to getting GPUs to work.
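
Before running dependencies.smk for the GPU setup, it can help to confirm that the server actually exposes a GPU. The check below is a rough sketch that assumes an NVIDIA GPU with the nvidia-smi tool installed; gpu_visible is our own helper, and a positive result does not guarantee that the CUDA libraries and drivers match.

```shell
# Succeed if nvidia-smi is installed and lists at least one GPU.
gpu_visible() {
    command -v nvidia-smi >/dev/null 2>&1 &&
        nvidia-smi -L 2>/dev/null | grep -q '^GPU '
}

if gpu_visible; then
    echo "GPU detected: the GPU setup (VAMB_GPU: yes) may be possible on this server."
else
    echo "No GPU detected: keep the default (VAMB_GPU: no)."
fi
```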