Installation & Dependencies
We have designed MIntO installation to be easy if you follow these instructions.
Some of the tasks in metagenomic or meta-transcriptomic data analysis are memory- and cpu-heavy. Therefore, it is recommended that you have a powerful server, or even better a high-performance compute cluster.
Our recommendation is to use a 64-bit Linux server with at least 16 CPUs and 64 GB RAM. More CPUs and memory will result in shorter wall-clock execution times.
We benchmarked MIntO using a 64-bit Linux server with 2x AMD EPYC 7742 64-core processors and 2 TB of memory (Saenz et al, 2022). When restricting Snakemake to use 700 GB of memory and 96 CPUs, the running time for processing the tutorial dataset was 160 minutes.
For users of the Danish Computerome supercomputer, we also benchmarked MIntO using a single dedicated node (type: thinnode) that offers 40 CPUs and 188 GB of memory.
MIntO expects that your server has some commonly used linux commands available. Most contemporary linux distributions have these installed. If yours is a specifically lean installation and missing any of these, please contact your system administrator to have them installed. If you are using queuing systems in a cluster, these commands should be available in all the execution nodes. These are:
rsync
wget
curl
gzip
zcat
tar
sed
awk
mkfifo
true
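If you want to make sure all of these are available (including on the execution nodes of your cluster), a quick check along these lines should work:
$ for cmd in rsync wget curl gzip zcat tar sed awk mkfifo true; do command -v $cmd >/dev/null || echo "missing: $cmd"; done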
The only non-standard software prerequisite is to have conda installed on your server. If conda is already installed, you can directly proceed to MIntO Installation. Otherwise, Miniconda can be downloaded and installed following these instructions.
Please note:
- We tested our pipeline on a 64-bit Linux system using Anaconda3.
- In order to download and install Miniconda, there should be a minimum of 0.5 GB of free space on your disk.
- After installing conda, please close the console for the changes to take effect. You might have to open a new terminal.
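For reference, the Miniconda installer for 64-bit Linux can be downloaded and run as follows (installer URL as of writing; adjust if your architecture differs):
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh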
MIntO uses Snakemake as the workflow software.
We recommend that you create a conda environment exclusively for MIntO that includes Snakemake. This way, any configuration changes stay contained within that conda environment.
Please note:
- We have pinned Snakemake to version v7, as the interface has changed dramatically in v8. MIntO uses some v7-specific interfaces.
- We have also pinned Python to version v3.11, since there is a known issue with v3.12 interfering with f-string expansion in Snakemake.
You can install Snakemake as follows:
$ conda create -c conda-forge -n MIntO python=3.11 mamba
$ conda activate MIntO
$ mamba install -c conda-forge -c bioconda snakemake=7
Check the official Snakemake v7.32.3 installation instructions if you need further details.
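You can verify that the environment was set up correctly by activating it and checking the installed version, which should report a 7.x release:
$ conda activate MIntO
$ snakemake --version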
You have to choose an installation location. Please make sure that there is enough space available in that location: once the dependencies were installed and all conda environments created, they occupied 84 GB on our server. Let us assume during this tutorial that you want MIntO to be installed in /server/apps/MIntO.
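To make sure the chosen location has enough free space, a simple check is (the path is just the example location used in this tutorial):
$ df -h /server/apps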
Now, clone the MIntO repository into the installation location. Since you want MIntO to be installed in /server/apps/MIntO, you can do the following:
$ cd /server/apps
$ git clone https://github.com/arumugamlab/MIntO.git
This will clone the MIntO code into /server/apps/MIntO. Now switch to a thoroughly tested stable release, as the latest code may be unstable or have undetected bugs.
$ cd MIntO
$ git checkout 2.1.0
Now you have MIntO stable release 2.1.0 installed on your server.
We will use $MINTO_DIR in the rest of the tutorial to denote this installation location. If you copy-paste commands in Section 3, please set the variable to your installation location. E.g.,
$ MINTO_DIR=/server/apps/MIntO
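If you want this variable to be set automatically in new terminal sessions, you could also add it to your shell startup file (assuming bash; adapt for other shells):
$ echo 'export MINTO_DIR=/server/apps/MIntO' >> ~/.bashrc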
NOTE: Always use a stable release (currently the latest stable release is 2.1.0). The latest version on GitHub might have undetected bugs and may behave unexpectedly. If you want to use it, please acknowledge the risks involved.
MIntO uses compartmentalized conda environments with the software relevant for the individual steps in the workflow. Therefore, you need to install all the necessary software dependencies and download the databases that they use. Luckily, we provide a script that does this automatically for you.
MIntO will install all dependencies automatically using Snakemake, but it needs to know where MIntO itself was installed. This should be given in the minto_dir field in $MINTO_DIR/configuration/dependencies.yaml.
Check the repository-version of the dependencies.yaml file.
The user has to specify the number of threads and GB per thread to use when downloading the databases and indexing the rRNA database. Here is an example of a completed configuration file:
######################
# General settings
######################
minto_dir: /server/apps/MIntO
download_memory: 5
download_threads: 2
rRNA_index_memory: 5
rRNA_index_threads: 2
Note that we did not write $MINTO_DIR in the dependencies.yaml file above, but the actual location.
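If you prefer to generate the file from the command line, a heredoc like the one below will expand $MINTO_DIR into the actual path for you. This is only a sketch based on the example above; check the repository version of dependencies.yaml to make sure no other fields are required before overwriting it:
$ cat > $MINTO_DIR/configuration/dependencies.yaml <<EOF
######################
# General settings
######################
minto_dir: $MINTO_DIR
download_memory: 5
download_threads: 2
rRNA_index_memory: 5
rRNA_index_threads: 2
EOF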
MIntO uses several dependencies and their databases, which can be downloaded using the Snakemake script dependencies.smk and the configuration file dependencies.yaml.
The download is expected to take several hours depending on the network speed, so it is recommended to run it in a detachable terminal session (screen, tmux, etc.) or as an HPC job.
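For example, with tmux you could start a named session, launch the download inside it, and detach with Ctrl-b d (the session name is arbitrary):
$ tmux new -s minto_deps
# run the snakemake command shown below inside the session, then detach
$ tmux attach -t minto_deps   # re-attach later to check progress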
The Snakemake script can be launched like this; see below for more details:
snakemake --use-conda --restart-times 1 --keep-going --latency-wait 60 \
--jobs 4 --cores 16 --conda-prefix $MINTO_DIR/conda_env --shadow-prefix $(pwd) \
--snakefile $MINTO_DIR/smk/dependencies.smk --configfile $MINTO_DIR/configuration/dependencies.yaml
The database versions are given in dependencies.smk, and by default they will be placed into $MINTO_DIR/data.
Check the repository-version of the dependencies.smk file.
MIntO extensively uses conda environments to manage the different tasks that need different software packages.
- We specify --use-conda on the Snakemake command line to enable this.
- We also specify --conda-prefix to tell Snakemake where all the conda environments within MIntO should be installed.
This part of the arguments to snakemake looks like: --use-conda --conda-prefix $MINTO_DIR/conda_env
NOTE:
The location for --conda-prefix should not be confused with your system conda installation or environment location. This is just a folder where Snakemake will organize all environments on behalf of MIntO. We recommend that you do not use the system directories here. Our recommendation is to use --conda-prefix $MINTO_DIR/conda_env or a project subfolder.
We will also use the shadow directive from Snakemake, which is an elegant way to run I/O-intensive tasks on the local disk of a compute server. See the Snakemake shadow rule documentation. For this to work, the user should provide a directory on the local disk with enough space using the --shadow-prefix option.
Many compute clusters provide a fast disk in /scratch for this purpose.
Within the folder specified by --shadow-prefix, Snakemake will create a unique temporary directory for each rule. The advantage of using the shadow directive is that you do not need to clean up after you are done: the entire temporary directory will be deleted after the rule completes successfully.
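For example, on a cluster with a /scratch disk you could create a per-user shadow directory (the path is only an example; use whatever fast local disk your site provides):
$ mkdir -p /scratch/$USER/minto_shadow
and then pass --shadow-prefix /scratch/$USER/minto_shadow to the snakemake commands below.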
NOTE:
MIntO is not going to run without a --shadow-prefix location. Use the current folder $(pwd) if a local disk is not accessible.
We will also use the following Snakemake arguments:
--restart-times 1 --keep-going --latency-wait 60
These options tell Snakemake to:
- try re-running a failed rule a second time (just in case!),
- continue running the workflow even if some rules fail (imagine going home for the night and coming back the next day to find that the workflow quit 10 minutes after you left!),
- wait for 60 seconds before giving up on an output file being produced (this could be due to NFS delays).
For downloading dependencies, we will also use:
- --jobs 4 to ask Snakemake to run a maximum of 4 jobs at a time,
- --cores 16 to ask Snakemake to use a maximum of 16 CPUs at any time (this limit applies across the entire workflow at any given time).
Finally, we will specify the Snakemake script using --snakefile and the configuration file using --configfile:
snakemake --use-conda --restart-times 1 --keep-going --latency-wait 60 \
--jobs 4 --cores 16 --conda-prefix $MINTO_DIR/conda_env --shadow-prefix /scratch/johndoe \
--snakefile $MINTO_DIR/smk/dependencies.smk --configfile $MINTO_DIR/configuration/dependencies.yaml
TIP:
You can add --dry-run -p to the snakemake command line for a dry run showing the commands that would be run, without actually executing anything.
If you have a queuing system like slurm, you can let Snakemake submit jobs to the cluster by specifying --cluster:
snakemake --use-conda --restart-times 1 --keep-going --latency-wait 60 \
--jobs 4 --cluster "sbatch -J {name} --mem={resources.mem}G -c {threads} -e slurm-%x.e%A -o slurm-%x.o%A" \
--conda-prefix $MINTO_DIR/conda_env --shadow-prefix /scratch/johndoe \
--snakefile $MINTO_DIR/smk/dependencies.smk --configfile $MINTO_DIR/configuration/dependencies.yaml
In the above example, we use the sbatch command from slurm. For other queuing systems, please consult the documentation of your queuing system.
The parameters for cluster execution and all other Snakemake arguments can also be provided in a configuration file with the --profile argument, for example --profile $MINTO_DIR/configuration/simple_cluster/.
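A Snakemake v7 profile is just a directory containing a config.yaml whose keys mirror the long command-line options. As a minimal sketch (the directory name my_profile is hypothetical, and the actual contents of the simple_cluster profile shipped with MIntO may differ), it could be created and used like this:
$ mkdir -p my_profile
$ cat > my_profile/config.yaml <<EOF
use-conda: true
conda-prefix: $MINTO_DIR/conda_env
restart-times: 1
keep-going: true
latency-wait: 60
jobs: 4
cluster: "sbatch -J {name} --mem={resources.mem}G -c {threads} -e slurm-%x.e%A -o slurm-%x.o%A"
EOF
$ snakemake --profile my_profile/ --shadow-prefix /scratch/johndoe \
  --snakefile $MINTO_DIR/smk/dependencies.smk --configfile $MINTO_DIR/configuration/dependencies.yaml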
For generating metagenome-assembled genomes (MAGs), MIntO uses AVAMB (Nissen et al, Nature Biotechnology 2021), which is among the best MAG-generation tools available. AVAMB uses pytorch and will be significantly faster if you run it on a GPU, but this needs to be set up properly.
By default, MIntO uses a conda environment with the non-GPU version of AVAMB, so it will run without a GPU ('VAMB_GPU: no' in the mags_generation.yaml configuration file). If you want to run AVAMB with a GPU, you should:
- run dependencies.smk on a server that has GPU(s) and CUDA libraries installed, which will in turn install the GPU version of pytorch,
- change to 'VAMB_GPU: yes' in mags_generation.yaml when running the MAG generation step.
It worked seamlessly in our case, but you may have to play with this a little bit, as driver mismatches are a common issue when getting GPU setups to work.
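Before switching to 'VAMB_GPU: yes', it can be worth checking that a GPU is visible to the driver and that pytorch can see CUDA. A quick check could look like this (run the python check inside whichever conda environment under $MINTO_DIR/conda_env contains pytorch; the environment names there are generated by Snakemake):
$ nvidia-smi
$ python -c 'import torch; print(torch.cuda.is_available())'   # should print True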