A Pegasus workflow for running part of the AlphaFold inference pipeline for protein structure prediction. The current workflow covers the Multiple Sequence Alignment (MSA) and Feature Generation steps, which produce a `features.pkl` file that can later be used in the protein structure inference stage with the AlphaFold model parameters. The workflow is currently limited to the AlphaFold monomer-system model preset.

The workflow is set to run in `sharedfs` mode with input staging disabled and symlinking turned on.
If you plan to run the workflow on ACCESS resources with PSC Bridges as the resource allocation provider, follow the steps below:
- To get started, point your browser to https://access.pegasus.isi.edu and log in with your ACCESS Pegasus credentials. Use the ACCESS Pegasus documentation to configure a basic setup for ACCESS Pegasus workflows. It is recommended that you first run the sample workflows listed in the documentation. They are simple and quick to execute, and running them helps catch setup errors early.
- After the setup is complete and you can run the sample ACCESS Pegasus workflows successfully, configure a new SSH key to be used for file transfers (over the `scp` protocol) in the AlphaFold workflow. Go to the homepage at https://access.pegasus.isi.edu, then open a shell by navigating to Clusters --> Shell Access. Generate a new SSH key as follows:

      $ cd .ssh
      $ ssh-keygen -t rsa

  Make a note of the absolute path to the private SSH key file `id_rsa`, as this path will be used in the AlphaFold workflow. PSC Bridges does not allow new SSH keys to be configured simply by saving the public key in a file from the shell, so the public key has to be submitted through the PSC Bridges key management system. Copy the contents of the file `id_rsa.pub` and log in to the PSC SSH Key Manager with your PSC Bridges login credentials. Click on Submit New Key, paste the public key copied from `id_rsa.pub` into the Paste Key section, and submit it. For more information on PSC SSH key management, see https://www.psc.edu/about-using-ssh/. It takes a couple of hours for the SSH key to be configured on their system.
- The next step is to set up a container to be used in the workflow. Open a shell on PSC Bridges by logging in to PSC Bridges and navigating to Clusters --> Shell Access. Go to your user directory in the RM storage on PSC and clone the alphafold-pegasus repository there:

      $ cd /ocean/projects/<groupname>/<username>/
      $ git clone https://github.com/pegasus-isi/alphafold-pegasus.git
      $ cd alphafold-pegasus

  Then create a Singularity container as follows:

      $ docker build -t local/alphafold_container .
      $ singularity build alphafold_container.sif docker-daemon://local/alphafold_container

  Make a note of the absolute path to the container, as it will be used in the AlphaFold workflow later. More information and notes regarding the container step can be found in the Container section below.
- The next step is to set up a data directory and download all the genetic databases into it as follows:

      $ cd /ocean/projects/<groupname>/<username>/
      $ mkdir data
      $ /ocean/projects/<groupname>/<username>/alphafold-pegasus/data/download_all_data.sh -d /ocean/projects/<groupname>/<username>/data

  This step is time consuming and takes roughly 6 to 8 hours, so make sure that the session does not time out. More information and notes regarding the data download step can be found in the Genetic Databases section below.
- Finally, submit and run the workflow from ACCESS Pegasus. Log in with your credentials at https://access.pegasus.isi.edu, open a shell by navigating to Clusters --> Shell Access, clone this repository, and run the `alphafold_workflow.py` script as follows:

      $ git clone https://github.com/pegasus-isi/alphafold-pegasus.git
      $ cd alphafold-pegasus
      $ python3 alphafold_workflow.py \
          --psc \
          --input-fasta-file=/ocean/projects/<groupname>/<username>/alphafold-pegasus/input/GA98.fasta \
          --uniref90-db-path=/ocean/projects/<groupname>/<username>/data/uniref90 \
          --pdb70-db-dir=/ocean/projects/<groupname>/<username>/data/pdb70 \
          --mgnify-db-path=/ocean/projects/<groupname>/<username>/data/mgnify \
          --bfd-db-path=/ocean/projects/<groupname>/<username>/data/bfd

  The `--psc` option is used because the compute site for this setup is PSC Bridges; it is not required when running on a local machine. Make sure that the paths to the genetic databases are entered correctly. Some workflow statistics are shown below for reference.
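
Since a mistyped database path is an easy mistake to make, the paths can be checked before submitting. The following is a minimal sketch, not part of this repository: the helper name `find_missing_paths` is hypothetical, and the script is assumed to run on a machine where the paths are actually visible (for example, a PSC Bridges shell rather than the ACCESS Pegasus host).

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the option names whose path does not exist on this machine."""
    return [name for name, path in paths.items() if not Path(path).exists()]


if __name__ == "__main__":
    # Placeholder values: replace <groupname> and <username> with your own.
    base = "/ocean/projects/<groupname>/<username>"
    options = {
        "--input-fasta-file": f"{base}/alphafold-pegasus/input/GA98.fasta",
        "--uniref90-db-path": f"{base}/data/uniref90",
        "--pdb70-db-dir": f"{base}/data/pdb70",
        "--mgnify-db-path": f"{base}/data/mgnify",
        "--bfd-db-path": f"{base}/data/bfd",
    }
    for name in find_missing_paths(options):
        print(f"path given for {name} does not exist: {options[name]}")
```

The check only confirms that each path exists; it does not verify that a database was downloaded completely.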
The workflow uses a Singularity container to execute all jobs. It is recommended to build a local container (a `.sif` file) using AlphaFold's provided Dockerfile, which has all the required libraries and tools. This can be done in the following steps:
$ git clone https://github.com/deepmind/alphafold.git
$ cd alphafold
$ docker build -t local/alphafold_container .
$ singularity build alphafold_container.sif docker-daemon://local/alphafold_container
The container comes with the following main tools along with other common libraries :
- alphafold==2.2.0
- hmmer==3.3.2
- hhsuite==3.3.0
- kalign2==2.04
- absl-py==0.13.0
- biopython==1.79
- chex==0.0.7
- dm-haiku==0.0.4
- dm-tree==0.1.6
- immutabledict==2.0.0
📒 Note: If you are running the workflow on ACCESS, it is recommended to build the container on the execute site to reduce execution time. For example, if your execute site is PSC Bridges2, you can `cd` into your project directory and build the container there following the steps above. Then set the complete path to the container in the `alphafold_workflow.py` or `alphafold_workflow_main.ipynb` file.
If your machine has `aria2c` installed, it is recommended to use the database download scripts provided in the AlphaFold repository. Otherwise, the database download scripts provided in this repository (`/data/download_all_data.sh`) rely only on readily available command-line utilities.
The workflow uses the UniRef90, MGnify, BFD and PDB70 databases, which can be downloaded as follows:
$ /data/download_all_data.sh -d <DOWNLOAD_DIRECTORY>
📒 Note: If you are running the workflow on ACCESS, download the data on the execute site to avoid staging the data. For example, if your execute site is PSC Bridges2, you can `cd` into your project directory and download the databases there following the steps above.
📒 Note: By default, the `download_all_data.sh` script downloads the reduced version of the databases (about 600 GB). To download the full version of the databases (about 2.2 TB), pass the `full_dbs` option as follows:
$ /data/download_all_data.sh -d <DOWNLOAD_DIRECTORY> full_dbs
📒 Note: The download directory `<DOWNLOAD_DIRECTORY>` should not be a subdirectory of the AlphaFold repository directory. If it is, the Docker build will be slow because the large databases will be copied during image creation.
The jobs and tools used in the workflow are explained below:
- `sequence_features` – produces the sequence features from the input FASTA file
- `jackhmmer_uniref90` – runs the jackhmmer tool on the UniRef90 database to produce MSAs
- `jackhmmer_mgnify` – runs the jackhmmer tool on the MGnify database to produce MSAs
- `hhblits_bfd` – runs the hhblits tool on the BFD database to produce MSAs
- `hhsearch_pdb70` – runs the hhsearch tool on the PDB70 database to produce search templates
- `msa_features` – turns the MSA results into dicts of features
- `features_summary` – contains a summary of information regarding all MSAs produced
- `combine_features` – combines all MSA features, sequence features and templates into the features file `features.pkl`
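
To visualize how these jobs might fit together, the sketch below encodes one possible dependency graph and derives a valid execution order from it. The edges are assumptions inferred from the job descriptions above (for example, that `hhsearch_pdb70` consumes the UniRef90 MSA), not taken from `alphafold_workflow.py`, so treat it as an illustration rather than the workflow's actual definition.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# job -> jobs it is ASSUMED to depend on (inferred, not read from the repo)
ASSUMED_DEPENDENCIES = {
    "sequence_features": set(),
    "jackhmmer_uniref90": set(),
    "jackhmmer_mgnify": set(),
    "hhblits_bfd": set(),
    "hhsearch_pdb70": {"jackhmmer_uniref90"},
    "msa_features": {"jackhmmer_uniref90", "jackhmmer_mgnify", "hhblits_bfd"},
    "features_summary": {"msa_features"},
    "combine_features": {"sequence_features", "msa_features", "hhsearch_pdb70"},
}


def execution_order(dependencies):
    """Return the jobs in an order where every job follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for job in execution_order(ASSUMED_DEPENDENCIES):
        print(job)
```

Under these assumed edges, the three MSA searches and `sequence_features` have no prerequisites and could run in parallel, while `combine_features` runs last.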
The workflow is set to run on a local HTCondor pool in the default condorio data configuration mode, where each job runs in a Singularity container. To submit a workflow run:
python3 alphafold_workflow.py \
--input-fasta-file=/path/to/input/fasta/file \
--uniref90-db-path=/path/to/uniref90_db \
--pdb70-db-dir=/path/to/pdb70_db \
--mgnify-db-path=/path/to/mgnify_db \
--bfd-db-path=/path/to/bfd_db
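
After a run completes, the resulting `features.pkl` can be inspected with a few lines of Python. This is a minimal sketch that assumes the file holds a pickled dict of features, as the job descriptions suggest. The helper name `summarize_features` is hypothetical. Loading real output may also require the libraries that were used to create it (such as NumPy) to be installed.

```python
import pickle


def summarize_features(pkl_path):
    """Map each feature name to (type name, shape or None if it has none)."""
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with open(pkl_path, "rb") as fh:
        features = pickle.load(fh)
    return {
        name: (type(value).__name__, getattr(value, "shape", None))
        for name, value in features.items()
    }


# Example usage (the path is a placeholder for wherever your run wrote the file):
#   for name, (kind, shape) in summarize_features("features.pkl").items():
#       print(name, kind, shape)
```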
📒 Note: It is recommended to first test the workflow on very small partial versions of the UniRef90, MGnify and BFD databases. Some samples, with 500 sequences in each file, are already included in the `/data/small_data` directory. Small partial databases can be created using the `generate_small_data.py` script in `/data/small_data` as follows:
$ /data/generate_small_data.py <FASTA_FILE> <NO_OF_SEQUENCES> <NEW_FILE_NAME>
For example:
$ /data/generate_small_data.py uniref90.fasta 5000 small_uniref90.fasta
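
For readers curious how such a subset can be produced, here is a minimal sketch that copies the first N records of a FASTA file. It is not the repository's `generate_small_data.py`, which may select sequences differently; the function name `take_first_sequences` is hypothetical.

```python
def take_first_sequences(src_path, dst_path, n_sequences):
    """Copy the first n_sequences FASTA records from src_path to dst_path.

    Returns the number of records written.
    """
    written = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.startswith(">"):  # a header line starts a new record
                if written == n_sequences:
                    break
                written += 1
            if written:  # skip any text that precedes the first header
                dst.write(line)
    return written


# Example usage with the file names from the command above:
#   take_first_sequences("uniref90.fasta", "small_uniref90.fasta", 5000)
```

Reading line by line keeps memory use low, which matters because the full database files are very large.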
📒 Note: Workflow statistics are shown in the `alphafold_workflow_main.ipynb` notebook. This sample run used `GA98` as the input sequence rather than the `T1050` sequence originally used by AlphaFold in CASP14, so workflow execution time may vary depending on the input sequence.
The following table shows the workflow wall time for different setups and database sizes:
Setup | Partial Database (~70GB) | Complete Database (~600GB) |
---|---|---|
Local machine | 18 min, 23 secs | -- |
PSC Bridges 2 | 4 min, 51 secs | 1 hr, 7 mins |
📒 Note: In both cases the workflow is set to run in the `sharedfs` configuration with input staging disabled and symlinking turned on.