SIMSApiper is a Nextflow pipeline that enables users to create structure informed multiple sequence alignments simply from a set of protein sequences. Structural information may be provided by the user or directly retrieved by the pipeline (AlphaFold Database or ESMFold). The process is significantly sped up by using sequence identity-based subsets and aligning them in parallel. Conserved secondary structure elements are used to reduce gaps for a high-quality final alignment.
Read more here or read our publication!
- Nextflow
- Singularity/Apptainer or Docker
- Sufficient amount of scratch space and RAM (300 Sequences of 400 residues with 30% sequence identity need 30GB disk space and 32GB RAM)
- Copy of this repository
git clone https://github.com/Bio2Byte/simsapiper.git
Use directory toy_example
to test installation.
SIMSAPiper will automatically recognize directories called data
if none is specified.
The directory contains:
- Subdirectory
seqs
with fasta-formatted protein sequences - Optional: subdirectory
structures
with 3D protein structure models
Enable recommended settings using --magic
nextflow run simsapiper.nf -profile server,withsingularity --data $PWD/toy_example/data --magic
or use
./magic_align.sh
This file can also be double-clicked to run the toy_example dataset.
Use absolute files paths (/Users/me/workspace/simsapiper/toy_example/data
).
By default most flags are set to False. Adding a flag to the command line will set it to True and activate it. Some flags can carry additional information, such as percentages or filenames. The complete list can be found below.
--magic flag is equivalent to
nextflow run simsapiper.nf
-profile server,withsingularity
--data $PWD/toy_example/data
--seqFormat fasta
--seqQC 5
--dropSimilar 90
--outFolder "simsa_time_of_execution"
--outName "magicMsa"
--minSubsetID "min"
--createSubsets 30
--retrieve
--model
--strucQC 5
--dssp
--squeeze "H,E"
--squeezePerc 80
--reorder
Flag | Function | Default | Recommendation |
---|---|---|---|
-resume | Retry the last run, no rerun of completed jobs -resume [hash] to retry specific run |
||
-profile standard | Local execution Use multiple profiles: -profile server,withconda |
||
-profile server | Linux server execution | ||
-profile hpc | HPC execution using SLURM | ||
-profile withdocker | Dependencies via docker container | ||
-profile withsingularity | Dependencies via apptainer images | ||
-profile withconda | Dependencies via conda (except T-Coffee) | ||
--condaEnvPath | Path to conda environment (if –profile withconda) | false | create with .yml file for ARM-Apple (-profile standard)/ Linux (-profile server) automatically |
--data | Path to data directory | data | |
--structures | Path to structure files directory | --data/structures | |
--seqs | Path to sequence files directory | --data/seqs | |
--seqFormat | Input sequence format according to biopython formats | fasta | |
--seqQC | Ignore sequences with % non-standard amino acids | 5 | |
--dropSimilar | Collapse sequences with % sequence identity | false | 90 |
--favoriteSeqs | Select sequence labels that need to stay in the alignment | false | "SeqLabel1,SeqLabel2" |
--outFolder | Set directory name for output files | results/simsa_time_of_execution | |
--outName | Set final MSA file name | finalmsa | |
--createSubsets | Creates subsets of maximally % sequence identity | false | 30 |
--minSubsetID | Sets minimal % sequence identity for sequences to be in a subset | 20 | "min" to collate small |
CD-Hit clusters | |||
--maxSubsetSize | Sets maximal number of sequences in a subset | true | <400AA: --maxSubsetSize 100, >400AA: --maxSubsetSize 50 |
--useSubsets | User provides multiple sequence files corresponding to subsets Provide sequences not fitting any subset in a file containing 'orphan' in filename |
false | |
--retrieve | Retrieve protein structure models from AFDB | false | |
--model | Predict protein structure models with ESM Atlas | false | |
--strucQC | Maximal % of sequences not matched to a 3D structure | 5 | |
--tcoffeeParams | Additional parameters for Tcoffee | false | "--help" |
--mafftParams | Additional parameters for MAFFT | false | "--localpair --maxiterate 100" |
--dssp | Map DSSP code to alignment | false | |
--squeeze | Squeeze alignment towards conserved 2nd structure categories | false | "H,E" |
--squeezePerc | Set minimal occurence % of anchor element in MSA | 80 | |
--reorder | Order final MSA by input file order | false | |
--convertMSA | Covert final MSA file from fasta to selected file format | false | "clustal" |
--magic | Launch a run with recommended settings for all parameters | false |
- Nextflow
- Java V11
- Python3
- Singularity/Apptainer (-profile withsingularity)
- Set common singularity image directory in nextflow.config
- Docker (-profile withdocker)
- Start Docker before launching the pipeline
- Conda (-profile withconda)
- Select Linux or ARM-Apple .yml file in nextflow.config
- Install T-Coffee, SAP and TMalign manually
- Local execution (-profile standard) Install - CD-Hit
- Creates log file
*.nflog
with working directories for all subjobs, error messages, and execution hash for resuming - Add more flags to standardize runs, alternatively create profiles in
nextflow.config
file
#!/bin/bash
#Adapt this line:
house=full/path/to/your/data/directory
#Adapt this line:
data=data_folder_containing_seqs_and_structures
now=`date +"%Y_%m_%d_%H_%M_%S"`
#Adapt this line:
output_name=${data}_${now}_description
output_folder=$house/results/$output_name
mkdir -p $house/results/
mkdir -p $house/results/$output
#Adapt here:
/path/to/software/nextflow run /path/to/simsapiper.nf \
-profile select_infrastructure,select_container \
--data $house/$data \
--any_flag value_followed_by_\
--magic \
--outName magicMsa \
--outFolder $output_folder \
|& tee $output_folder/run_report_$output_name.nflog
sessionName=$(sed -n '2s/.*\[\(.*\)\].*/\1/p' $output_folder/run_report_$output_name.nflog)
#Adapt this line:
/path/to/software/nextflow log | grep $sessionName >> $output_folder/run_report_$output_name.nflog
#remove all comment lines before launching
- Holds all variables needed to adapt SIMSApiper to your system
- Local, server and HPC-SLURM execution profiles are prepared
- Edit to change standard parameters (as listed under available flags)
- Learn how to make Nextflow profiles here
Input: data/seqs/
- Sequence file(s):
- One file with all sequences
- Multiple sequence files (can be subsets for T-Coffee with --use_subsets)
- Sequence format: If not FASTA formatted, provide file format with --seqFormat according to biopython nomenclature
- Sequence IDs: If --retrieve: use Uniprot ID as sequence ID to enable retrieval from AlphaFold Database
Notes: No spaces in the filename(s) allowed, use '_' for example instead
Input: data/structures/
- Structural input is optional
- Must be in .pdb format
- Sequence labels and the structure filenames must match exactly!
>P25106 will only match P25106.pdb - Can be experimentally generated structures, from the PDB or modeled structures
Good to know:
- Multiple chains are not allowed by T-Coffee. Extract the chain of interest using: pdb-tools (also make sure you extract your chain of interest in the SEQRES section)
- Mutations in 3D structure will not impact MSA quality if mutations do not impact the overall organisation of the protein (pdb_min_sim parameter of T-Coffee set to 1 by default)
- The sequence you want to align can be a section of your chain of interest (pdb_min_cov parameter of T-Coffee set to 1 by default) BUT if your section is not at the beginning or end of your chain of interest, the step squeeze will fail
- Step squeeze will fail if more than 5% of your protein structure is mutated.
- Omit these issues by computing your own AlphaFold2 model using ColabFold (very user friendly) or directly the AlphaFold2 software instead of relying on ESMFold if:
- Proteins are very different from proteins previously solved in the PDB
- Proteins are larger than 400 residues
- Modeled structure information has been shown to improve the quality of MSAs
- Nextflow creates directory
results
in Nextflow directory - Every run creates a unique subdirectory
- Set name using --outFolder
- Select name for the structure informed MSA using --outName
Output: results/outFolder/seqs/
- Parse sequence files using biopython to provide sequences in fasta-2-line format
Output: results/outFolder/seqs/CD-HIT/
Why?
- Calculation time scales exponentially with number of input sequences
- Recommendations:
- 90% to remove redundant sequences
- 70% to have a more general overview of the protein family and address sampling bias
How?
- Use CD-Hit with high sequence identify cutoff
- Keep cluster representatives
- Proteins inside clusters will be excluded from the downstream processes
- Add important proteins to the representatives with --favoriteSeqs
- Find in which cluster a specific protein is in
results/outFolder/seqs/CD-HIT/*.clstr
- Find excluded similar sequences in
results/outFolder/seqs/CD-HIT/removed_*_perc_similar.fasta
Output: results/outFolder/seqs/too_many_unknown_characters.fasta
Why?
- Too many non-standard amino acids can result in a poor alignment
How?
- Remove all sequences containing more than --seqQC % non-standard amino acids
- Match sequence ID and structure model filenames
Common Issues:
- Models not matched to a sequence present in input file can be caused by spelling or exclusion by --dropSimilar or --seqQC
Output: data/structures/uniprot_id.pdb
Why?
- AlphaFold2 models are considered the most accurate protein structure predictions
How?
- Only possible if sequence ID starts with the Uniprot ID
>UniprotID_more_information
Common Issues:
- Entries in Uniprot that are not on the AlphaFold Database: generate model yourself and add manually to
data/structures/
(same as step 2.1)
Output: data/structures/sequence_id.pdb
Why?
- Very fast prediction of novel protein structure models from sequence
- No need to know the Uniprot ID
How?
- Submit sequence to ESMFold API for modelling, retrieve model on completion
Common Issues:
- Sequences are longer then 400 residues
- Sequence contains any characters not representing an amino acid
- API rate limit causes jobs to randomly fail: rerun SIMSApiper a few times to ensure maximal amount of models retrieved
(same as step 2.1)
Output: results/outFolder/seqs/seqs_to_align.fasta
, results/outFolder/seqs/structureless_seqs.fasta
Why?
- Alignment quality decreases if too many sequences are not matched to a model
How?
- Job fails if more than --strucQC % of the sequences are structureless before starting the alignment step
- Sequences matched to a model:
seqs_to_align.fasta
- Sequences not matched to a model:
structureless_seqs.fasta
Why?
- Calculation time and space requirements of T-Coffee alignments scale exponentially with number of sequences
How?
- Sequences shorter than 400 residues: maximally 100 sequences per subset
- Sequences longer than 400 residues: maximally 50 sequences per subset
- Sequences that do not fit a subset: label "orphan", not aligned with T-Coffee but with MAFFT based on sequence
Input: multiple sequence files in data/seqs/
Output: results/outFolder/seqs/CD-HIT_for_t-coffee/*.fasta
- Use phylogenetic relationships or sequence identity to manually create subsets
- Sequences that do not fit in a subset can be collected in one file with “orphan” in filename
- Input for T-Coffee will be filtered for sequences matched to a protein structure model
Output: results/outFolder/seqs/CD-HIT_for_tcoffee/subset*.fasta
- Automatically cluster sequences with PSI-CD-HIT --createSubsets % sequence identity into a subset
- Set appropriate maximum number of members per cluster based on sequence length --maxSubsetSize
- Singleton clusters will be labeled with “orphan“
Common Issues:
- Subsets are too large:
- Increase initial clustering threshold --createSubsets
- Clusters get split randomly to attain evenly distributed clusters smaller then --maxSubsetSize
- More then 5% of the clusters are singletons:
- Threshold will interactively decrease by 5% until --minSubsetIdentity
- Set --minSubsetIdentity "min" to reduce overall number of subsets by collating small clusters and singletons until --maxSubsetSize reached
Output: results/outFolder/msas/t-coffee/aligned_subset_*.aln
Why?
- Shown to be the state-of-the-art structure-informed alignment method
- Combination of algorithms TMalign and SAP have shown to provide the best results
How?
- Each subset is aligned individually by T-Coffee
- This is the step that takes the longest (likely hours) but can run in parallel depending on available resources
- In -profile hpc, each subset is submitted up to 3 times with dynamically increasing resource requests to minimize queue time
- Subsets labelled “orphan” will not be aligned with T-Coffee but added to the structure informed MSA with MAFFT based on their sequence
- Temporary files will be created in
work/.t_coffee/tmp/tco/
- T-Coffee parameters in this pipeline:
t_coffee -in=subset_0.fasta -method TMalign_pair
-evaluate_mode=t_coffee_slow -mode=3dcoffee -pdb_min_sim=1 -pdb_min_cov=1 -thread 0
-outfile="output/t-coffee/subset_0_aligned.aln"
- Mode 3DCoffee uses method sap_pair by default
- Append any other T-Coffee flag with --tcoffeeParams “-quiet”
Output: results/outFolder/msas/merged_finalmsa_alignment.fasta
Why?
- Combine T-Coffee subalignments
- Add orphan and/or structureless sequences
How?
- Mafft merge mode is conserving the subalignment structure, mainly adds gaps
- Mafft parameters in this pipeline:
mafft --merge tableMAFFT.txt input > prefinalMSA.fasta
- Append any other mafft flag or alignment modes with --mafftParams “--localpair --maxiterate 100”
- Find appropriate flags here
Output: results/data/dssp/model_name.dssp
- Translate structure models into 2D secondary structure nomenclature using the DSSP codes
Output: results/outFolder/msas/dssp_merged_finalmsa_alignment.fasta
- Map DSSP codes on sequences of the MSA, conserving the gaps
- Sequence of the model and the sequence to align have:
- Same length but more than 5% different (point mutations)
- Different lengths (deletions/insertions) in the middle of one of the sequences (allowed at extremeties)
- Your structure file contains more than one chain: extract a specific chain using pdb-tools or use predicted models
- Find these troublesome sequences in
results/outFolder/msas/unmappable_dssp.fasta
and unconverted in the mapped alignment
Output: results/outFolder/msas/squeezed_dssp_merged_finalmsa_alignment.fasta
Why?
- MSA of unstructured or less conserved regions (e.g. loops) are usually very disperse
How?
- Identify conserved secondary structure categories such as helices and β-sheets with --squeeze “H,E”, and squeeze MSA towards these regions
- DSSP codes representing helices, i.e, H, G and I, are considered the same by SIMSApiper
- Select threshold for region to be considered 'conserved' with --squeezePerc
- Must have at least 3 consecutive conserved columns to be considered a conserved region
Output: results/outFolder/msas/dssp_squeezed_dssp_merged_finalmsa_alignment.fasta
- Map DSSP codes on sequences of the squeezed MSA, conserving the gaps.
Common Issues: see above
Output: results/outFolder/reordered_*_merged_finalmsa_alignment.fasta
- Order MSA according to the order of sequences in the input files. If more then one sequence input file is provided, order MSQ based on the order given in the files organized alphabetically. Instead of alphabetically, can also select explicitly how to organize the files with --reorder "gamma.fasta,delta.fasta”
Output: results/outFolder/seqs/converted_*_merged_finalmsa_alignment.fasta
- Convert MSA using biopython from fasta-2-line format
Output: results/outFolder/sequence_report.text
- Contains number of sequences
- in the input files
- excluded because invalid
- matched to a protein structure model
- not matched to a protein structure model
- in the final aligned file
Output: results/outFolder/nextflow_report_outName.html
, resources_outName.txt
- Log file created by Nextflow pipeline
- Shows execution times and resource usage per job
Output: results/outFolder/run_id_time.nflog
- Created ONLY when using the launch file
- Reports all --flag settings
- Captures terminal output of Nextflow pipeline for error tracing
- Captures unique execution hash and flags for resuming the specific job
- Sequence labels and the structure filenames must match exactly!
- T-Coffee only accepts sequence labels that are less than 20 characters long. Make sure to respect this or T-Coffee will fail.
- When running the pipeline with experimentally resolved structures: T-Coffee can handle partially resolved structures BUT, the SEQRES and ATOMS records must be only of the chain of interest (i.e. the sequence you want to align). You can use tools such as pdb-tools to fix your pdb files.
- Cancel a running Nextflow job: Crtl + C
- Pipeline failed to complete:
- to rerun the last job: append -resume to the launch command
- to rerun a specific job: check the
*.nflog
files last line to get the unique hashnextflow run simsapiper.nf -resume 9ae6b81a-47ba-4a37-a746-cdb3500bee0f
- All intermediate results are unique subdirectory of the directory
work
Find directory hash for each step in*.nflog
- Run in the background: launch SIMSApiper in a screen
Hit Crtl + A and Crtl + D to put it in the background
screen -S nextflowalign bash -c ./magic_align.sh
- Launch file does not work:
- try:
and rerun.
chmod +x magic_align.sh
- check for spaces behind
\
in the launch file, there can not be any. - On MacOS, you may have to replace
|& tee
with>>
- try:
- Modeling with ESMFold has low yield:
- Sequences longer than 400 residues cannot be modeled: try ColabFold to generate your own models
- ESM Atlas was asked to model too many sequences at once, resume the job
- SIMSApiper crashes at CD-Hit stage:
- try:
and rerun.
chmod +x bin/psi-cd-hit.pl chmod +x bin/psi-cd-hit-local.pl
- try:
- SIMSApiper crashes at T-Coffee stage:
- Error contains "sap_pair error" and -model true: possibly ESMFold prediction error
- Remove the ESMFold model of protein in error message from
data/structures
- Set --model false to add protein to the MSA based on sequence
- --retrieve AF model of the protein from AFDB
- Generate model with ColabFold and add manually
- Remove protein from the data/seqs/*.fasta file to removed from the dataset entirely
- Remove the ESMFold model of protein in error message from
- Error contains "proba_pair": T-Coffee could not find the structures and uses a sequence based alignemnt method
- Use complete path to data/structure directory
- Check protein model files
- Error contains "sap_pair error" and -model true: possibly ESMFold prediction error
- Learn how to adapt this pipeline using Nextflow here
@article{crauwels_large-scale_2024,
title = {Large-scale Structure-Informed multiple sequence alignment of proteins with {SIMSApiper}},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btae276},
pages = {btae276},
journaltitle = {Bioinformatics},
author = {Crauwels, Charlotte and Heidig, Sophie-Luise and Díaz, Adrián and Vranken, Wim F},
date = {2024-04-22}
}