Welcome to the repository of NextFlow pipelines created by the Bio2Byte group!
Proteins are the molecular machines that make cells work. They perform a wide variety of functions through interactions with each other and many additional molecules. Traditionally, proteins are described in a single static state (a picture). It is now increasingly recognized that many proteins can adopt multiple states and move between these conformational states dynamically (a movie).
We investigate how the dynamics, conformational states and available experimental data of proteins relate to their amino acid sequence. The underlying physical and chemical principles are computationally unravelled through data integration, analysis and machine learning, connecting them to biological events and improving our understanding of the way proteins work.
The Bio2Byte group is primarily situated at the Interuniversity Institute of Bioinformatics in Brussels, a collaborative inter-faculty institute between the "Vrije Universiteit Brussel" (VUB) and the "Université Libre de Bruxelles" (ULB). It is located at the ULB side of the "Pleinlaan/La Plaine" campus on the 6th floor of the C building.
At the VUB, the group is linked to Structural Biology Brussels at the Bioengineering Sciences department, to the departments of Computer Science and Chemistry at the Faculty of Sciences, and to Biomedical Sciences at the Faculty of Medicine and Pharmacy.
The b2btools are a set of in-house developed biophysical predictors that enable the exploration of the 'biophysical variation' of a protein along its sequence. The predictions reflect 'emerging' properties, that is, what the sequence is capable of, not necessarily what it will do in a particular context, for example when it adopts a specific fold. Studying the biophysical properties of a protein is relevant because these properties, like the dynamics of a protein, are conserved by evolution in order to preserve the protein's function. The predictors cover:
- Backbone and sidechain dynamics (DynaMine)
- Conformational propensities (sheet, helix, coil, polyproline II) (DynaMine)
- Early folding propensities (EFoldMine)
- Disorder propensities (DisoMine)
- Beta aggregation propensity (AgMata)
- Phase separation propensity (PSPer)
NextFlow is a reactive workflow framework and a programming DSL that eases writing computational pipelines with complex data.
NextFlow must be available on the system where the pipeline is going to be executed. For local environments, please follow the instructions provided by the official docs: NextFlow - Get started.
If you are working on an HPC environment where software dependencies are handled as modules, for instance the VSC clusters, NextFlow is available by executing:
$ module load Nextflow/22.04.0
To check that everything is ready:
$ nextflow -v
nextflow version 22.04.0.5697
The basic command to run our NextFlow pipelines is:
$ nextflow run Bio2Byte/nf-pipelines -r main -main-script /path/to/script.nf
Because NextFlow can run pipelines directly from GitHub, the ones published here are available by executing "nextflow run Bio2Byte/nf-pipelines" with the flag "-r main", which tells NextFlow to fetch the main branch. The last flag, "-main-script", selects the pipeline to run.
NextFlow is a powerful pipeline framework that runs on different systems such as your local workspace or HPC clusters. To target these different NextFlow executors, we defined profiles inside the configuration files:
- For local development: "-profile standard,..."
- For Slurm-based HPC clusters (such as VUB's Hydra clusters): "-profile hydra,..."
In addition, NextFlow can run processes using either Docker or Singularity images. For instance, VUB-HPC prefers Singularity to Docker when running containerized software. Thanks to this feature, you do not have to install any dependency to run our code. Each step of the pipeline (the "process" blocks) has a Docker image defined inside the configuration file. If Singularity is required, NextFlow automatically converts the Docker images to Singularity images.
- Using Docker: "-profile withdocker,..."
- Using Singularity: "-profile withsingularity,..."
If you are working on your local environment, combine the "standard" and "withdocker" profiles:
$ nextflow run Bio2Byte/nf-pipelines \
-r main \
-main-script /path/to/script.nf \
-profile standard,withdocker \
[pipeline parameters]
If you are working on HPC clusters, combine the "hydra" and "withsingularity" profiles:
$ nextflow run Bio2Byte/nf-pipelines \
-r main \
-main-script /path/to/script.nf \
-profile hydra,withsingularity \
[pipeline parameters]
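The profile combinations above can be captured in a small helper. The following is a minimal Python sketch and is not part of this repository: the function name `build_command` and the environment labels "local" and "hpc" are hypothetical, and the sketch only assembles the flags documented in this README without running anything.

```python
import shlex

# Profile pairs described above: (executor profile, container profile).
PROFILES = {
    "local": ("standard", "withdocker"),
    "hpc": ("hydra", "withsingularity"),
}


def build_command(script, environment, pipeline_args=()):
    """Return the `nextflow run` argument list for one of the pipelines.

    `script` is the value passed to -main-script; `environment` is
    "local" or "hpc" (hypothetical labels used only in this sketch).
    """
    if environment not in PROFILES:
        raise ValueError(f"unknown environment: {environment!r}")
    return [
        "nextflow", "run", "Bio2Byte/nf-pipelines",
        "-r", "main",
        "-main-script", script,
        "-profile", ",".join(PROFILES[environment]),
        *pipeline_args,
    ]


if __name__ == "__main__":
    cmd = build_command(
        "b2btools/fromSingleSequences.nf",
        "hpc",
        ["--targetSequences", "./example.fasta", "--dynamine"],
    )
    # Print a copy-pasteable shell command; nothing is executed here.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` if you later decide to launch the pipeline from Python.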
This pipeline provides biophysical predictions for protein sequences in FASTA format.
- Pipeline accession:
nextflow run Bio2Byte/nf-pipelines -r main -main-script b2btools/fromSingleSequences.nf
- Input:
  - At least one valid protein sequence file in FASTA format.
- Output:
  - Compressed file in format ".tar.gz" that includes:
    - Biophysical predictions in paginated JSON files (depending on the "--sequencesPerJson" parameter)
    - Ignored sequences in FASTA format of proteins with fewer than 5 or more than 2000 residues.
    - (optional) Structures in PDB format from ESM Atlas API (only for the first 400 residues of each sequence)
    - (optional) MatPlotLib plots in PNG format:
      - "<Sequence's name>.png" contains plots of:
        - Backbone and sidechain dynamics (DynaMine)
        - Conformational propensities (sheet, helix, coil, polyproline II) (DynaMine)
        - Early folding propensities (EFoldMine)
        - Disorder propensities (DisoMine)
        - Beta aggregation propensity (AgMata)
      - "<Sequence's name>_psp.png" contains plots of:
        - Phase separation propensity (PSPer)
- Input/Output:
  - Sequences file in FASTA format: "--targetSequences"
    - Description: The file must contain at least one protein sequence.
  - Sequences per results file in JSON format: "--sequencesPerJson"
    - Description: Biophysical predictions are saved in JSON format. By default each file contains up to 10 sequences.
- Predictions:
  - DynaMine - Backbone dynamics: "--dynamine"
  - EFoldMine - Early folding regions: "--efoldmine"
  - DisoMine - Disordered regions: "--disomine"
  - AgMata - Regions prone to beta-aggregation: "--agmata"
  - PSPer - Phase Separating Protein: "--psper"
  - ESM Atlas structures: "--fetchStructures"
    - Description: Resource provided by the "ESM Metagenomic Atlas". Fetching the folding of a sequence runs ESMFold ("esm.pretrained.esmfold_v1"), which provides fast and accurate atomic level structure prediction directly from the individual sequence of a protein.
- Visualizations:
  - Biophysical features charts: "--plotBiophysicalFeatures"
    - Description: Include MatPlotLib plots in PNG format.
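To illustrate how "--sequencesPerJson" splits the results into pages, here is a minimal Python sketch. It is an assumption about the behaviour, based only on the description above (up to 10 sequences per file by default); the function name `paginate` is hypothetical and the pipeline's own implementation may differ.

```python
def paginate(sequence_ids, sequences_per_json=10):
    """Split sequence identifiers into consecutive pages.

    Every page holds at most `sequences_per_json` identifiers; only the
    last page may be shorter.
    """
    if sequences_per_json < 1:
        raise ValueError("sequences_per_json must be at least 1")
    return [
        sequence_ids[start:start + sequences_per_json]
        for start in range(0, len(sequence_ids), sequences_per_json)
    ]


if __name__ == "__main__":
    ids = [f"SEQ_{i}" for i in range(1, 26)]  # 25 made-up identifiers
    for page_number, page in enumerate(paginate(ids), start=1):
        print(page_number, len(page))
```

With 25 sequences and the default page size, this yields three pages holding 10, 10 and 5 sequences.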
For these sequences in FASTA format:
>SEQ_BAD
MA
>SEQ_ADR
MAAAALVLVLLAVLAMAAAALVLVLLAVLAMAAAALVLVLLAVLAMAAAALVLVLLAVLA
>SEQ_SOP
MALALSALALLALALAMAAAALVLVLLAVLAMAAAALVLVLLAVLAMAAAALVLVLLAVLAMAAAALVLVLLAVLAMAAAALVLVLLAVLA
The NextFlow command line example is:
$ nextflow run Bio2Byte/nf-pipelines \
-r main \
-main-script b2btools/fromSingleSequences.nf \
-profile standard,withdocker \
--targetSequences ./example.fasta \
--dynamine \
--efoldmine \
--disomine \
--agmata \
--psper \
--fetchStructures \
--plotBiophysicalFeatures
This pipeline will generate a compressed file containing:
example_2023_02_08_11_13_22.tar.gz
├── SEQ_ADR.pdb
├── SEQ_SOP.pdb
├── b2b_results_example.1.index
├── b2b_results_example.1.json
├── b2b_results_example.1_SEQ_ADR.png
├── b2b_results_example.1_SEQ_ADR_psp.png
├── b2b_results_example.1_SEQ_SOP.png
├── b2b_results_example.1_SEQ_SOP_psp.png
└── example_sequences_ignored.fasta
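In this example SEQ_BAD has only 2 residues, which is consistent with it being absent from the predicted structures and with the presence of an ignored-sequences file. The length rule stated in the output description (fewer than 5 or more than 2000 residues) can be sketched in Python as follows; the parser and the names `read_fasta` and `split_by_length` are hypothetical, written only for illustration and not taken from the pipeline.

```python
MIN_RESIDUES = 5      # shorter sequences are ignored
MAX_RESIDUES = 2000   # longer sequences are ignored


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_by_length(records):
    """Return (kept, ignored) records according to the length limits."""
    kept, ignored = [], []
    for header, sequence in records:
        if MIN_RESIDUES <= len(sequence) <= MAX_RESIDUES:
            kept.append((header, sequence))
        else:
            ignored.append((header, sequence))
    return kept, ignored


if __name__ == "__main__":
    # SEQ_OK is a made-up record used only for this demonstration.
    example = ">SEQ_BAD\nMA\n>SEQ_OK\nMAAAALVLVLLAVLA\n"
    kept, ignored = split_by_length(read_fasta(example))
    print([h for h, _ in kept], [h for h, _ in ignored])
```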
This pipeline predicts the biophysical features of an MSA, computes the biophysical and sequence conservation of the MSA, and provides a 2D visualization of these values.
- Pipeline accession:
nextflow run Bio2Byte/nf-pipelines -r main -main-script b2btools/fromMSA.nf
- Input:
  - Either:
    - A Multiple Sequence Alignment file in FASTA format
    - At least 3 valid protein sequences in a single file in FASTA format
- Output:
  - Compressed file in format ".tar.gz" that includes:
    - MSA file in FASTA format.
    - Biophysical predictions in JSON format
    - Ignored sequences in FASTA format of proteins with fewer than 5 or more than 2000 residues.
    - (optional) Structures in PDB format from ESM Atlas API (only for the first 400 residues of each sequence)
    - (optional) MatPlotLib plots in both PNG and PDF formats
    - (optional) WebLogo representation of the MSA
    - (optional) Phylogenetic tree in text format
    - (optional) Phylogenetic tree rendered in SVG image format
- Input/Output:
  - Sequences file in FASTA format: "--targetSequences"
    - Description: The file must contain at least one protein sequence.
  - Should pipeline align sequences? "--alignSequences"
    - Description: If your input file is a set of unaligned sequences, this flag makes the pipeline build an MSA file using Clustal Omega before starting the regular workflow. Clustal Omega is a new multiple sequence alignment program that uses seeded guide trees and HMM profile-profile techniques to generate alignments between three or more sequences.
- Predictions:
  - EFoldMine - Early folding regions: "--efoldmine"
  - DisoMine - Disordered regions: "--disomine"
  - ESM Atlas structures: "--fetchStructures"
    - Description: Resource provided by the "ESM Metagenomic Atlas". Fetching the folding of a sequence runs ESMFold ("esm.pretrained.esmfold_v1"), which provides fast and accurate atomic level structure prediction directly from the individual sequence of a protein.
  - MSA Logo: "--buildLogo"
    - Description: Create a Logo representation using WebLogo. Sequence logos are a graphical representation of an amino acid or nucleic acid multiple sequence alignment developed by Tom Schneider and Mike Stephens.
  - Phylogenetic tree: "--buildTree"
    - Description: FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of protein sequences.
  - Phylogenetic tree image in SVG format: "--plotTree"
    - Description: Build an SVG image of the phylogenetic tree. It requires the flag "--buildTree".
- Visualizations:
  - Biophysical features charts: "--plotBiophysicalFeatures"
    - Description: Include MatPlotLib plots in PNG format.
Because this pipeline supports two types of input files, the following sections describe both scenarios:
- Running the pipeline for a MSA input file in FASTA format
- Running the pipeline for a set of sequences in FASTA format
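A quick way to decide between the two scenarios is to check whether all sequences in the file already have the same length (gap characters included), which is a necessary condition for a FASTA file to be an alignment. The Python sketch below is illustrative only: `sequence_lengths` and `looks_aligned` are hypothetical helpers and are not part of the pipeline, which relies on the "--alignSequences" flag instead.

```python
def sequence_lengths(fasta_text):
    """Return the length of every sequence in FASTA text, in file order."""
    lengths = []
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A header starts a new record; store the previous length.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def looks_aligned(fasta_text):
    """True if there are several sequences and all share one length.

    Equal lengths are necessary for an MSA but do not prove that the
    sequences were actually aligned.
    """
    lengths = sequence_lengths(fasta_text)
    return len(lengths) > 1 and len(set(lengths)) == 1


if __name__ == "__main__":
    unaligned = ">a\nRGGMSIQGTFVR\n>b\nLKLPSFD\n>c\nIIKYGAV\n"
    aligned = ">a\nMK-LV\n>b\nMKALV\n>c\nM--LV\n"
    print(looks_aligned(unaligned), looks_aligned(aligned))
```

If the check returns False, the file most likely holds unaligned sequences and the second scenario, with "--alignSequences", applies.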
For this MSA in FASTA format:
>1ymg_A THE CHANNEL ARCHITE
SASFWRAICAEFFASLFYVFFGLGASLRW-----AG------P---------lHVLQVAL
AFGLALATLVQAVGHISGAHVNPAVTFAFLVGSQMSLLRAICYMVAQLLGAVAGAAVLYS
VT--PPAvRGNlALNTLHPGVSVGQATIVEIFLTLQFVLCIFATYDERRNGRLGSVALAV
GFSLTLGHLFGMYYTGAGMNPARSFAPAILTR------NFTNHWVYWVGPVIGAGLGSLL
YDFLLFPRLKSVSERLSILKG
>2d57_A DOUBLE LAYERED 2D C
TQAFWKAVTAEFLAMLIFVLLSVGSTINW-----GG-SENPLP---------VDMVLISL
CFGLSIATMVQCFGHISGGHINPAVTVAMVCTRKISIAKSVFYITAQCLGAIIGAGILYL
VT--PPSVVGGLGVTTVHGNLTAGHGLLVELIITFQLVFTIFASCDSKRTDVTGSVALAI
GFSVAIGHLFAINYTGASMNPARSFGPAVIMG------NWENHWIYwVGPIIGAVLAGAL
YEYVF--------------CP
>2f2b_A CRYSTAL STRUCTURE O
MVSLTKRCIAEFIGTFILVFFGAGSAAVTLMIASGGTSPNPFNIGIGLLGGLGDWVAIGL
AFGFAIAASIYALGNISGCHINPAVTIGLWSVKKFPGREVVPYIIAQLLGAAFGSFIFLQ
CAGIGAATVGGLGATAPFPGISYWQAMLAEVVGTFLLMITIMGIAvDERAP-KGFAGIII
GLTVAGIITTLGNISGSSLNPARTFGPYLNDMifagtDlWNYYSIYvIGPIVGAVLAALT
YQYL---------------TS
The NextFlow command line example is:
$ nextflow run Bio2Byte/nf-pipelines \
-r main \
-main-script b2btools/fromMSA.nf \
-profile standard,withdocker \
--targetSequences ./simple_alignment.fasta \
--plotBiophysicalFeatures \
--buildLogo \
--buildTree \
--plotTree \
--efoldmine \
--disomine \
--fetchStructures
This pipeline will generate a compressed file containing:
simple_alignment_2023_02_08_12_57_28.tar.gz
├── 1ymg_A__THE_CHANNEL_ARCHITE.pdb
├── 2d57_A__DOUBLE_LAYERED_2D_C.pdb
├── 2f2b_A__CRYSTAL_STRUCTURE_O.pdb
├── b2b_msa_results_simple_alignment_filtered.fasta.json
├── simple_alignment_filtered.fasta.msa
├── simple_alignment_filtered.fasta.msa.tree
├── simple_alignment_filtered.fasta.msa.tree.svg
├── simple_alignment_filtered.fasta_1ymg_A__THE_CHANNEL_ARCHITE_msa_biophysical_conservation.pdf
├── simple_alignment_filtered.fasta_1ymg_A__THE_CHANNEL_ARCHITE_msa_biophysical_conservation.png
├── simple_alignment_filtered.fasta_2d57_A__DOUBLE_LAYERED_2D_C_msa_biophysical_conservation.pdf
├── simple_alignment_filtered.fasta_2d57_A__DOUBLE_LAYERED_2D_C_msa_biophysical_conservation.png
├── simple_alignment_filtered.fasta_2f2b_A__CRYSTAL_STRUCTURE_O_msa_biophysical_conservation.pdf
├── simple_alignment_filtered.fasta_2f2b_A__CRYSTAL_STRUCTURE_O_msa_biophysical_conservation.png
└── simple_alignment_filtered_logo.png
For these sequences in FASTA format:
>random_sequence_1 consisting of 25 residues.
RGGMSIQGTFVR
>random_sequence_2 consisting of 25 residues.
CE
>random_sequence_3 consisting of 25 residues.
LKLPSFD
>random_sequence_4 consisting of 25 residues.
PKMCQMTDHKEYQGSALGSGS
>random_sequence_5 consisting of 25 residues.
REDTWATASAACLITFNVSPDCMQV
>random_sequence_6 consisting of 25 residues.
IHYTTTPDVICLW
>random_sequence_7 consisting of 25 residues.
AAMQLCPQAYMKQPWTNVMIQE
>random_sequence_8 consisting of 25 residues.
FSSFHMHTMHLLSPLNTD
>random_sequence_9 consisting of 25 residues.
IIKYGAV
>random_sequence_10 consisting of 25 residues.
FPDHDDCGCLWFYQATWATKCLKEL
The NextFlow command line example is:
$ nextflow run Bio2Byte/nf-pipelines \
-r main \
-main-script b2btools/fromMSA.nf \
-profile standard,withdocker \
--targetSequences ./10x25.fasta \
--plotBiophysicalFeatures \
--alignSequences \
--buildLogo \
--buildTree \
--plotTree \
--efoldmine \
--disomine \
--fetchStructures
This pipeline will generate a compressed file containing:
10x25_2023_02_08_13_20_44.tar.gz
├── 10x25_filtered.fasta.msa
├── 10x25_filtered.fasta.msa.tree
├── 10x25_filtered.fasta.msa.tree.svg
├── 10x25_filtered.fasta_random_sequence_10_consisting_of_25_residues__msa_biophysical_conservation.pdf
├── 10x25_filtered.fasta_random_sequence_10_consisting_of_25_residues__msa_biophysical_conservation.png
├── 10x25_filtered.fasta_random_sequence_1_consisting_of_25_residues__msa_biophysical_conservation.pdf
├── 10x25_filtered.fasta_random_sequence_1_consisting_of_25_residues__msa_biophysical_conservation.png
├── 10x25_filtered.fasta_random_sequence_3_consisting_of_25_residues__msa_biophysical_conservation.pdf
├── 10x25_filtered.fasta_random_sequence_3_consisting_of_25_residues__msa_biophysical_conservation.png
├── 10x25_filtered.fasta_random_sequence_4_consisting_of_25_residues__msa_biophysical_conservation.pdf
├── 10x25_filtered.fasta_random_sequence_4_consisting_of_25_residues__msa_biophysical_conservation.png
├── 10x25_filtered.fasta_random_sequence_5_consisting_of_25_residues__msa_biophysical_conservation.pdf
├── 10x25_filtered.fasta_random_sequence_5_consisting_of_25_residues__msa_biophysical_conservation.png
├── 10x25_filtered.fasta_random_sequence_6_consisting_of_25_residues__msa_biophysical_conservation.pdf
├── 10x25_filtered.fasta_random_sequence_6_consisting_of_25_residues__msa_biophysical_conservation.png
├── 10x25_filtered.fasta_random_sequence_7_consisting_of_25_residues__msa_biophysical_conservation.pdf
├── 10x25_filtered.fasta_random_sequence_7_consisting_of_25_residues__msa_biophysical_conservation.png
├── 10x25_filtered.fasta_random_sequence_8_consisting_of_25_residues__msa_biophysical_conservation.pdf
├── 10x25_filtered.fasta_random_sequence_8_consisting_of_25_residues__msa_biophysical_conservation.png
├── 10x25_filtered.fasta_random_sequence_9_consisting_of_25_residues__msa_biophysical_conservation.pdf
├── 10x25_filtered.fasta_random_sequence_9_consisting_of_25_residues__msa_biophysical_conservation.png
├── 10x25_filtered_logo.png
├── 10x25_sequences_ignored.fasta
├── b2b_msa_results_10x25_filtered.fasta.json
├── random_sequence_10_consisting_of_25_residues_.pdb
├── random_sequence_1_consisting_of_25_residues_.pdb
├── random_sequence_3_consisting_of_25_residues_.pdb
├── random_sequence_4_consisting_of_25_residues_.pdb
├── random_sequence_5_consisting_of_25_residues_.pdb
├── random_sequence_6_consisting_of_25_residues_.pdb
├── random_sequence_7_consisting_of_25_residues_.pdb
├── random_sequence_8_consisting_of_25_residues_.pdb
└── random_sequence_9_consisting_of_25_residues_.pdb
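The resulting archive can be inspected without unpacking it by hand. As a minimal sketch using only Python's standard `tarfile` module, the helper below groups the files inside a ".tar.gz" archive by extension; the function name `members_by_extension` and the script name in the usage comment are hypothetical, and the archive name is the example shown above.

```python
import sys
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def members_by_extension(archive_path):
    """Group the regular files inside a .tar.gz archive by file suffix."""
    groups = defaultdict(list)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile():
                suffix = PurePosixPath(member.name).suffix or "(none)"
                groups[suffix].append(member.name)
    return {suffix: sorted(names) for suffix, names in groups.items()}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python list_results.py 10x25_2023_02_08_13_20_44.tar.gz
    for suffix, names in sorted(members_by_extension(sys.argv[1]).items()):
        print(suffix, len(names))
```

Only the last suffix is used for grouping, so a file such as "10x25_filtered.fasta.msa.tree" is counted under ".tree".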
- Bio2Byte
- Bio2Byte online predictors
- Feedback or Questions
- Bio2Byte tools package
- Bio2Byte Online Notebooks
Implementation of the b2btools to study the protein biophysical features and their conservation
Kagami, L. P., Orlando, G., Raimondi, D., Ancien, F., Dixit, B., Gavaldá-García, J., Ramasamy, P., Roca-Martínez, J., Tzavella, K., & Vranken, W. (2021). b2bTools: Online predictions for protein biophysical features and their conservation. Nucleic Acids Research, 49(W1), W52–W59. https://doi.org/10.1093/nar/gkab425
DynaMine
Cilia, E., Pancsa, R., Tompa, P., Lenaerts, T., & Vranken, W. F. (2013). From protein sequence to dynamics and disorder with DynaMine. Nature Communications, 4(1), 2741–2741. https://doi.org/10.1038/ncomms3741
EFoldMine
Raimondi, D., Orlando, G., Pancsa, R., Khan, T., & Vranken, W. F. (2017). Exploring the Sequence-based Prediction of Folding Initiation Sites in Proteins. Scientific Reports, 7(1), 8826–8826. https://doi.org/10.1038/s41598-017-08366-3
DisoMine
Orlando, G., Raimondi, D., Codicè, F., Tabaro, F., & Vranken, W. (2022). Prediction of Disordered Regions in Proteins with Recurrent Neural Networks and Protein Dynamics. Journal of Molecular Biology, 434(12), 167579. https://doi.org/10.1016/j.jmb.2022.167579
AgMata
Orlando, G., Silva, A., Macedo-Ribeiro, S., Raimondi, D., & Vranken, W. (2020). Accurate prediction of protein beta-aggregation with generalized statistical potentials. Bioinformatics, 36(7), 2076–2081. https://doi.org/10.1093/bioinformatics/btz912
PSPer
Orlando, G., Raimondi, D., Tabaro, F., Codicè, F., Moreau, Y., & Vranken, W. F. (2019). Computational identification of prion-like RNA-binding proteins that form liquid phase-separated condensates. Bioinformatics, 35(22), 4617–4623. https://doi.org/10.1093/bioinformatics/btz274
© Wim Vranken, Bio2Byte group, Vrije Universiteit Brussel (VUB), Brussels, Belgium.