Skip to content

Latest commit

 

History

History
157 lines (130 loc) · 8.67 KB

README.md

File metadata and controls

157 lines (130 loc) · 8.67 KB

MS²PIP

MS²PIP is a tool to predict MS² signal peak intensities from peptide sequences. It employs the XGBoost machine learning algorithm and is written in Python.

You can install MS²PIP on your machine by following the instructions below or the extended install instructions. For a more user friendly experience, we created a web server . There, you can easily upload a list of peptide sequences, after which the corresponding predicted MS² spectra can be downloaded in multiple file formats. The web server can also be contacted through the RESTful API.

To generate a predicted spectral library starting from a FASTA file, we developed a pipeline called fasta2speclib. Usage of this pipeline is described in fasta2speclib_config.md. Fasta2speclib was developed in collaboration with the ProGenTomics group for the MS²PIP for DIA project.

If you use MS²PIP for your research, please cite the following articles:

  • Gabriels, R., Martens, L., & Degroeve, S. (2019). Updated MS²PIP web server delivers fast and accurate MS² peak intensity prediction for multiple fragmentation methods, instruments and labeling techniques. Nucleic Acids Research https://doi.org/10.1093/nar/gkz299
  • Degroeve, S., Maddelein, D., & Martens, L. (2015). MS²PIP prediction server: compute and visualize MS² peak intensity predictions for CID and HCD fragmentation. Nucleic Acids Research, 43(W1), W326–W330. https://doi.org/10.1093/nar/gkv542
  • Degroeve, S., & Martens, L. (2013). MS²PIP: a tool for MS/MS peak intensity prediction. Bioinformatics (Oxford, England), 29(24), 3199–203. https://doi.org/10.1093/bioinformatics/btt544

Please also take note of and mention the MS²PIP-version and model-version you used.

Installation

Download the latest release and unzip. MS2PIPc runs on Python 3.5 or greater and the required Python packages are listed in requirements.txt. MS2PIPc requires machine specific compilation of the C-code:

sh compile.sh

Check out the extended install instructions for a more detailed explanation.

Predicting MS2 peak intensities

MS2PIPc comes with pre-trained models for a variety of fragmentation methods and modifications. These models can easily be applied by configuring MS2PIPc in the config.txt file and providing a list of peptides in the form of a PEPREC file.

MS2PIPc command line interface

usage: ms2pipC.py [-h] [-c FILE] [-s FILE] [-w FILE] [-m INT] <peptide file>

positional arguments:
  <peptide file>  list of peptides

optional arguments:
  -h, --help      show this help message and exit
  -c FILE         config file (by default config.txt)
  -s FILE         .mgf MS2 spectrum file (optional)
  -w FILE         write feature vectors to FILE.{pkl,h5} (optional)
  -m INT          number of cpu's to use

Config file

Several MS2PIPc options need to be set in this config file.

The models that should be used are set as model=X where X is one of the currently supported MS2PIP models (see MS2PIP Models).

The fragment ion error tolerance is set as frag_error=X where is X is the tolerance in Da.

PTMs (see further) are set as ptm=X,Y,opt,Z for each internal PTM where X is a string that represents the PTM, Y is the difference in Da associated with the PTM, opt is a required for compatibility with other CompOmics projects, and Z is the amino acid that is modified by the PTM. For N- and C-terminal modifications, Z should be N-term or C-term, respectively.

PEPREC file

To apply the pre-trained models you need to pass only a <peptide file> to ms2pipC.py. This file contains the peptide sequences for which you want to predict the b- and y-ion peak intensities. The file is space separated and contains four columns with the following header names:

  • spec_id: an id for the peptide/spectrum
  • modifications: a string indicating the modified amino acids
  • peptide: the unmodified amino acid sequence
  • charge: charge state to predict

The spec_id column is a unique identifier for each peptide that will be used in the TITLE field of the predicted MS2 .mgf file. The modifications column is a string that lists the PTMs in the peptide. Each PTM is written as A|B where A is the location of the PTM in the peptide (the first amino acid has location 1, location 0 is used for n-term modifications, while -1 is used for c-term modifications) and B is a string that represent the PTM as defined in the config file (-c command line argument). Multiple PTMs in the modifications column are concatenated with '|'.

As an example, suppose the config file contains the line

ptm=Cam,57.02146,opt,C
ptm=Ace,42.010565,opt,N-term
ptm=Glyloss,-58.005479,opt,C-term

then a modifications string could like 0|Ace|2|Cam|5|Cam|-1|Glyloss which means that the second and fifth amino acid is modified with Cam, that there is an N-terminal modification Ace, and that there is a C-terminal modification Glyloss.

In the conversion_tools folder, we provide a host of Python scripts to convert common search engine output files to a PEPREC file.

The predictions are saved in a .csv file with the name <peptide_file>_predictions.csv. If you want the output to be in the form of an .mgf file, replace the variable mgf in line 716 of ms2pipC.py.

MS²PIP models

Currently the following models are supported in MS²PIP: HCD, CID, TTOF5600, TMT, iTRAQ, iTRAQphospho, HCDch2 and CIDch2. The last two "ch2" models also include predictions for doubly charged fragment ions (b++ and y++), next to the predictions for singly charged b- and y-ions.

If you use MS²PIP for your research, always mention the MS²PIP-version (see releases page) and model-version (see table below) you used.

Models, version numbers, and the train and test datasets used to create each model

Model Current version Train-test dataset (unique peptides) Evaluation dataset (unique peptides) Median Pearson correlation on evaluation dataset
HCD v20190107 MassIVE-KB (1 623 712) PXD008034 (35 269) 0.903786
CID v20190107 NIST CID Human (340 356) NIST CID Yeast (92 609) 0.904947
iTRAQ v20190107 NIST iTRAQ (704 041) PXD001189 (41 502) 0.905870
iTRAQphospho v20190107 NIST iTRAQ phospho (183 383) PXD001189 (9 088) 0.843898
TMT v20190107 Peng Lab TMT Spectral Library (1 185 547) PXD009495 (36 137) 0.950460
TTOF5600 v20190107 PXD000954 (215 713) PXD001587 (15 111) 0.746823
HCDch2 v20190107 MassIVE-KB (1 623 712) PXD008034 (35 269) 0.903786 (+) and 0.644162 (++)
CIDch2 v20190107 NIST CID Human (340 356) NIST CID Yeast (92 609) 0.904947 (+) and 0.813342 (++)

MS² acquisition information and peptide properties of the models' training datasets

For optimal results, your experimental data should match the properties of the MS²PIP model.

Model Fragmentation method MS² mass analyzer Peptide properties
HCD HCD Orbitrap Tryptic digest
CID CID Linear ion trap Tryptic digest
iTRAQ HCD Orbitrap Tryptic digest, iTRAQ-labeled
iTRAQphospho HCD Orbitrap Tryptic digest, iTRAQ-labeled, enriched for phosphorylation
TMT HCD Orbitrap Tryptic digest, TMT-labeled
TTOF5600 CID Quadrupole Time-of-Flight Tryptic digest
HCDch2 HCD Orbitrap Tryptic digest
CIDch2 CID Linear ion trap Tryptic digest

To train custom MS2PIPc models, please refer to Training new MS2PIP models on our Wiki pages.