Generate proteogenomic search databases from variant-calling output (VCF files) and run 2-stage or multi-round FragPipe searches.

BackusLab/chemoproteogenomics

 
 


Chemoproteogenomics

This repository provides code for generating proteomic search databases (FASTA) that contain missense variants.
Scripts for command-line MSFragger 2-stage searches are also provided, although using the GUI is highly recommended.

Instructions for running 2-stage searches with the GUI are available at https://fragpipe.nesvilab.org/docs/tutorial_two_pass_search.html
and in the supplementary information of the bioRxiv manuscript "Multi-omic stratification of the missense variant cysteinome".

Model

Important

The updated GUI is recommended over the command-line scripts provided here.

Custom Database Generation

Generate sample-matched, variant-containing peptide databases from VCFs, with the option to generate all combinations of variants. The script outputs full protein sequence databases and tryptic peptide databases, with either detailed or minimal FASTA headers, that can be used to search mass spectrometry-based proteomics data.

VCF processing:

Model

Before Running

  1. Download or clone the repo
git clone https://github.com/hdesai17/chemoproteogenomics.git
  2. Move the VCF file into the root directory (../chemoproteogenomics), or make sure the working directory contains the VCF, the Annotations and Tools folders, and the GenerateDB.sh script
    A sample VCF file is here

Important

The VCF file name should follow the format SAMPLE_NAME_RNA.vcf or SAMPLE_NAME_Exome.vcf
The VCF must derive from a prior alignment to hg38

mv *.vcf chemoproteogenomics
  3. Download the GENCODE v28 protein-coding translations and GTF annotation files, as well as the RData file of common missense SNPs, from this link and move all three into the Annotations directory
mv *gencode /path/to/working/directory/Annotations/
mv *common /path/to/working/directory/Annotations/

Important

The following are required to run the script:

  • R packages VariantAnnotation, BSgenome.Hsapiens.UCSC.hg38, svMisc, pbapply and any other dependencies
  • Python v3.0+
  • Standard system RAM and hardware
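The setup steps above can be checked before launching the script. The following is a minimal sketch, not part of the repository: the function names `check_vcf_name` and `preflight` are hypothetical, and the checks only cover what this README states (the VCF file name format, the Annotations and Tools folders, the GenerateDB.sh script, and the Python version). It does not verify the R packages or the contents of the Annotations folder.

```python
import re
import sys
from pathlib import Path

# File name format stated in this README: SAMPLE_NAME_RNA.vcf or SAMPLE_NAME_Exome.vcf
VCF_NAME = re.compile(r"^(?P<sample>\S+)_(?P<source>RNA|Exome)\.vcf$")


def check_vcf_name(filename):
    """Return (sample, source) if the name matches the expected format, else None."""
    match = VCF_NAME.match(filename)
    return (match.group("sample"), match.group("source")) if match else None


def preflight(workdir):
    """Return a list of human-readable problems found in the working directory."""
    root = Path(workdir)
    problems = []
    # Folders the README says must be present in the working directory.
    for folder in ("Annotations", "Tools"):
        if not (root / folder).is_dir():
            problems.append(f"missing folder: {folder}")
    if not (root / "GenerateDB.sh").is_file():
        problems.append("missing script: GenerateDB.sh")
    # At least one correctly named VCF should sit next to the script.
    vcf_names = sorted(p.name for p in root.glob("*.vcf"))
    if not vcf_names:
        problems.append("no .vcf file found")
    for name in vcf_names:
        if check_vcf_name(name) is None:
            problems.append(f"unexpected VCF name: {name}")
    if sys.version_info < (3, 0):
        problems.append("Python 3.0+ is required")
    return problems
```

Calling `preflight(".")` from the working directory returns an empty list when nothing is missing.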

Running

./GenerateDB.sh [sample name] [TRUE/FALSE]

or

sh GenerateDB.sh [sample name] [TRUE/FALSE]

The first argument is the sample name (no spaces). The second argument is TRUE or FALSE and controls whether all combinations of variants are generated (default = FALSE).
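For scripted runs over several samples, the call can be wrapped as below. This is only a sketch under the argument order described above; the helper names `build_generate_db_command` and `run_generate_db` are hypothetical and not part of the repository.

```python
import subprocess


def build_generate_db_command(sample_name, all_combinations=False):
    """Build the argument list for `sh GenerateDB.sh [sample name] [TRUE/FALSE]`."""
    # The README requires a sample name without spaces.
    if not sample_name or any(ch.isspace() for ch in sample_name):
        raise ValueError("sample name must be non-empty and contain no spaces")
    flag = "TRUE" if all_combinations else "FALSE"
    return ["sh", "GenerateDB.sh", sample_name, flag]


def run_generate_db(sample_name, all_combinations=False, workdir="."):
    """Run GenerateDB.sh in workdir; raises CalledProcessError on a non-zero exit."""
    command = build_generate_db_command(sample_name, all_combinations)
    return subprocess.run(command, cwd=workdir, check=True)
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` stops a batch run at the first failing sample.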

Warning

Generating combinations of variants requires R packages for parallel computing to keep run time manageable:
doParallel, foreach, doSNOW
This mode uses all available cores minus 2
Running without combinations is recommended

Outputs

The Custom_Databases folder contains several variations of the FASTA databases:

  • 2TS = sequences are limited to two tryptic sites flanking each variant site; otherwise, entries are whole protein sequences
  • rev = contains reversed sequences, labeled REV
  • dedup = redundant peptide sequences are removed, regardless of transcript ID
  • simple = minimal headers containing only the UniProt ID
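To illustrate what a "2TS" entry might look like, here is a rough sketch. It rests on assumptions that this README does not confirm: that "two tryptic sites" means extending to the second cleavage site on each side of the variant, and that cleavage follows the common rule of cutting after K or R unless the next residue is P. The function names and the toy sequence are hypothetical, and the repository's R code may define the window differently.

```python
def cleavage_boundaries(seq):
    """Cut positions under the assumed rule: after K or R, unless followed by P."""
    return [i + 1 for i in range(len(seq) - 1)
            if seq[i] in "KR" and seq[i + 1] != "P"]


def flanking_window(seq, variant_pos, n_sites=2):
    """Return the stretch of seq bounded by the n-th cleavage site on each side.

    variant_pos is the 1-based position of the variant residue. If fewer than
    n_sites cleavage sites exist on a side, the window runs to the protein end.
    """
    idx = variant_pos - 1
    if not 0 <= idx < len(seq):
        raise ValueError("variant position is outside the sequence")
    cuts = cleavage_boundaries(seq)
    left = [c for c in cuts if c <= idx]   # cuts before the variant residue
    right = [c for c in cuts if c > idx]   # cuts after the variant residue
    start = left[-n_sites] if len(left) >= n_sites else 0
    end = right[n_sites - 1] if len(right) >= n_sites else len(seq)
    return seq[start:end]


# Toy sequence (not a real protein) with the variant residue at position 8.
toy = "AAKGGRCCCKDDRPEEKFF"
print(flanking_window(toy, 8))             # GGRCCCKDDRPEEK
print(flanking_window(toy, 8, n_sites=1))  # CCCK
```

With `n_sites=1` the window is the fully tryptic peptide containing the variant; with `n_sites=2` it also covers peptides with one missed cleavage on either side.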

Example header with annotations:

sp|P01116-1|ENSG00000133703.11|KRAS|ENST00000256078.8|122572-H358_RARE_RNA-G12C|12c|

sp|Uniprot-ID|Ensembl Gene ID|Gene Name|Ensembl Transcript ID|Internal-TxID_Sample_Rare/Common-NGS-Source|AA Mutation|

The internal TxID is used in makeDB.R to match missense mutations to proteins
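When post-processing search results, the annotated header can be split into named fields. The sketch below follows the pipe-delimited layout shown above; the field names and the `parse_header` function are hypothetical labels chosen for illustration. The combined TxID/sample/source field is kept as a single string because its internal layout is not fully specified here.

```python
HEADER_FIELDS = (
    "database",               # "sp"
    "uniprot_id",
    "ensembl_gene_id",
    "gene_name",
    "ensembl_transcript_id",
    "variant_annotation",     # Internal-TxID_Sample_Rare/Common-NGS-Source
    "aa_mutation",
)


def parse_header(header):
    """Split an annotated FASTA header into a dict keyed by HEADER_FIELDS."""
    parts = header.lstrip(">").strip().split("|")
    # The example header ends with a trailing "|", which yields an empty last item.
    if parts and parts[-1] == "":
        parts = parts[:-1]
    if len(parts) != len(HEADER_FIELDS):
        raise ValueError(f"expected {len(HEADER_FIELDS)} fields, got {len(parts)}")
    return dict(zip(HEADER_FIELDS, parts))


example = ("sp|P01116-1|ENSG00000133703.11|KRAS|ENST00000256078.8|"
           "122572-H358_RARE_RNA-G12C|12c|")
fields = parse_header(example)
print(fields["gene_name"], fields["variant_annotation"])
```

Headers from the "simple" databases carry fewer fields and would raise a ValueError here, which makes a mix-up between database variants easy to spot.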

Note

In FragPipe search outputs, the 'Protein' column contains the FASTA header information. However, it is recommended to re-map peptides to a reference database to identify any tryptic peptides that fall outside the variant region (these will be limited in 2-stage searches).
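One simple way to approach the re-mapping is an exact substring search of each identified peptide against the reference protein sequences. The sketch below assumes the reference sequences are already loaded into a dict keyed by protein ID; the function names are hypothetical, the sequence is a short KRAS N-terminal fragment used only for illustration, and real pipelines may also need to handle isoleucine/leucine ambiguity.

```python
def find_in_reference(peptide, reference):
    """Return the IDs of reference proteins whose sequence contains the peptide."""
    return [protein_id for protein_id, seq in reference.items() if peptide in seq]


def classify_peptide(peptide, reference):
    """Label a peptide 'reference' if it maps to any reference protein, else 'variant-only'."""
    return "reference" if find_in_reference(peptide, reference) else "variant-only"


# Illustrative reference: a short N-terminal fragment of KRAS, not the full protein.
reference = {"P01116": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRK"}

print(classify_peptide("LVVVGAGGVGK", reference))  # reference
print(classify_peptide("LVVVGACGVGK", reference))  # variant-only
```

Peptides labeled "reference" are those that fall outside the variant region and should not be reported as variant evidence.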

Warning

For minimal (simple) FASTA headers, additional post-processing is required to obtain variant IDs after FragPipe searches.

MSFragger command-line 2-stage search

Note

The updated GUI is recommended over command-line scripts.

Process .raw MS files with an MSFragger pipeline that uses Philosopher and PeptideProphet for post-processing, with optional IonQuant quantitation.

Before Running

  1. Download or clone the repo
  git clone https://github.com/hdesai17/chemoproteogenomics.git
  2. The 2-stage_cmd-line folder contains the run script and helper scripts.

Important

Several files require path updates (see the notes in the 'pipeline' and 'run' scripts)

Running

./2-stage-run.sh

or

sh 2-stage-run.sh
