Chemoproteogenomics

This repository provides code for generating proteomic search databases (FASTA) that contain missense variants.
Command-line scripts for MSFragger 2-stage searches are also provided, although the GUI is highly recommended.

Custom Database Generation

Generate sample-matched, variant-containing peptide databases from VCFs, with the option to generate all combinations of variants. The outputs are full protein sequence and tryptic peptide databases, with either detailed or minimal FASTA headers, that can be used to search mass spectrometry-based proteomics data.

VCF processing:

Before Running

Download or clone the repo

git clone https://github.com/hdesai17/chemoproteogenomics.git

Move the VCF file into the root directory (../chemoproteogenomics), or make sure the working directory contains the VCF, the Annotations and Tools folders, and the GenerateDB.sh script
A sample VCF file is here

Important

The VCF file name should have the format SAMPLE_NAME_RNA.vcf or SAMPLE_NAME_Exome.vcf
The VCF requires prior alignment to hg38
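
The file name rule can be checked before running anything. The following is a minimal Python sketch, not part of the repository: `parse_vcf_name` and the regular expression are hypothetical, and they assume the sample name is everything before the final `_RNA` or `_Exome` suffix and contains no whitespace.

```python
import re

# Assumed pattern: <sample name>_RNA.vcf or <sample name>_Exome.vcf,
# where the sample name contains no whitespace.
VCF_NAME = re.compile(r"^(?P<sample>\S+)_(?P<source>RNA|Exome)\.vcf$")

def parse_vcf_name(filename):
    """Return (sample, source) if the name matches the expected format, else None."""
    match = VCF_NAME.match(filename)
    if match is None:
        return None
    return match.group("sample"), match.group("source")
```

A name such as `SAMPLE_NAME_RNA.vcf` yields `("SAMPLE_NAME", "RNA")`, while a name without one of the two suffixes yields `None`.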

mv *.vcf chemoproteogenomics

Download the GENCODE v28 protein-coding translations and GTF annotation files, as well as the RData file of common missense SNPs, from this link and move all three into the Annotations directory

mv *gencode /path/to/working/directory/Annotations/
mv *common /path/to/working/directory/Annotations/

Important

Several requirements are necessary to run:

R packages VariantAnnotation, BSgenome.Hsapiens.UCSC.hg38, svMisc, pbapply and any other dependencies
Python v3.0+
Standard system RAM and hardware

Running

./GenerateDB.sh [sample name] [TRUE/FALSE]

or

sh GenerateDB.sh [sample name] [TRUE/FALSE]

The arguments are the sample name (no spaces) followed by TRUE or FALSE, which sets whether to generate all combinations of variants (default = FALSE)
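
For scripted runs, the call can be wrapped so that the arguments are checked first. This is a minimal Python sketch, not part of the repository: `build_command` and `run_generate_db` are hypothetical helpers, and the default path assumes GenerateDB.sh sits in the current working directory.

```python
import subprocess

def build_command(sample_name, combinations=False, script="./GenerateDB.sh"):
    """Return the argument list for GenerateDB.sh after basic checks."""
    if not sample_name or any(ch.isspace() for ch in sample_name):
        raise ValueError("sample name must be non-empty and contain no spaces")
    # The script expects the literal strings TRUE or FALSE as its second argument.
    flag = "TRUE" if combinations else "FALSE"
    return [script, sample_name, flag]

def run_generate_db(sample_name, combinations=False):
    """Run the script; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(sample_name, combinations), check=True)
```

Calling `run_generate_db("SAMPLE_NAME")` corresponds to `./GenerateDB.sh SAMPLE_NAME FALSE`.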

Warning

Generating combinations of variants requires R packages for parallel computing to minimize run time:
doParallel, foreach, doSNOW
Uses all available cores minus 2
Running without combinations is recommended
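
To see why combinations are costly, the sketch below counts them for a single protein. It is an illustration in Python rather than the repository's R code, and it assumes that "all combinations" means every non-empty subset of the variants on one protein, which is not stated explicitly here. `worker_count` mirrors the "all cores minus 2" rule; both function names and the variant labels are hypothetical.

```python
import os
from itertools import combinations

def worker_count(reserved=2):
    """Number of workers when `reserved` cores are left free (at least 1)."""
    total = os.cpu_count() or 1
    return max(1, total - reserved)

def variant_combinations(variants):
    """Yield every non-empty subset of `variants` as a tuple."""
    for size in range(1, len(variants) + 1):
        yield from combinations(variants, size)

# Placeholder variant labels: 3 variants give 2**3 - 1 = 7 subsets.
subsets = list(variant_combinations(["V1", "V2", "V3"]))
print(len(subsets))  # 7
```

Under this assumption the count grows as 2^n - 1 per protein, so a protein with 20 variants would already produce more than a million sequences.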

Outputs

In the Custom_Databases folder, there are variations of FASTA databases:

2TS = two tryptic sites flanking each variant site; databases without this tag contain whole protein sequences
rev = contains reversed sequences, marked as REV
dedup = redundant peptide sequences are removed, regardless of transcript ID
simple = minimal headers containing only the UniProt ID
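
One way to read the 2TS option is sketched below in Python. This is an interpretation, not the repository's implementation: it assumes trypsin cleaves after every K or R (ignoring the proline rule), and that the window around a variant extends to the second cleavage point on each side. `cleavage_points`, `flanking_window`, and the toy sequence are hypothetical.

```python
def cleavage_points(sequence):
    """Indices where a tryptic peptide can start or end: 0, after each K/R, and the end."""
    points = [0]
    for i, residue in enumerate(sequence):
        if residue in "KR" and i + 1 < len(sequence):
            points.append(i + 1)
    points.append(len(sequence))
    return points

def flanking_window(sequence, position, sites=2):
    """Return the stretch around the 1-based `position`, bounded by the
    `sites`-th cleavage point on each side (or the protein termini)."""
    points = cleavage_points(sequence)
    index = position - 1
    upstream = [p for p in points if p <= index]
    downstream = [p for p in points if p > index]
    start = upstream[-sites] if len(upstream) >= sites else 0
    end = downstream[sites - 1] if len(downstream) >= sites else len(sequence)
    return sequence[start:end]

# Toy sequence, not a real protein; the variant is at position 8.
print(flanking_window("AAKGGRCCCKDDREEKFF", 8))  # GGRCCCKDDR
```

In the toy example the variant-containing peptide CCCK is kept together with one neighboring peptide on each side, which leaves room for one missed cleavage in either direction.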

Example header with annotations:

sp|P01116-1|ENSG00000133703.11|KRAS|ENST00000256078.8|122572-H358_RARE_RNA-G12C|12c|

The internal TxID is used in makeDB.R to match missense mutations to proteins
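
The detailed header can be split into its parts after a search. The Python sketch below is a guess at the layout based only on the example header above: the field names are assumptions, the sixth field is assumed to be `<internal TxID>-<sample>-<variant>`, and the meaning of the last field is not documented here, so it is kept verbatim as `suffix`. `parse_header` is a hypothetical helper.

```python
def parse_header(header):
    """Split a detailed FASTA header into named parts (field names are assumed)."""
    fields = header.lstrip(">").rstrip("|").split("|")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, found {len(fields)}")
    database, accession, gene_id, gene, transcript_id, tag, suffix = fields
    # Assumed layout of the sixth field: <internal TxID>-<sample>-<variant>.
    tx_id, _, rest = tag.partition("-")
    sample, _, variant = rest.rpartition("-")
    return {
        "database": database,
        "accession": accession,
        "gene_id": gene_id,
        "gene": gene,
        "transcript_id": transcript_id,
        "tx_id": tx_id,
        "sample": sample,
        "variant": variant,
        "suffix": suffix,
    }

example = "sp|P01116-1|ENSG00000133703.11|KRAS|ENST00000256078.8|122572-H358_RARE_RNA-G12C|12c|"
parsed = parse_header(example)
print(parsed["gene"], parsed["variant"])  # KRAS G12C
```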

Note

In FragPipe search outputs, the 'Protein' column will contain the FASTA header information. However, it is recommended to re-map peptides to reference databases to account for any tryptic peptides that fall outside of the variant region (these will be limited in the case of 2-stage searches).

Warning

For minimal (simple) FASTA headers, additional post-processing is required to obtain variant IDs after FragPipe searches.
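
One possible form of this post-processing is to look identified peptides up in the detailed-header FASTA for the same sample. The Python sketch below assumes that such a detailed database is available and that an exact substring match is sufficient; `read_fasta`, `headers_for_peptide`, and the toy entries are hypothetical and are not taken from the repository.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def headers_for_peptide(peptide, entries):
    """Return the headers of all entries whose sequence contains `peptide`."""
    return [header for header, sequence in entries if peptide in sequence]

# Toy entries with placeholder headers and sequences.
toy_fasta = """>entry_A
GGRCCCK
DDR
>entry_B
AAKGGR
""".splitlines()

entries = list(read_fasta(toy_fasta))
print(headers_for_peptide("CCCKDDR", entries))  # ['entry_A']
```

A peptide that matches several entries returns several headers, so the caller still has to decide how to report shared peptides.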

MSFragger command-line 2-stage search

Note

The updated GUI is recommended over command-line scripts.

Process .raw MS files with an MSFragger pipeline that uses Philosopher and PeptideProphet for post-processing, with optional IonQuant quantitation.

Before Running

Download or clone the repo

  git clone https://github.com/hdesai17/chemoproteogenomics.git

The 2-stage_cmd-line folder contains the run script and helper scripts.

Important

Several files require path updates (see the notes in the 'pipeline' and 'run' scripts)

Running

./2-stage-run.sh

or

sh 2-stage-run.sh
