This repository provides code for generating missense variant-containing proteomic search databases (FASTA).
Command-line scripts for MSFragger 2-stage searches are also provided, although the GUI is highly recommended.
- Generating sample-specific, variant-containing FASTA databases
- option for including all combinations of variants
- MSFragger command line pipelines for 2-stage searches
Instructions for running 2-stage searches with the GUI are located at https://fragpipe.nesvilab.org/docs/tutorial_two_pass_search.html
and in the supplementary information of the bioRxiv manuscript "Multi-omic stratification of the missense variant cysteinome".
Important
❗ The updated GUI is recommended over command-line scripts provided here.
Generate sample-matched, variant-containing peptide databases from VCFs, with the option to generate all combinations of variants. The script outputs full protein sequence and tryptic peptide databases, with either detailed or minimal FASTA headers, that can be used to search mass spectrometry-based proteomics data.
VCF processing:
- Download or clone the repo
git clone https://github.com/hdesai17/chemoproteogenomics.git
- Move the VCF file into the root directory (../chemoproteogenomics), or make sure the working directory contains the VCF, the Annotations and Tools folders, and the GenerateDB.sh script
A sample VCF file is here
Important
The VCF file name should follow the format SAMPLE_NAME_RNA.vcf or SAMPLE_NAME_Exome.vcf
The VCF must come from data previously aligned to hg38
mv *.vcf chemoproteogenomics
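The naming rule above can be checked before running the pipeline. The following is a minimal sketch, not part of the repository: the function name `parse_vcf_name` is hypothetical, and the pattern assumes the `_RNA` / `_Exome` suffix is case-sensitive and that the sample name contains no spaces.

```python
import re
from pathlib import Path

# Assumed from the naming rule above: <sample>_RNA.vcf or <sample>_Exome.vcf,
# with no whitespace in the sample name. Case sensitivity is an assumption.
VCF_NAME = re.compile(r"^(?P<sample>\S+)_(?P<source>RNA|Exome)\.vcf$")


def parse_vcf_name(path):
    """Return (sample, source) if the file name follows the convention, else None."""
    match = VCF_NAME.match(Path(path).name)
    if match is None:
        return None
    return match.group("sample"), match.group("source")


if __name__ == "__main__":
    for name in ["H358_RNA.vcf", "MY_SAMPLE_Exome.vcf", "H358.vcf"]:
        print(name, parse_vcf_name(name))
```

The sample name returned here is what would be passed as the first argument to GenerateDB.sh.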
- Download the GENCODE v28 protein-coding translations and GTF annotation files, as well as the RData file of common missense SNPs, from this link and move all three into the Annotations directory
mv *gencode /path/to/working/directory/Annotations/
mv *common /path/to/working/directory/Annotations/
Important
Several requirements are necessary to run the script:
- R packages VariantAnnotation, BSgenome.Hsapiens.UCSC.hg38, svMisc, pbapply and any other dependencies
- Python v3.0+
- Standard system RAM and hardware
./GenerateDB.sh [sample name] [TRUE/FALSE]
or
sh GenerateDB.sh [sample name] [TRUE/FALSE]
The arguments are the sample name (no spaces) followed by TRUE or FALSE to indicate whether to generate all combinations of variants (default = FALSE)
Warning
Generating combinations of variants requires R packages for parallel computing to keep run time manageable:
doParallel, foreach, doSNOW
The script uses all available cores minus 2
Running without combinations is recommended
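To illustrate why the all-combinations option is expensive, here is a minimal Python sketch of the idea. It is not the repository's R implementation; the function names and the toy sequence are hypothetical. For n variants at distinct positions in one protein, it produces 2^n - 1 sequences, so the output grows quickly with the number of variants.

```python
from itertools import combinations


def apply_variants(sequence, variants):
    """Apply (position, ref, alt) substitutions; positions are 1-based."""
    residues = list(sequence)
    for pos, ref, alt in variants:
        if residues[pos - 1] != ref:
            raise ValueError(f"expected {ref} at position {pos}")
        residues[pos - 1] = alt
    return "".join(residues)


def all_variant_combinations(sequence, variants):
    """Yield one sequence per non-empty subset of variants at distinct positions."""
    for size in range(1, len(variants) + 1):
        for subset in combinations(variants, size):
            positions = [pos for pos, _, _ in subset]
            if len(set(positions)) < len(positions):
                continue  # two alleles at the same position cannot share a sequence
            yield apply_variants(sequence, subset)


if __name__ == "__main__":
    toy_sequence = "ACDEFGHIK"                  # toy sequence, not a real protein
    toy_variants = [(2, "C", "S"), (6, "G", "V")]  # hypothetical variants
    for seq in all_variant_combinations(toy_sequence, toy_variants):
        print(seq)
```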
In the Custom_Databases folder, there are variations of FASTA databases:
- 2TS = sequences with two tryptic sites flanking each variant site; databases without this tag contain whole protein sequences
- rev = contains reversed sequences, specified as REV
- dedup = redundant peptide sequences are removed, regardless of transcript ID
- simple = minimal headers containing only the UniProt ID
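As a rough illustration of what the rev and dedup variants mean, the sketch below shows both operations on (header, sequence) pairs. It is a simplified stand-in, not the repository's code: the function names are hypothetical, and the `REV_` prefix is an assumption, since the text above only says that reversed entries are "specified as REV".

```python
def add_reversed_decoys(entries, prefix="REV_"):
    """Append a reversed copy of every (header, sequence) entry.

    The exact decoy tag format is an assumption made for this sketch.
    """
    decoys = [(prefix + header, seq[::-1]) for header, seq in entries]
    return entries + decoys


def deduplicate(entries):
    """Keep the first entry for each distinct sequence, whatever its header."""
    seen = set()
    unique = []
    for header, seq in entries:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq))
    return unique


if __name__ == "__main__":
    # Toy entries with hypothetical headers; two transcripts share one peptide.
    entries = [("tx1", "LVVVGACGVGK"), ("tx2", "LVVVGACGVGK"), ("tx3", "SALTIQ")]
    for header, seq in add_reversed_decoys(deduplicate(entries)):
        print(header, seq)
```

Deduplicating before adding decoys keeps the target and decoy sets the same size.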
Example header with annotations:
sp|P01116-1|ENSG00000133703.11|KRAS|ENST00000256078.8|122572-H358_RARE_RNA-G12C|12c|
sp|Uniprot-ID|Ensembl Gene ID|Gene Name|Ensembl Transcript ID|Internal-TxID_Sample_Rare/Common-NGS-Source|AA Mutation|
The internal TxID is used in makeDB.R to match missense mutations to proteins
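The annotated header can be split into named fields for downstream analysis. The following is a minimal sketch assuming headers are pipe-delimited with a trailing pipe, as in the example above; the field names and the `parse_header` function are hypothetical, and the composite sixth field is kept intact rather than split further.

```python
from typing import NamedTuple


class VariantHeader(NamedTuple):
    database: str
    uniprot_id: str
    ensembl_gene_id: str
    gene_name: str
    ensembl_transcript_id: str
    variant_tag: str   # internal TxID, sample, rare/common and NGS source, kept as one string
    aa_mutation: str


def parse_header(header):
    """Split a pipe-delimited annotated FASTA header into named fields."""
    fields = header.lstrip(">").strip().split("|")
    if fields and fields[-1] == "":
        fields = fields[:-1]  # drop the empty field left by the trailing pipe
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, found {len(fields)}: {header!r}")
    return VariantHeader(*fields)


if __name__ == "__main__":
    example = "sp|P01116-1|ENSG00000133703.11|KRAS|ENST00000256078.8|122572-H358_RARE_RNA-G12C|12c|"
    parsed = parse_header(example)
    print(parsed.gene_name, parsed.variant_tag, parsed.aa_mutation)
```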
Note
In FragPipe search outputs, the 'Protein' column will contain the FASTA header information. However, it is recommended to re-map peptides to reference databases to catch any tryptic peptides that fall outside of the variant region (these will be limited in the case of 2-stage searches).
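One simple way to perform the re-mapping is an exact substring search of each identified peptide against the reference protein sequences. The sketch below assumes the reference database has already been loaded into a dictionary of protein ID to sequence; the function name and the toy data are hypothetical, and it does not treat isoleucine and leucine as equivalent.

```python
def map_to_reference(peptide, reference):
    """Return the IDs of reference proteins whose sequence contains the peptide."""
    return sorted(pid for pid, seq in reference.items() if peptide in seq)


if __name__ == "__main__":
    # Toy reference sequences keyed by hypothetical protein IDs.
    reference = {"protA": "MKLVVVGAGGVGKSALR", "protB": "MSALRTTK"}
    for peptide in ["LVVVGAGGVGK", "LVVVGACGVGK", "SALR"]:
        hits = map_to_reference(peptide, reference)
        label = "reference" if hits else "variant-specific"
        print(peptide, label, hits)
```

A peptide with no reference hit is a candidate variant peptide; one with hits falls outside the variant region and can be reported against the reference protein instead.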
Warning
For minimal (simple) FASTA headers, additional post-processing is required to obtain variant IDs after FragPipe searches.
Note
The updated GUI is recommended over command-line scripts.
Process .raw MS files with an MSFragger pipeline, using Philosopher and PeptideProphet for post-processing, with optional IonQuant quantitation.
- Download or clone the repo
git clone https://github.com/hdesai17/chemoproteogenomics.git
- The 2-stage_cmd-line folder contains the run script and helper scripts.
Important
Several files require path updates (see the notes in the 'pipeline' and 'run' scripts)
./2-stage-run.sh
or
sh 2-stage-run.sh