Skip to content

Latest commit

 

History

History
10246 lines (8690 loc) · 313 KB

index.md

File metadata and controls

10246 lines (8690 loc) · 313 KB

About

Quantitative mass spectrometry is a formidiable approach to identify differentially abundance proteins from shotgun proteomics experiments (Liu et al. 2013, (Mann et al. 2013)). A common problem for shotgun proteomics experiment has been low reproducibility due to the complexity of samples analyzed (Liu et al. 2004,Li et al. 2009, Amodei et al. 2019, Michalski et al. 2011, (Li et al. 2009) as well as the available method for analyzing data. For example Michalski et al. 2011 has found that only about 16 % of the detectable peptides are typically fragmented using data dependent LC-MS/MS methods, and there is typically low reproducibility (35-60% overlap of peptide identification) between experiment (Tabb et al. 2010). This problem is especially prominent in data dependent acquisition (DDA) methods. Attempts at solving these issues have been made by the introduction of data independent acquisition (DIA) methods (Venable et al. 2004, Plumb et al. 2006, Distler et al. 2014, Moran et al. 2014, Pak et al. 2013, Geiger et al. 2010, Panchaud et al. 2011, Weisbrod et al. 2012, Carvalho et al. 2010, Egertson et al. 2013), where spectras are acquired to a predefined fixed window. These methods have been shown to have superior reproducibility to DDA. Navarro et al. 2016 benchmarked five DIA protein quantification methods; four peptide-centric query tools (OpenSwath, SWATH2.0, Skuline, Spectronaut) and one data-centric approach DIA-umpire. The peptide-centric methods use MS/MS libraries to match peptides followed by statistical analysis to distinguise true from false discoveries. While the data-centric approaches generates "pseudo" MS/MS spectra that can be identified and quantified using convential database searching and protein inference tools (without assay libraries, but requiring FASTA database similar to conventional DDA search engines) Zhong Li et al. 2009iTsou et al. 2015. These DIA protein quantification processes are multistep procedures often incorporating imputation, filtering and tresholding procedures and various stages. These procedures often ignore the error sources. These imputations causes errors to propagate down the processing pipeline The et al. 2018. To solve this issue Triqler a novel software for protein quantification has been developed. It uses probabilistic graphical models to generate posterior distributions for fold changes between treatment groups. By using probabilistical graphical model triqler eliminates the need for data imputation. This project aims to benchmark Triqler against commonly used softwares for DIA protein quantification.

Problem

Triqler is a novel software that uses a probabilistical graphipcal models for protein quantification, effectively eliminating the need for filtering, tresholding and imputational procedures. Previous efforts for DIA protein quantification has been dependent on spectral libraries for peptide and protein identification, as well as the construction of pseudo-MS/MS which are similar to DDA spectras and therefore allows for peptide and protein identification with conventional search-engine algorithms, which often use the procedures.

The use of probabilistical graphical models for protein quantification has not yet been shown better than existing methods. There are strong theorethical reason to believe that bayesian models outperform frequentist models which require filtering, tresholding and imputationi. Further, bayesian methods for treating the missing value problem as proved effective in other biological areas Halme et al. 2018Shah et al. 2019Luo et al. 2014. Moreover, Triqler has shown to be able to distinguise more protein for DDA data than other commonly used DDA protien quantification methods The et al. 2018. It is therefore interesting to investigate the performance of Triqler as a protein quantification tool for DIA data.

Preliminary Research Question

The research aims to answer the following questions:

  • How does triqler protein quantification compare to existing DIA protein quantification pipelines?

This question will be answered by investigating following sub-questions:

  • How do we benchmark Triqler, a bayesian model, against existing methods, such as Spectronaut?
  • What performance metrics is relevant to benchmark against?
  • How to we show the comparison in a fair manner?

Data

A DIA benchmarking sample with hybrid proteome samples consisting of tryptic digests of human, yeast and E.coli protein is used for this study. There are two samples, A and B, the expected ratio of protein for A:B comparison are 1:1 for human, 1:2 for yeast and 4:1 for E.coli. Sample A is composed of 65% human, 30% yeast, 5% E.Coli, while sample B was composed of 65% human, 15% yeast and 20% E.coli. The absolut amount of proteins are unknown, but the dataset contain a sufficiently large number of peptides to enable the in-depth evaluation of both precision and accuracy of relative label-free quantification (Kuharev et al. 2015). This sample is referred to as HYE124. The biological samples A and B both have three technical replicates on two different platforms (TripleTOF 5600 and TripleTOF 6600) using two different SWATH-MS acquisition modes i.e. we have four benchmark datasets. Further there are an additional four datasets reffered to as HYE110 for benchmarking the effect of different window numbers (32 vs 64 windows) and window size (fixed vs variable). HYE110 sample A has 67% human, 30% yeast and 3% E.coli, while sample B has 67% human, 3% yeast and 30% E.coli. For retention time alignment of the samples, a retention time kit(iRT kit from Biognosys, GmbH)[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3918884/](Escher et al. 2012) was used at a concentration of 1:20 in all samples. The data set has the ProteomeXchange accession number PXD002952.

Method

As Triqler takes peptide quantity, intensity and searching score to determine relative differences between groups. It is dependent on peptide quantification from other DIA quantification softwares. This section will outline the procedures in the processing pipeline for DIA analysis and subsequently the Triqler analysis.

[Process is specific to OpenSwath] Library based DIA search engine require a DDA-search to construct a spectral library. This process is described thoroughly in Schubert et al. 2015. This library is used to match the peptides found in the multiplexed windows of DIA SWATH:s. A DDA run often yield a vendor format such as .raw or .wiff. These format can be converted to standard format such as .mzML and .mzXML for further processing. To match the DDA run a conventional database searches is used (for example with X!Tandem, Mascot or Comet). Amonst other database files are available from UniProt, MassIVE, PRIDE (a common database standard format is FASTA). The DDA search output is then statistically scored to distinguish false discoveries from true ones (using for example PeptideProphet with iProphetsource or Percolatorsource). Spectral libraries can then be constructed from the DDA output with for example SpectraST and decoy peptides can be generated for this spectral library with openMS Schubert et al. 2015. The same process is repeated for to create a spectral library for iRT peptides, to construct a spectral library for retention time normalization Röst et al. 2016. Similarly, for a DIA run a vendor format needs to be converted to a standard format. A DIA peptide identification and quantification (using either method specified above) is performed with the generated spectra library (DDA peptide identification) to identify peptides from peak groups in the DIA data. PyProphet or Percolator can then be used to statistically score thses outputs and lastly TRIC can be used for cross-run alignment to reduce the missing values. Differential protein analysis can then be performed by conventional statistical analysis or Triqler.

This study aims firstly to investigate how Triqler performs in analysing DIA differential expression against conventional statistical methods. Where conventional statistical methods consist of filtering, tresholding and imputational steps performed in other studies. One such example is the usage of top three peptides as protein quantification in Navarro et al. 2016. Where the top three peptides method includes filtering of any protein with less than three peptides. Secondly it also aims to identify how errors in the various steps effects the end-results of conventional statistical analysis and Triqler.

Related studies and papers

Integrated identification and quantification error probabilities for shotgun proteomics

Integrating identification and quantification uncertainty for differential protein abundance analysis with Triqler

A new era in proteomics: spectral library free data independent acquisition (directDIA)

Missing Value chapter of Course Handouts for Bayesian Data Analysis Class

A foray into Bayesian handling of missing data

ToDO:

  • make a write-up about details about the projects.

Blog

(line 52 as reference)

2022-01-24

Fixing the parser msqrob2 for diann and osw.

2022-01-18

Off - until 2022-01-24

2022-01-17

08:00

Trying to get all the packages installed on my local computer.

13:02

statOmics/msqrob2#32

Issues solves: The example vignette on msqrob2 github repo was outdated.

Trying to figure out data format to parse my data to msqrob2.

16:00

Finished diann -> msqrob parser

1.load diann (result.tsv) 2. Select top PSM as peptides 3. Recompute FDR with CScores for filtered list. 4. Filter on peptide-level fdr. 5. Construct dataframe with df.pivot()

2022-01-16

11:00

Continue trying to fix msqrob2

22:00

Posted issue on msqrob2 issues.

2022-01-14

07:30

Check msqrob2 and continue trying to fix package error

09:30

Meeting

10:15

Thinking about the simulations probelm.

IF we have 3 proteoforms, with 10 unique peptides.

We can have 3 different distributions for the proteoforms, and we can have 1 distribution, where we rename the last peptide 10A, 10B and 10C. We assume that it is some simple PTM where we add on some simple amino acids.

We use the 3 distributions the vary the ratios of the proteoforms.

Using this data we can perform PCA.

2022-01-13

08:00

PCA on the PXD003704 data, reading paper about PXD003704. Check formatting on data.

14:50

Finished parallell coordinates plot on PXD003704. Seems very hard to read? How do we proceed?

start checking on msqrob2

r 3.6 is outdated. Rebuilding everything for r 4.0

17:40

Still trying to rebuild. I think It did not work because I had not used sudo.

This error:

Error: package or namespace load failed for ‘cfcdae’: package ‘cfcdae’ was installed before R 4.0.0: please re-install it

Trying this method.

https://stackoverflow.com/questions/63390194/package-xxx-was-installed-before-r-4-0-0-please-re-install-it

18:10

Sudo seemed to have solved it... with above link.

20:33

msqrob2 vignette example. https://statomics.github.io/msqrob2Examples/cptac.html

Cannot solve this error for normalize function from preprocessCore. Error in preprocessCore::normalize.quantiles(x, ...) : ERROR; return code from pthread_create() is 22

2022-01-12

06:15

Look at PXD003704 maxquant processed data.

16:00

Formatting PXD003704 data and reading up on the data.

What is Reporter Intensity Count What is Reporter Intensity CTRL_R1...CTRL_R4, and Pablo_R1... Pablo_R4.

2022-12-11

05:00

Working on fix on paper.

14:20

Working on formplot and making slides.

20:25

Reading up on proteoforms.

22:05

Looked into PXD003704 MCF-7 data. Finished ppt and reading about CETSA, proteoform, thermal shift assay, Thermal Proteome Profiling.

2021-11-17

Getting confused. Is mscore from pyprophet FDR?

It is stated here

https://assets.researchsquare.com/files/rs-893982/v1/a3afe574-fa8c-44e9-9ea1-e46b1579d558.pdf?c=1636997981

that it is?

2021-11-16

MSqRobSum should be which feature preptide id belongs to.

MSstats uses

Summary: MSstats is an R package for statistical relative quantification of proteins and peptides in mass spectrometry-based proteomics. Version 2.0 of MSstats supports label-free and label-based experimental workflows and data-dependent, targeted and data-independent spectral acquisition. It takes as input identified and quantified spectral peaks, and outputs a list of differentially abundant peptides or proteins, or summaries of peptide or protein relative abundance. MSstats relies on a flexible family of linear mixed models.

2021-11-15 investingating input format for msstats, msqrobsum and triqler

MSstats uses MaxQuant output format. It seems to be PSM-level FDRs. Looking the input converter from OpenSwathWorkFlow it also seems to be PSMs level.

MSstats uses feature-level data. This should be peptide-features right? Also, I have run this exactly as the tutorial. So this should be fine.

MsqRobSum this should be done using summarized peptide intensities. This paper also have some information about MSstats run, claiming peptide-level run.

MSqRobSum should have peptide-level data. So we need to find peptide-level FDR:s.

Top3 Should have q-value recalculations based on target-decoy method. (check assess_fdr_overall)

ToDo:

  • check how to compute m-score to q-value? or describe the problems of mscore more thoroughly in the paper
  • describe the psm Q.Value from DIANN in paper.
  • Feature level data (PSMs) seems to be ok for MSstats, is it ok for msqrobsum or do we need to rerun msqrobsum?
  • Can we use m_score treshold for Top3 computation. How did LFQbench Navarro et al. do ?

2021-04-27 cont. working on report... check .jupyter from same date..

2021-04-23 number of DE protein

Check jupter notebook from today...

2021-04-22 Cannot do the same top3 as in life and death.

There is no id for the peptides that are coherent across files.

We have to do column-wise top 3.

lets dubbel check this tomorrow. nos this is late.

2021-03-30

m_score now working. I get the unfiltered data and can get it through triqler.

Note: triqler expects larger searchScore to be more significant so we need to -np.log10(searcScore) if searchScore is m_score.

Why do we have multiple peptides of the same in one run? Can it be because the same m/z is in 3 different windows?

2021-03-26-28

Fragpipe installed.

Fragpipe running with bug in MSFragger part (Cannot find isolation window.) Ticket sent to MSFragger github.

Fragpipe run tested on subset of tezner data, and Galaxy tutorial data set.

Fragpipe is currently running without MSFragger... only DIA-Umpire. I need to figure out what part of DIA-Umpire is being used, perhaps the whole workflow is being used.

Reconverted galaxy tutorial subset with MS-level 1-2 peakpicking (MSconvert)

Also, notice the fragpipe consumes a lot of memory. I am nor only processing parts of 32fix ttof6600 hye124 set. (4 .mzML files) otherwise I get memory errors.

Dia-Umpire seem to complete successfully, when MS Fragger is not selected... but still the consolde gives exit code 1 and 30 cancelled processes.

Running MSFragger without DIAUmpire to check if this works.

2021-03-25 Run triqler on all data.

Run triqler on all data sets with default settings. There seem to be some issues with run and condition balancing... Some conditions dont have enough run.

Making analysis script, Triqler has much lower proteins in it than tric aligned... I suspect that the fdr-tresholding before triqler causes the issue. I have found that pyprophet already does m_score filtering of 0.05 by default.

Rerunning pyprophet with ss_iteration_fdr (Iteration FDR cutoff for best scoring targets) set to 1.0, but --ss_initial_fdr could also be a potential culprit.

2021-03-22 de analysis script.

Redoing DE analysis script for the hye124 dataset.

Learn to build hierarchical dataframes.

Check how to name levels so we can select levels and iterate trough samples.

2021-03-21 Week todos.

  • Ask Markus about Fragpipe and easyPQP for spectral library building.
  • Ask Lukas if he managed to get the spectraST spectral library building working?
  • Parse PASS00788.
  • Rebuild R for LFQBench.
  • Triqler parser.
  • Run triqler on data set.
  • Find comparative statistics (from LFQBench paper)
  • Ask Lukas if the statistics make sens.

2021-03-18 tryptic pepties

Read this about tryptic peptides. https://unipept.ugent.be/clidocs/casestudies/tpa

Looking into Lukas script about tryptic peptides.

created script for spectral lib build with both percolator and peptideprophet.

2021-03-16 spectral lib build.

Run crux bullseye on DDA data.

cruxbullseye:

bullseye.log.txt bullseye.no-pid.ms2 bullseye.params.txt bullseye.pid.ms2 hardklor.mono.txt

crux tide-index:

bullseye.log.txt bullseye.no-pid.ms2 bullseye.params.txt bullseye.pid.ms2 hardklor.mono.txt tide-index.decoy.fasta tide-index.log.txt tide-index.params.txt

crux tide-search:

�bullseye.log.txt �bullseye.no-pid.ms2 �bullseye.params.txt �bullseye.pid.ms2 �hardklor.mono.txt �tide-index.decoy.fasta �tide-index.log.txt �tide-index.params.txt �tide-search.decoy.txt �tide-search.log.txt �tide-search.params.txt �tide-search.target.txt

crux percolator �bullseye.log.txt �bullseye.no-pid.ms2 �bullseye.params.txt �bullseye.pid.ms2 �hardklor.mono.txt �make-pin.pin �percolator.decoy.peptides.txt �percolator.decoy.psms.txt �percolator.log.txt �percolator.params.txt �percolator.target.peptides.txt �percolator.target.psms.txt �tide-index.decoy.fasta �tide-index.log.txt �tide-index.params.txt �tide-search.decoy.txt �tide-search.log.txt �tide-search.params.txt �tide-search.target.txt

Mailed Lukas about issue with exlusion of ions.

2021-03-15 Check SpectraST and library build also checking Hannes Röst set.

Hannes röst set PASS01508 is the same as PASS00788. Tried to unzip one file in PASS00788. It is unzipable.

Checking LFQBench.

Building LFQBench for r.

Getting error at installing LFQBench.

Submitted following ticket.

Trying to undestand Ion Libraries Statistics in supplementary material.

  • how do they choose the ions to be excluded.
  • Are they just using Comet and Mascot... recheck this with openswath tutorial. I think we use comet/mascot first then use spectraST.

Checked out LGQBench paper. It has some useful information about database search..

I guess I downloaded the correct .fasta for human, yeast and ecoli.

(I need to check this with Lukas, how do I know which human, yeast and ecoli to choose?)

2021-03-09

Coded and modified the pyprophet_score.sh script to all data set with correct library.

Added a test for different windowed library (64var spectral library for a 32fix data set in /hdd_14T/data/PXD002952/osw_res_20210303/hye124/ttof5600/32fix_w_64v_library_test). Investigate the merged_score diagnostics for this one to see the decoy and target distribution.

  • Diagnostics plot seem to show same diagonistics plot regardless of how the spectral library was build. I should check up on how spectral library was build... (!!!!)

Coded TOP3 peptide for protein quantification in dia_sum/scripts/PXD002952/working_script_experiment_analysis.py

TOP3 approach: the three most intense peptide quantitative values of each individual run are averaged (a minimum of two peptides is required). Produced peptide and protein reports can be directly used in the main LFQbench module. FSWE filters results at the protein level, only proteins having quantification values in at least two technical replicates in at least one of the samples are considered for further analysis.

Code for diagnostics summary and protein quantification using top3... working on it, functions defined.

At this moment hye110_ttof6600/32fix seem to have:

experiment_id sample_id proteins 0 001-Pedro [A] 2005 1 002-Pedro [B] 2008 2 004-Pedro [B] 2025 3 003-Pedro [A] 1941 4 010-Pedro [A] 1887

Todo:

  • Think about how to diff. exp on proteins... what about non-overlapping proteins? How should i differentially express these?

  • How do I compute and compare, impute on protein?

  • Perhaps I should build a dataframe with protein as index and and protein quantity for each sample as...?

  • Perhaps Averageing over sample A proteins and sample B proteins would work? for sample A vs B comparison but not quantification comparison.

  • I need to make a 5x5 matrix with comparisons. What do I do with NaN proteins... they just cant compare... I guess we drop the nan comparisons...?

  • What about TRIC?

2021-03-08

Performing pyprophet merging, scoring and legacy export on all data set components.

ToDo.

  • Check up on library serach method used in tenzer
  • make diff exp. analysis on legacy tsv.
  • converter to triqler
  • Read up on proteoform.

2021-03-04

Checked computation from last log... forgot to make tmp file. Need to rerun computation.

2021-02-26

Meeting.

Reading on compositional data (links from meeting).

Think about:

  • Why is Life and Death data compositional data.
  • Check mail how is the data normalized?

MSconvert centroided .wiff files completed...

centroided files are much smaller (7.5GB - 13.7GB) vs (7.8GB - 74.4GB).

Run openSwath_run_centroided.sh on centroided dataset.

2021-02-25

Got Tenzer set working.

Whole set not working.

Need to check subsampling process for pyprophet, but first write about process.

Started msconvert with peakpicking (centroided - default option) for tenzer set.

Tenzer retention time

Outliers havebeen defined as producing a standard deviation of the peak retention time greater than0.2 minutes relative to all other software tools detecting that precursor, after removingambiguous cases, in which more than one software tool produce a greater standard deviationin the peak retention time

Note from Röst et al. 2014 OpenSwath analysis require iRT calibration.

Updated project description.

2021-02-22

Just got the iRT file from Tenzer. I should start the computations.

Check Tenzer article about how protein summarization is made.

Started PXD002952 computations with hroest_iRT.TraML file.

2021-02-19

Worked very late... Jupyter notebook created and has log.

2021-02-18

...also filling in log from 2021-02-07 evening session.

Downloaded https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/DIA_Analysis_OSW/tutorial.html#get-data

It is now in "\hdd_14T\galaxyTutorial"

Converted file with msconvert using instructions from the webpage.

Started OpenSwathWorkflow with the parameters specified in webpage.

Parsed PASS01508 data to triqler format and ran triqler.

Output was a matrix of 1.0

Matthew suggested log-transforming m_score or using percolator to concatenate multiple m_scores to one consensus score before running triqler for better results.

OSW output results log here

2021-02-17

Can we use docker for end-results?

Can we interface triqler from multiple methods quickly?

converting in PASS01508/results/pyProphetOutput/... log here

I suspect the data files are corrupt from the ftp download. I cannot convert the .mzML.gz to .mzML and I cannot convert all .wiff files. (PASS01508)

I will try to convert the results files in the folder while I'm waiting for files to download and convert.

2021-02-15 Creating sh script.

Triqler did not work because PASS00779 had only one condition. It requires at least two samples per condition and at least two conditions.

The PXD014956 set I found before has all the spectralLibraries with normalized iRT build in them(?), but it does not have window size files.

I do not really understand the window size of 49 between 350-1400mz range, but I will create a script to try to run OpenSwathAnalyzer on both settings

Created swath_window generation script.

Started running script for PXD014956.

Started downloading PASS01508

Just found another OpenSwath Tutorial paper with a data sets with 2 condition (S. pyogenes frown in 0% and 10% human plasma grown in duplicates), 4 technical replicates for each sample. Thus, the dataset has 15 runs in total.

2021-02-13 Looking for dataset.

Looking for DIA data set.

Investigating OpenSWATH**.tsv files in PXD002952. These seem to be the results from tric for the plot used in the paper. I can not find a iRT-normazliation file in the repo.

Just found Barkovitz et al. 2020 with the dataset PXD014956. I will try quantify these proteins using OpenSWATH to see if I get better results than before.

link to instructions to run openswath on PXD014956

2021-02-12 Creating triqler parser for TRIC output.

Meeting.

I should check the difference between rt for different peptides in PXD002952 and PASS00779.

I'll start with creating a triqler parser.

Compare:

  • aligned_osw.tsv (PASS00779)
  • ecolihumanyeast_concat_mayu_IR... (PXD002952)
  • iRTassays.tsv (PASS00779)
  • Mtb_TubercuList-R27_iRT_UPS_dec... (PASS00779)
  • osw_output.tsv (PXD002952)

To get a feeling for what iRT normaliztion does.

2021-02-11 Experiment cont.

Potential errors to check.

OpenSwathDecoyGenerator

.tsv output files

Potential errors:

  • missing a file hroest_DIA_iRT.tsv
  • OpenSwathDecoyGenerator method.

Just realized the pyprophet for py27 and py3 is not the same.

Tutorial is using the legacy pyprophet for py27. This could, but should not matter...

Generating new results for a random sample overview statistics:

qvalue svalue TP FP TN FN FDR sens cutoff 0 0.722734178065634 4917 7379 29586.5112258576 1886.48877414245 0.600107229722898 0.722717441482039 0.848865877072642

note -TP:4917 -FP:7379

Lets look at another data set and se if we can find a dataset with iRT.TraMLva

Checking the pyprophet with iRT.TraML file for PASS00779.

Experiment_sampleA_sampleA results:

PXD002952 results:

pyprophet results

pyprophet results

summary stats:

pyprophet results

osw results

summary stats:

We notice the difference that PXD002952 histograms perfectly overlap, PASS00779 histograms have a slight shift between target and decoy.

Looking at the pyprophet results, we see that the results from PASS00779 have a much larger shift in Top Peak Groups d_score density for all results than the results shown in "test report" for PXD002952. The "test report" however does have a slight shift in d_score.

Also looking at the summary statistics we see that there is more FP and TP for PXD002952. Indicating something is wrong.

According to Röst et al. 2016. The report step provides an opportunity to identify common errors, such as using an assay library without decoys or using an assay library unsuited for the measured sample (e.g., from another organism).

The library for PXD002952 used is the one provided in the PXD002952 repo. Therefore, it should be correct. The only file I have not used is the hroest_DIA_iRT.TraML for rt normalization in the openswath run, which could be the culprit for the error in pyprophet statistical validation for PXD002952. In successful runs, the target and decoy d_score should be clearly seperated as in the PASS00779 results.

The tric cross-run alignment worked for PASS00779 log.

IMPORTANT NOTES:

  • the pyprophet used in tutorials are the py27 (parameter args are different for py36).
  • msproteomicstools (tric) can only be used with py36

Triqler input mapping (thoughts): {run: run_id, condition: filename, charge: Charge, searchScore: var_dotprod_score or var_library_corr or var_library_dotprod or var_library_manhattan or var_library_rmsd or var_library rootmeansq or var_library_sangle., m_score (mProphet q-value score, d_score - discriminant score.), intensity: intensity, peptide: FullPeptideName, protein: ProteinName }

2021-02-10 Experiments in experiment folder conducted.

Experiments and experiment log in folder...

2021-02-09 PercolatorAdapter running and want to get TRIC working.

openSwath_run.sh script works now. Errors from before:

  • Had not matched correct spectral library to correct data sets. Used one for alla data sets.
  • had not created a tmp folder for tempCache

Some files in the run is crashing... I will link to log when the run is finished.

The following error message is shown:

Thread 2_0 will analyze 6733 compounds and 40398 transitions from SWATH 4 (batch 6 out of 6)
      18.18 %               Thread 1_0 will analyze 6573 compounds and 39438 transitions from SWATH 7 (batch 3 out of 6)
      21.21 %               Thread 1_0 will analyze 6573 compounds and 39438 transitions from SWATH 7 (batch 4 out of 6)
Thread 1_0 will analyze 6573 compounds and 39438 transitions from SWATH 7 (batch 5 out of 6)
Thread 5_0 will analyze 6041 compounds and 36246 transitions from SWATH 8 (batch 0 out of 6)
./openSwath_run.sh: line 6: 346530 Killed

Not sure why it is killed, but it might not matter because I have a full set from the experiment with 64 fixed windows DIA run.

Using 64 fixed windows DIA run and running percolator on that set with the scripts.

  • percolator_on_osw_results.sh - (NOTE: for loop prefix of files need to be specified in the script for new files)

Creating a log folder in github, so I can store and make this log more compact.

percolator_on_osw_results_log

Next up... convert .osw file to .tsv using PyProphet.

This seemed to fail... I did not finish the log because it froze the terminal. The log seemed to indicate something about failed SQL queries. I guess it has to do with PercolatorAdapter not using the same columns as mProphet. I will use PyProphet for statistical validation.

Produced script pyprophet_on_osw_results.sh to perform PyProphet statistical validation on .osw results.

According to docs pyprophet_on_osw_results.sh methods was wrong. The .sh file is removed.

We should merge all the .osw file for scoring.

New scripts produced:

  • pyprophet_merge.sh - for merging all the .osw files (should be without merged.osw output in folder)
  • pyprophet_score.sh - for scoring merged.osw and outputting merged.scored.osw

About pyprophet run: http://openswath.org/en/latest/docs/pyprophet.html

About converting sqlite .osw format to .tsv (only for pyprophet output) http://openswath.org/en/latest/docs/sqlite.html

PyProphet scoring seemed to have worked for ms1 and ms2, but not transition(?). (log here)

Results pdf: ms1 results ms2 results

Converting merged.scored.osw to .tsv with pyprophet_export_tsv.sh.

Check this warning

By default, IPF results on peptidoform-level will be used if available. This can be disabled by setting --ipf=disable. The IPF results require different properties for TRIC. Please ensure that you want to analyze the results in the context of IPF, else, use the --ipf=disable or --ipf=augmented settings.

Running TRIC. (see this)

~/tools/msproteomicstools/analysis/alignment/feature_alignment.py --in MERGED.tsv --out aligned.csv --method LocalMST --realign_method lowess_cython --max_rt_diff 60

tric_default_20200209_log.txt

~/tools/msproteomicstools/analysis/alignment/feature_alignment.py --in MERGED.tsv --out aligned.csv --method LocalMST --realign_method lowess_cython --max_rt_diff 60 --fdr_cutoff 1 > tric_fdr_1_20200209.txt

tric fdrcutoff 1

~/tools/msproteomicstools/analysis/alignment/feature_alignment.py --in MERGED.tsv --out aligned.csv --method LocalMST --realign_method lowess_cython --max_rt_diff 60 --alignment_score 1 > tric_alignmentscore_1_20200209.txt

tric run alignment score 1.0

~/tools/msproteomicstools/analysis/alignment/feature_alignment.py --in MERGED.tsv --out aligned.csv --method LocalMST --realign_method lowess_cython --max_rt_diff 60 --alignment_score 1 --fdr_cutoff 1 > tric_alignmentscore_1_fdrcutoff_1_20200209.txt

tric alignment 1.0 fdrcutoff 1.0

~/tools/msproteomicstools/analysis/alignment/feature_alignment.py --in MERGED.tsv --out aligned.csv --method LocalMST --realign_method lowess_cython --max_rt_diff 60 --alignment_score 1 --fdr_cutoff 1 > tric_alignmentscore_1_fdrcutoff_1_cylowess_installed_20200209.tx

tric alignment 1.0 fdrcutoff 1.0

Success!

Now, the question is why do i have to have so high alignment_score, fdr_cutoff and what is d-score?

d-score p.8 of OpenSWATH enabled automated...; says it is the discriminating meta score of mProphet (the score from the semi-supervised algorithm used in mProphet Rosenberger et al. 2017).

I should perhaps read the mProphet paper (Reiter et al. 2011)... I will check this in group meeting tomorrow.

What is next step:

  • How do I set alignment score?
  • How do I set fdr cutoff?
  • Why do I need to have the fdr cutoff so high? (Have I messed up some parameter?).
  • Parse the triq-results to triqler?
  • What is the next step for results after TRIC?
  • Check description and continue rewriting it.

2021-02-08 openSwath.sh run crashed.

Forgot to make tmp folder, runs crashed... created tmp folder. Rerunning openSwath.sh.

Copied from project description and slightly rewrote About section.

Reordered blog entries so that we do not need to scroll all the way down.

2021-02-07 Making script for doing the same procedure for the whole dataset.

Greated scripts for running the whole pipeline with all files so that all files have different spectral libraries.

Note 64w_fixed spectral library is in .csv file, and need to be changed to .txt before running this.

Scripts should be run in order.

  • spectral_library_txt_to_csv.sh - converts .txt to tsv.
  • convert_spectral_lib.sh - created .TraML files from .tsv files.
  • make_decoy_spectral_library.sh - appends decoys with pseudo-reverse method to .TraML files.
  • make_pqp_from_decoy_spectral_library.sh - converts the .TraML with decoys to .pqp files.
  • openSwath_run.sh - runs openswath with correct spectral library for different window size.

openSwath_run.sh run in progress...

Next step edit about text and refresh the project description.

2021-02-05

Bugg report... rerunning the whole pipeline to get all the commands and outputs in one log.

  1. Converting the spectral library to .TraML and .PQP

(base) ptruong@planck:/hdd_14T/data/PXD002952/bugg_report_20210205$ TargetedFileConverter -in ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated.tsv -out ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated.TraML Progress of 'conversion to internal data representation': -- done [took 0.93 s (CPU), 0.94 s (Wall)] -- TargetedFileConverter took 8.23 s (wall), 8.17 s (CPU), 0.67 s (system), 7.50 s (user); Peak Memory Usage: 503 MB. (base) ptruong@planck:/hdd_14T/data/PXD002952/bugg_report_20210205$ TargetedFileConverter -in ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated.tsv -out ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated.pqp Progress of 'conversion to internal data representation': -- done [took 0.98 s (CPU), 0.98 s (Wall)] -- TargetedFileConverter took 24.13 s (wall), 22.41 s (CPU), 0.53 s (system), 21.88 s (user); Peak Memory Usage: 662 MB.

  1. Append decoy assays to the TraML file for OpenSWATH

(base) ptruong@planck:/hdd_14T/data/PXD002952/bugg_report_20210205$ OpenSwathDecoyGenerator -in ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated.TraML -out ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_decoy.TraML -method pseudo-reverse -separate Loading targets from file: ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated.TraML Generate decoys Progress of 'Generating decoy peptides': -- done [took 0.13 s (CPU), 0.13 s (Wall)] -- Progress of 'Generating decoy transitions': -- done [took 9.08 s (CPU), 9.28 s (Wall)] -- Number of target peptides: 46000 Number of decoy peptides: 45992 Number of target proteins: 6921 Number of decoy proteins: 6921 Writing only decoys to file: ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_decoy.TraML OpenSwathDecoyGenerator took 37.49 s (wall), 36.95 s (CPU), 0.90 s (system), 36.05 s (user); Peak Memory Usage: 698 MB.

(base) ptruong@planck:/hdd_14T/data/PXD002952/bugg_report_20210205$ OpenSwathDecoyGenerator -in ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated.TraML -out ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_decoy.TraML -method pseudo-reverse Loading targets from file: ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated.TraML Generate decoys Progress of 'Generating decoy peptides': -- done [took 0.12 s (CPU), 0.12 s (Wall)] -- Progress of 'Generating decoy transitions': -- done [took 8.61 s (CPU), 8.65 s (Wall)] -- Number of target peptides: 46000 Number of decoy peptides: 45992 Number of target proteins: 6921 Number of decoy proteins: 6921 Writing targets and decoys to file: ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_decoy.TraML OpenSwathDecoyGenerator took 42.90 s (wall), 39.96 s (CPU), 1.45 s (system), 38.51 s (user); Peak Memory Usage: 952 MB.

According to:

https://abibuilder.informatik.uni-tuebingen.de/archive/openms/Documentation/release/latest/html/TOPP_OpenSwathDecoyGenerator.html

-seperate creates a seperate target file.

2.1 Checking the file size.

362M ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_decoy_sep.TraML 721M ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_target_decoy.TraML 36M ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated.pqp 359M ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated.TraML 59M ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated.tsv

target_decoy.TraML seem to be about double the size of decoy and target files. So this looks correct.

  1. OpenSwathAnalyser 3.1 .tsv output

(base) ptruong@planck:/hdd_14T/data/PXD002952/bugg_report_20210205$ OpenSwathWorkflow -in HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML -tr ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_target_decoy.TraML -out_tsv osw_output.tsv -tempDirectory tmp -readOptions cacheWorkingInMemory -batchSize 1000 -Scoring:stop_report_after_feature -1 -min_upper_edge_dist 1 -extra_rt_extraction_window 100 -min_rsq 0.95 -min_coverage 0.6 -Scoring:Scores:use_dia_scores true -rt_extraction_window 600 -mz_extraction_window 30 -threads 6 Since neither rt_norm nor tr_irt is set, OpenSWATH will not use RT-transformation (rather a null transformation will be applied) Progress of 'Load TraML file': -- done [took 26.06 s (CPU), 26.14 s (Wall)] -- Loaded 13842 proteins, 91992 compounds with 551952 transitions. Loading mzML file HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML using readoptions cache Progress of 'Loading metadata file HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML': Will analyze the metadata first to determine the number of SWATH windows and the window sizes. Determined there to be 32 SWATH windows and in total 2197 MS1 spectra Determined there to be 32 SWATH windows and in total 2197 MS1 spectra -- done [took 01:25 m (CPU), 02:23 m (Wall)] -- Progress of 'Loading data file HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML': Read chromatogram while reading SWATH files, did not expect that!

-- done [took 21:58 m (CPU), 17:00 m (Wall)] -- Will analyze 551952 transitions in total.

Progress of 'Extracting and scoring transitions': Use non-nested loop with 6 threads. 3.03 % Thread 3_0 will analyze 2421 compounds and 14526 transitions from SWATH 2 (batch 0 out of 2) Thread 0_0 will analyze 1828 compounds and 10968 transitions from SWATH 1 (batch 0 out of 1) Thread 3_0 will analyze 2421 compounds and 14526 transitions from SWATH 2 (batch 1 out of 2) Thread 1_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 0 out of 3) Thread 2_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 0 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 0 out of 3) Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 0 out of 3) Thread 0_0 will analyze 1828 compounds and 10968 transitions from SWATH 1 (batch 1 out of 1) Thread 3_0 will analyze 2421 compounds and 14526 transitions from SWATH 2 (batch 2 out of 2) Thread 1_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 1 out of 3) Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 1 out of 3) Thread 2_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 1 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 1 out of 3) 6.06 % Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 2 out of 3) Thread 1_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 2 out of 3) Thread 2_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 2 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 2 out of 3) Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 3 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 3 out of 3) Thread 1_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 3 out of 3) Thread 2_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 3 out of 3) 18.18 % Thread 0_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 0 out of 3) Thread 3_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 0 out of 3) 21.21 % Thread 0_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 1 out of 3) Thread 3_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 1 out of 3) Thread 3_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 2 out of 3) Thread 0_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 2 out of 3) Thread 3_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 3 out of 3) Thread 0_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 3 out of 3) 27.27 % Thread 5_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 0 out of 4) Thread 4_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 0 out of 4) Thread 1_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 0 out of 3) Thread 2_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 0 out of 3) Thread 5_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 1 out of 4) Thread 4_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 1 out of 4) Thread 1_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 1 out of 3) Thread 3_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 0 out of 3) Thread 2_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 1 out of 3) Thread 1_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 2 out of 3) Thread 0_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 0 out of 3) Thread 5_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 2 out of 4) Thread 4_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 2 out of 4) Thread 3_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 1 out of 3) Thread 2_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 2 out of 3) Thread 1_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 3 out of 3) Thread 5_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 3 out of 4) Thread 4_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 3 out of 4) Thread 0_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 1 out of 3) Thread 3_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 2 out of 3) 30.30 % Thread 2_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 3 out of 3) Thread 5_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 4 out of 4) Thread 4_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 4 out of 4) 36.36 % Thread 3_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 3 out of 3) 39.39 % Thread 0_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 2 out of 3) 42.42 % Thread 0_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 3 out of 3) 45.45 % Thread 1_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 0 out of 3) Thread 1_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 1 out of 3) Thread 1_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 2 out of 3) Thread 1_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 3 out of 3) 48.48 % Thread 4_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 0 out of 3) Thread 2_0 will analyze 2765 compounds and 16590 transitions from SWATH 18 (batch 0 out of 2) Thread 3_0 will analyze 2951 compounds and 17706 transitions from SWATH 19 (batch 0 out of 2) Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 0 out of 3) Thread 2_0 will analyze 2765 compounds and 16590 transitions from SWATH 18 (batch 1 out of 2) Thread 4_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 1 out of 3) Thread 3_0 will analyze 2951 compounds and 17706 transitions from SWATH 19 (batch 1 out of 2) Thread 0_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 0 out of 3) Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 1 out of 3) Thread 2_0 will analyze 2765 compounds and 16590 transitions from SWATH 18 (batch 2 out of 2) Thread 4_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 2 out of 3) Thread 1_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 0 out of 3) Thread 3_0 will analyze 2951 compounds and 17706 transitions from SWATH 19 (batch 2 out of 2) Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 2 out of 3) 51.52 % Thread 0_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 1 out of 3) Thread 4_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 3 out of 3) 54.55 % Thread 1_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 1 out of 3) Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 3 out of 3) 60.61 % Thread 0_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 2 out of 3) Thread 1_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 2 out of 3) Thread 0_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 3 out of 3) Thread 1_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 3 out of 3) 63.64 % Thread 2_0 will analyze 2888 compounds and 17328 transitions from SWATH 22 (batch 0 out of 2) Thread 2_0 will analyze 2888 compounds and 17328 transitions from SWATH 22 (batch 1 out of 2) Thread 2_0 will analyze 2888 compounds and 17328 transitions from SWATH 22 (batch 2 out of 2) 69.70 % Thread 3_0 will analyze 2756 compounds and 16536 transitions from SWATH 23 (batch 0 out of 2) Thread 3_0 will analyze 2756 compounds and 16536 transitions from SWATH 23 (batch 1 out of 2) Thread 3_0 will analyze 2756 compounds and 16536 transitions from SWATH 23 (batch 2 out of 2) Thread 4_0 will analyze 2377 compounds and 14262 transitions from SWATH 24 (batch 0 out of 2) 72.73 % Thread 4_0 will analyze 2377 compounds and 14262 transitions from SWATH 24 (batch 1 out of 2) Thread 5_0 will analyze 2491 compounds and 14946 transitions from SWATH 25 (batch 0 out of 2) Thread 1_0 will analyze 2172 compounds and 13032 transitions from SWATH 27 (batch 0 out of 2) Thread 0_0 will analyze 2356 compounds and 14136 transitions from SWATH 26 (batch 0 out of 2) Thread 4_0 will analyze 2377 compounds and 14262 transitions from SWATH 24 (batch 2 out of 2) Thread 5_0 will analyze 2491 compounds and 14946 transitions from SWATH 25 (batch 1 out of 2) Thread 1_0 will analyze 2172 compounds and 13032 transitions from SWATH 27 (batch 1 out of 2) 75.76 % Thread 5_0 will analyze 2491 compounds and 14946 transitions from SWATH 25 (batch 2 out of 2) Thread 0_0 will analyze 2356 compounds and 14136 transitions from SWATH 26 (batch 1 out of 2) Thread 1_0 will analyze 2172 compounds and 13032 transitions from SWATH 27 (batch 2 out of 2) 81.82 % Thread 0_0 will analyze 2356 compounds and 14136 transitions from SWATH 26 (batch 2 out of 2) Thread 2_0 will analyze 2029 compounds and 12174 transitions from SWATH 28 (batch 0 out of 2) 84.85 % Thread 2_0 will analyze 2029 compounds and 12174 transitions from SWATH 28 (batch 1 out of 2) Thread 2_0 will analyze 2029 compounds and 12174 transitions from SWATH 28 (batch 2 out of 2) 87.88 % Thread 3_0 will analyze 1529 compounds and 9174 transitions from SWATH 29 (batch 0 out of 1) Thread 3_0 will analyze 1529 compounds and 9174 transitions from SWATH 29 (batch 1 out of 1) 90.91 % Thread 4_0 will analyze 1577 compounds and 9462 transitions from SWATH 30 (batch 0 out of 1) Thread 4_0 will analyze 1577 compounds and 9462 transitions from SWATH 30 (batch 1 out of 1) Thread 1_0 will analyze 1201 compounds and 7206 transitions from SWATH 31 (batch 0 out of 1) Thread 5_0 will analyze 546 compounds and 3276 transitions from SWATH 32 (batch 0 out of 1) 93.94 % Thread 5_0 will analyze 546 compounds and 3276 transitions from SWATH 32 (batch 1 out of 1) 96.97 % Thread 1_0 will analyze 1201 compounds and 7206 transitions from SWATH 31 (batch 1 out of 1) 100.00 %
-- done [took 11:17 m (CPU), 06:14 m (Wall)] -- OpenSwathWorkflow took 26:04 m (wall), 35:08 m (CPU), 01:40 m (system), 33:28 m (user); Peak Memory Usage: 22756 MB.

3.2 .osw output (base) ptruong@planck:/hdd_14T/data/PXD002952/bugg_report_20210205$ OpenSwathWorkflow -in HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML -tr ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_target_decoy.TraML -out_tsv osw_output.tsv -tempDirectory tmp -readOptions cacheWorkingInMemory -batchSize 1000 -Scoring:stop_report_after_feature -1 -min_upper_edge_dist 1 -extra_rt_extraction_window 100 -min_rsq 0.95 -min_coverage 0.6 -Scoring:Scores:use_dia_scores true -rt_extraction_window 600 -mz_extraction_window 30 -threads 6 Since neither rt_norm nor tr_irt is set, OpenSWATH will not use RT-transformation (rather a null transformation will be applied) Progress of 'Load TraML file': -- done [took 26.06 s (CPU), 26.14 s (Wall)] -- Loaded 13842 proteins, 91992 compounds with 551952 transitions. Loading mzML file HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML using readoptions cache Progress of 'Loading metadata file HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML': Will analyze the metadata first to determine the number of SWATH windows and the window sizes. Determined there to be 32 SWATH windows and in total 2197 MS1 spectra Determined there to be 32 SWATH windows and in total 2197 MS1 spectra -- done [took 01:25 m (CPU), 02:23 m (Wall)] -- Progress of 'Loading data file HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML': Read chromatogram while reading SWATH files, did not expect that!

-- done [took 21:58 m (CPU), 17:00 m (Wall)] -- Will analyze 551952 transitions in total.

Progress of 'Extracting and scoring transitions': Use non-nested loop with 6 threads. 3.03 % Thread 3_0 will analyze 2421 compounds and 14526 transitions from SWATH 2 (batch 0 out of 2) Thread 0_0 will analyze 1828 compounds and 10968 transitions from SWATH 1 (batch 0 out of 1) Thread 3_0 will analyze 2421 compounds and 14526 transitions from SWATH 2 (batch 1 out of 2) Thread 1_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 0 out of 3) Thread 2_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 0 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 0 out of 3) Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 0 out of 3) Thread 0_0 will analyze 1828 compounds and 10968 transitions from SWATH 1 (batch 1 out of 1) Thread 3_0 will analyze 2421 compounds and 14526 transitions from SWATH 2 (batch 2 out of 2) Thread 1_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 1 out of 3) Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 1 out of 3) Thread 2_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 1 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 1 out of 3) 6.06 % Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 2 out of 3) Thread 1_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 2 out of 3) Thread 2_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 2 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 2 out of 3) Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 3 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 3 out of 3) Thread 1_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 3 out of 3) Thread 2_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 3 out of 3) 18.18 % Thread 0_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 0 out of 3) Thread 3_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 0 out of 3) 21.21 % Thread 0_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 1 out of 3) Thread 3_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 1 out of 3) Thread 3_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 2 out of 3) Thread 0_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 2 out of 3) Thread 3_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 3 out of 3) Thread 0_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 3 out of 3) 27.27 % Thread 5_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 0 out of 4) Thread 4_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 0 out of 4) Thread 1_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 0 out of 3) Thread 2_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 0 out of 3) Thread 5_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 1 out of 4) Thread 4_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 1 out of 4) Thread 1_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 1 out of 3) Thread 3_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 0 out of 3) Thread 2_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 1 out of 3) Thread 1_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 2 out of 3) Thread 0_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 0 out of 3) Thread 5_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 2 out of 4) Thread 4_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 2 out of 4) Thread 3_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 1 out of 3) Thread 2_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 2 out of 3) Thread 1_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 3 out of 3) Thread 5_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 3 out of 4) Thread 4_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 3 out of 4) Thread 0_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 1 out of 3) Thread 3_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 2 out of 3) 30.30 % Thread 2_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 3 out of 3) Thread 5_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 4 out of 4) Thread 4_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 4 out of 4) 36.36 % Thread 3_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 3 out of 3) 39.39 % Thread 0_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 2 out of 3) 42.42 % Thread 0_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 3 out of 3) 45.45 % Thread 1_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 0 out of 3) Thread 1_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 1 out of 3) Thread 1_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 2 out of 3) Thread 1_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 3 out of 3) 48.48 % Thread 4_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 0 out of 3) Thread 2_0 will analyze 2765 compounds and 16590 transitions from SWATH 18 (batch 0 out of 2) Thread 3_0 will analyze 2951 compounds and 17706 transitions from SWATH 19 (batch 0 out of 2) Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 0 out of 3) Thread 2_0 will analyze 2765 compounds and 16590 transitions from SWATH 18 (batch 1 out of 2) Thread 4_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 1 out of 3) Thread 3_0 will analyze 2951 compounds and 17706 transitions from SWATH 19 (batch 1 out of 2) Thread 0_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 0 out of 3) Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 1 out of 3) Thread 2_0 will analyze 2765 compounds and 16590 transitions from SWATH 18 (batch 2 out of 2) Thread 4_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 2 out of 3) Thread 1_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 0 out of 3) Thread 3_0 will analyze 2951 compounds and 17706 transitions from SWATH 19 (batch 2 out of 2) Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 2 out of 3) 51.52 % Thread 0_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 1 out of 3) Thread 4_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 3 out of 3) 54.55 % Thread 1_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 1 out of 3) Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 3 out of 3) 60.61 % Thread 0_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 2 out of 3) Thread 1_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 2 out of 3) Thread 0_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 3 out of 3) Thread 1_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 3 out of 3) 63.64 % Thread 2_0 will analyze 2888 compounds and 17328 transitions from SWATH 22 (batch 0 out of 2) Thread 2_0 will analyze 2888 compounds and 17328 transitions from SWATH 22 (batch 1 out of 2) Thread 2_0 will analyze 2888 compounds and 17328 transitions from SWATH 22 (batch 2 out of 2) 69.70 % Thread 3_0 will analyze 2756 compounds and 16536 transitions from SWATH 23 (batch 0 out of 2) Thread 3_0 will analyze 2756 compounds and 16536 transitions from SWATH 23 (batch 1 out of 2) Thread 3_0 will analyze 2756 compounds and 16536 transitions from SWATH 23 (batch 2 out of 2) Thread 4_0 will analyze 2377 compounds and 14262 transitions from SWATH 24 (batch 0 out of 2) 72.73 % Thread 4_0 will analyze 2377 compounds and 14262 transitions from SWATH 24 (batch 1 out of 2) Thread 5_0 will analyze 2491 compounds and 14946 transitions from SWATH 25 (batch 0 out of 2) Thread 1_0 will analyze 2172 compounds and 13032 transitions from SWATH 27 (batch 0 out of 2) Thread 0_0 will analyze 2356 compounds and 14136 transitions from SWATH 26 (batch 0 out of 2) Thread 4_0 will analyze 2377 compounds and 14262 transitions from SWATH 24 (batch 2 out of 2) Thread 5_0 will analyze 2491 compounds and 14946 transitions from SWATH 25 (batch 1 out of 2) Thread 1_0 will analyze 2172 compounds and 13032 transitions from SWATH 27 (batch 1 out of 2) 75.76 % Thread 5_0 will analyze 2491 compounds and 14946 transitions from SWATH 25 (batch 2 out of 2) Thread 0_0 will analyze 2356 compounds and 14136 transitions from SWATH 26 (batch 1 out of 2) Thread 1_0 will analyze 2172 compounds and 13032 transitions from SWATH 27 (batch 2 out of 2) 81.82 % Thread 0_0 will analyze 2356 compounds and 14136 transitions from SWATH 26 (batch 2 out of 2) Thread 2_0 will analyze 2029 compounds and 12174 transitions from SWATH 28 (batch 0 out of 2) 84.85 % Thread 2_0 will analyze 2029 compounds and 12174 transitions from SWATH 28 (batch 1 out of 2) Thread 2_0 will analyze 2029 compounds and 12174 transitions from SWATH 28 (batch 2 out of 2) 87.88 % Thread 3_0 will analyze 1529 compounds and 9174 transitions from SWATH 29 (batch 0 out of 1) Thread 3_0 will analyze 1529 compounds and 9174 transitions from SWATH 29 (batch 1 out of 1) 90.91 % Thread 4_0 will analyze 1577 compounds and 9462 transitions from SWATH 30 (batch 0 out of 1) Thread 4_0 will analyze 1577 compounds and 9462 transitions from SWATH 30 (batch 1 out of 1) Thread 1_0 will analyze 1201 compounds and 7206 transitions from SWATH 31 (batch 0 out of 1) Thread 5_0 will analyze 546 compounds and 3276 transitions from SWATH 32 (batch 0 out of 1) 93.94 % Thread 5_0 will analyze 546 compounds and 3276 transitions from SWATH 32 (batch 1 out of 1) 96.97 % Thread 1_0 will analyze 1201 compounds and 7206 transitions from SWATH 31 (batch 1 out of 1) 100.00 %
-- done [took 11:17 m (CPU), 06:14 m (Wall)] -- OpenSwathWorkflow took 26:04 m (wall), 35:08 m (CPU), 01:40 m (system), 33:28 m (user); Peak Memory Usage: 22756 MB. (base) ptruong@planck:/hdd_14T/data/PXD002952/bugg_report_20210205$ OpenSwathWorkflow -in HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML -tr ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_target_decoy.TraML -out_osw osw_output.osw -tempDirectory tmp -readOptions cacheWorkingInMemory -batchSize 1000 -Scoring:stop_report_after_feature -1 -min_upper_edge_dist 1 -extra_rt_extraction_window 100 -min_rsq 0.95 -min_coverage 0.6 -Scoring:Scores:use_dia_scores true -rt_extraction_window 600 -mz_extraction_window 30 -threads 6 Since neither rt_norm nor tr_irt is set, OpenSWATH will not use RT-transformation (rather a null transformation will be applied) Error: Unexpected internal error (OSW output files can only be generated in combination with PQP input files (-tr).)

3.3 Generating the .pqp file from the target_decoy .TraML. (base) ptruong@planck:/hdd_14T/data/PXD002952/bugg_report_20210205$ TargetedFileConverter -in ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_target_decoy.TraML -out ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_target_decoy.pqp TargetedFileConverter took 47.63 s (wall), 45.86 s (CPU), 0.86 s (system), 45.00 s (user); Peak Memory Usage: 1173 MB.

3.4 Generating .osw output with openSwath

(base) ptruong@planck:/hdd_14T/data/PXD002952/bugg_report_20210205$ OpenSwathWorkflow -in HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML -tr ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_var_curated_target_decoy.pqp -out_osw osw_output.osw -tempDirectory tmp -readOptions cacheWorkingInMemory -batchSize 1000 -Scoring:stop_report_after_feature -1 -min_upper_edge_dist 1 -extra_rt_extraction_window 100 -min_rsq 0.95 -min_coverage 0.6 -Scoring:Scores:use_dia_scores true -rt_extraction_window 600 -mz_extraction_window 30 -threads 6 Since neither rt_norm nor tr_irt is set, OpenSWATH will not use RT-transformation (rather a null transformation will be applied) Progress of 'Load PQP file': -- done [took 11.81 s (CPU), 11.82 s (Wall)] -- Loaded 13842 proteins, 91992 compounds with 551952 transitions. Loading mzML file HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML using readoptions cache Progress of 'Loading metadata file HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML': Will analyze the metadata first to determine the number of SWATH windows and the window sizes. Determined there to be 32 SWATH windows and in total 2197 MS1 spectra Determined there to be 32 SWATH windows and in total 2197 MS1 spectra -- done [took 01:26 m (CPU), 02:23 m (Wall)] -- Progress of 'Loading data file HYE110_TTOF6600_32var_lgillet_I160309_001_Pedro_Sample_A.mzML': Read chromatogram while reading SWATH files, did not expect that!

-- done [took 21:58 m (CPU), 17:01 m (Wall)] -- Will analyze 551952 transitions in total.

Progress of 'Extracting and scoring transitions': Use non-nested loop with 6 threads. 3.03 % Thread 2_0 will analyze 2421 compounds and 14526 transitions from SWATH 2 (batch 0 out of 2) Thread 3_0 will analyze 1828 compounds and 10968 transitions from SWATH 1 (batch 0 out of 1) Thread 0_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 0 out of 3) Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 0 out of 3) Thread 1_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 0 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 0 out of 3) Thread 3_0 will analyze 1828 compounds and 10968 transitions from SWATH 1 (batch 1 out of 1) Thread 2_0 will analyze 2421 compounds and 14526 transitions from SWATH 2 (batch 1 out of 2) Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 1 out of 3) Thread 0_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 1 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 1 out of 3) Thread 1_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 1 out of 3) 6.06 % Thread 2_0 will analyze 2421 compounds and 14526 transitions from SWATH 2 (batch 2 out of 2) Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 2 out of 3) Thread 0_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 2 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 2 out of 3) Thread 3_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 0 out of 3) 9.09 % Thread 1_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 2 out of 3) Thread 4_0 will analyze 3261 compounds and 19566 transitions from SWATH 3 (batch 3 out of 3) Thread 0_0 will analyze 3443 compounds and 20658 transitions from SWATH 6 (batch 3 out of 3) Thread 2_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 0 out of 3) Thread 5_0 will analyze 3192 compounds and 19152 transitions from SWATH 4 (batch 3 out of 3) 12.12 % Thread 3_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 1 out of 3) Thread 1_0 will analyze 3945 compounds and 23670 transitions from SWATH 5 (batch 3 out of 3) 18.18 % Thread 2_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 1 out of 3) Thread 3_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 2 out of 3) 21.21 % Thread 2_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 2 out of 3) Thread 3_0 will analyze 3476 compounds and 20856 transitions from SWATH 7 (batch 3 out of 3) 24.24 % Thread 2_0 will analyze 3872 compounds and 23232 transitions from SWATH 8 (batch 3 out of 3) 27.27 % Thread 4_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 0 out of 4) Thread 4_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 1 out of 4) Thread 4_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 2 out of 4) Thread 0_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 0 out of 4) Thread 5_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 0 out of 3) Thread 4_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 3 out of 4) Thread 1_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 0 out of 3) Thread 0_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 1 out of 4) Thread 5_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 1 out of 3) Thread 2_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 0 out of 3) Thread 4_0 will analyze 4088 compounds and 24528 transitions from SWATH 9 (batch 4 out of 4) Thread 3_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 0 out of 3) Thread 1_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 1 out of 3) 30.30 % Thread 0_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 2 out of 4) Thread 5_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 2 out of 3) Thread 2_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 1 out of 3) Thread 3_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 1 out of 3) Thread 4_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 0 out of 3) Thread 1_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 2 out of 3) Thread 0_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 3 out of 4) Thread 5_0 will analyze 3759 compounds and 22554 transitions from SWATH 11 (batch 3 out of 3) Thread 2_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 2 out of 3) Thread 3_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 2 out of 3) Thread 4_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 1 out of 3) Thread 1_0 will analyze 3636 compounds and 21816 transitions from SWATH 12 (batch 3 out of 3) 33.33 % Thread 0_0 will analyze 4131 compounds and 24786 transitions from SWATH 10 (batch 4 out of 4) Thread 2_0 will analyze 3239 compounds and 19434 transitions from SWATH 14 (batch 3 out of 3) Thread 3_0 will analyze 3615 compounds and 21690 transitions from SWATH 13 (batch 3 out of 3) 36.36 % Thread 4_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 2 out of 3) 45.45 % Thread 4_0 will analyze 3104 compounds and 18624 transitions from SWATH 15 (batch 3 out of 3) 48.48 % Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 0 out of 3) Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 1 out of 3) Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 2 out of 3) Thread 1_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 0 out of 3) Thread 3_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 0 out of 3) Thread 5_0 will analyze 3623 compounds and 21738 transitions from SWATH 16 (batch 3 out of 3) Thread 0_0 will analyze 2765 compounds and 16590 transitions from SWATH 18 (batch 0 out of 2) Thread 1_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 1 out of 3) 51.52 % Thread 2_0 will analyze 2951 compounds and 17706 transitions from SWATH 19 (batch 0 out of 2) Thread 3_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 1 out of 3) Thread 0_0 will analyze 2765 compounds and 16590 transitions from SWATH 18 (batch 1 out of 2) Thread 1_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 2 out of 3) Thread 2_0 will analyze 2951 compounds and 17706 transitions from SWATH 19 (batch 1 out of 2) Thread 3_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 2 out of 3) Thread 4_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 0 out of 3) Thread 0_0 will analyze 2765 compounds and 16590 transitions from SWATH 18 (batch 2 out of 2) Thread 1_0 will analyze 3319 compounds and 19914 transitions from SWATH 17 (batch 3 out of 3) Thread 2_0 will analyze 2951 compounds and 17706 transitions from SWATH 19 (batch 2 out of 2) Thread 3_0 will analyze 3081 compounds and 18486 transitions from SWATH 20 (batch 3 out of 3) Thread 5_0 will analyze 2888 compounds and 17328 transitions from SWATH 22 (batch 0 out of 2) Thread 4_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 1 out of 3) 63.64 % Thread 5_0 will analyze 2888 compounds and 17328 transitions from SWATH 22 (batch 1 out of 2) Thread 4_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 2 out of 3) Thread 5_0 will analyze 2888 compounds and 17328 transitions from SWATH 22 (batch 2 out of 2) Thread 4_0 will analyze 3017 compounds and 18102 transitions from SWATH 21 (batch 3 out of 3) 69.70 % Thread 3_0 will analyze 2491 compounds and 14946 transitions from SWATH 25 (batch 0 out of 2) Thread 0_0 will analyze 2756 compounds and 16536 transitions from SWATH 23 (batch 0 out of 2) Thread 2_0 will analyze 2356 compounds and 14136 transitions from SWATH 26 (batch 0 out of 2) Thread 1_0 will analyze 2377 compounds and 14262 transitions from SWATH 24 (batch 0 out of 2) Thread 3_0 will analyze 2491 compounds and 14946 transitions from SWATH 25 (batch 1 out of 2) Thread 2_0 will analyze 2356 compounds and 14136 transitions from SWATH 26 (batch 1 out of 2) Thread 0_0 will analyze 2756 compounds and 16536 transitions from SWATH 23 (batch 1 out of 2) Thread 1_0 will analyze 2377 compounds and 14262 transitions from SWATH 24 (batch 1 out of 2) Thread 3_0 will analyze 2491 compounds and 14946 transitions from SWATH 25 (batch 2 out of 2) Thread 2_0 will analyze 2356 compounds and 14136 transitions from SWATH 26 (batch 2 out of 2) Thread 0_0 will analyze 2756 compounds and 16536 transitions from SWATH 23 (batch 2 out of 2) Thread 4_0 will analyze 2172 compounds and 13032 transitions from SWATH 27 (batch 0 out of 2) Thread 5_0 will analyze 2029 compounds and 12174 transitions from SWATH 28 (batch 0 out of 2) Thread 1_0 will analyze 2377 compounds and 14262 transitions from SWATH 24 (batch 2 out of 2) 81.82 % Thread 4_0 will analyze 2172 compounds and 13032 transitions from SWATH 27 (batch 1 out of 2) Thread 5_0 will analyze 2029 compounds and 12174 transitions from SWATH 28 (batch 1 out of 2) Thread 4_0 will analyze 2172 compounds and 13032 transitions from SWATH 27 (batch 2 out of 2) Thread 5_0 will analyze 2029 compounds and 12174 transitions from SWATH 28 (batch 2 out of 2) 87.88 % Thread 3_0 will analyze 1529 compounds and 9174 transitions from SWATH 29 (batch 0 out of 1) Thread 2_0 will analyze 1577 compounds and 9462 transitions from SWATH 30 (batch 0 out of 1) Thread 3_0 will analyze 1529 compounds and 9174 transitions from SWATH 29 (batch 1 out of 1) Thread 2_0 will analyze 1577 compounds and 9462 transitions from SWATH 30 (batch 1 out of 1) Thread 0_0 will analyze 1201 compounds and 7206 transitions from SWATH 31 (batch 0 out of 1) 90.91 % Thread 1_0 will analyze 546 compounds and 3276 transitions from SWATH 32 (batch 0 out of 1) 93.94 % Thread 0_0 will analyze 1201 compounds and 7206 transitions from SWATH 31 (batch 1 out of 1) Thread 1_0 will analyze 546 compounds and 3276 transitions from SWATH 32 (batch 1 out of 1) -- done [took 12:33 m (CPU), 08:57 m (Wall)] -- OpenSwathWorkflow took 28:35 m (wall), 36:11 m (CPU), 01:39 m (system), 34:31 m (user); Peak Memory Usage: 22485 MB.

3.5 PercolatorAdapter

(base) ptruong@planck:/hdd_14T/data/PXD002952/bugg_report_20210205$ PercolatorAdapter -in_osw osw_output.osw -out percolatorAdapter.osw Prepared percolator input. PercolatorAdapter finished successfully! PercolatorAdapter took 14:35 m (wall), 10:03 m (CPU), 2.63 s (system), 10:00 m (user); Peak Memory Usage: 448 MB. Prepared percolator input. PercolatorAdapter finished successfully! PercolatorAdapter took 14:35 m (wall), 10:03 m (CPU), 2.63 s (system), 10:00 m (user); Peak Memory Usage: 448 MB.

Run every library for specific file.

I think I matched wrong .mzML with wrong spectra library because some have variable window, some have fixed and they should have same amount of SWATHs.

E.g. spectral library for 32var should be used for:

ls HYE110_TTOF6600_32var**.mzML

this set.

The run_osw_output.sh script is adjusted accordingly and run again. The files contained in this should work with percolatorAdapter

2021-02-04

Looking into the data again... what is peptide identification quality.

I think I actually need to run percolator for me to know idenfication quality. I should probabily read the percolator paper again and see if i understand it better.

For PercolatorAdapter to run in the OSW-pipeline I need to have .PQP format for the library.

To obtain this I use

(base) ptruong@planck:/hdd_14T/data/PXD002952$ TargetedFileConverter -in test.TraML -out test.PQP TargetedFileConverter took 22.72 s (wall), 21.38 s (CPU), 0.32 s (system), 21.06 s (user); Peak Memory Usage: 601 MB.

Where "test.TraML" is the supplied spectral library in PXD002952.

Started OpenSwathWorkflow with .out_osw output.

I need the .osw output to run Percolator on the data.

TRIC should be used on percolator output for .osw file.

Looked at a youtube tutorial for svm as a reminder.

Read about Percolator.

(base) ptruong@planck:/hdd_14T/data/PXD002952/percolatorAdapter_test$ PercolatorAdapter -in_osw osw_output_HYE110_TTOF6600_32fix_lgillet_I160308_003_Pedro_Sample_A___32_fixed.mzML.osw -out percolatorAdapter_output.osw Prepared percolator input. Standard output: Running: /usr/bin/percolator -U -m /tmp/20210204_153946_planck_265848_1/20210204_153946_planck_265848_2_target_pout_psms.tab -M /tmp/20210204_153946_planck_265848_1/20210204_153946_planck_265848_2_decoy_pout_psms.tab /tmp/20210204_153946_planck_265848_1/20210204_153946_planck_265848_2_pin.tab

Standard error: Percolator version 3.02.nightly-4-2a55db2, Build Date Nov 27 2018 17:02:04 Copyright (c) 2006-9 University of Washington. All rights reserved. Written by Lukas Käll ([email protected]) in the Department of Genome Sciences at the University of Washington. Issued command: /usr/bin/percolator -U -m /tmp/20210204_153946_planck_265848_1/20210204_153946_planck_265848_2_target_pout_psms.tab -M /tmp/20210204_153946_planck_265848_1/20210204_153946_planck_265848_2_decoy_pout_psms.tab /tmp/20210204_153946_planck_265848_1/20210204_153946_planck_265848_2_pin.tab Started Thu Feb 4 15:42:08 2021 Hyperparameters: selectionFdr=0.01, Cpos=0, Cneg=0, maxNiter=10 Reading tab-delimited input from datafile /tmp/20210204_153946_planck_265848_1/20210204_153946_planck_265848_2_pin.tab Features: VAR_BSERIES_SCORE VAR_DOTPROD_SCORE VAR_ELUTION_MODEL_FIT_SCORE VAR_INTENSITY_SCORE VAR_ISOTOPE_CORRELATION_SCORE VAR_ISOTOPE_OVERLAP_SCORE VAR_LIBRARY_CORR VAR_LIBRARY_DOTPROD VAR_LIBRARY_MANHATTAN VAR_LIBRARY_RMSD VAR_LIBRARY_ROOTMEANSQUARE VAR_LIBRARY_SANGLE VAR_LOG_SN_SCORE VAR_MANHATTAN_SCORE VAR_MASSDEV_SCORE VAR_MASSDEV_SCORE_WEIGHTED VAR_MI_RATIO_SCORE VAR_MI_SCORE VAR_MI_WEIGHTED_SCORE VAR_NORM_RT_SCORE VAR_SONAR_LAG VAR_SONAR_LOG_DIFF VAR_SONAR_LOG_SN VAR_SONAR_LOG_TREND VAR_SONAR_RSQ VAR_SONAR_SHAPE VAR_XCORR_COELUTION VAR_XCORR_COELUTION_WEIGHTED VAR_XCORR_SHAPE VAR_XCORR_SHAPE_WEIGHTED VAR_YSERIES_SCORE Found 307724 PSMs Concatenated search input detected, skipping both target-decoy competition and mix-max. Train/test set contains 307724 positives and 0 negatives, size ratio=inf and pi0=1 Exception caught: Error: no decoy PSMs were provided.

Process '/usr/bin/percolator' did not finish successfully (exit code: ). Please check the log.

PercolatorAdapter took 02:23 m (wall), 02:10 m (CPU), 1.41 s (system), 02:09 m (user); Peak Memory Usage: 161 MB.

2021-02-03

Looking into TRIC, Looking at the data, Reading Deanna paper.

2021-02-01

Trying to mount the docker on the new harddrive using the following command:

(base) ptruong@planck:~/git/dia_sum$ docker run --name osw_tutorial --rm -v /hdd_14T/:/data -i -t openswath/openswath:latest docker: Error response from daemon: error while creating mount source path '/hdd_14T': mkdir /hdd_14T: read-only file system.

There seem to be some error relating to mounting docker on a seperate drive.

moby/moby#34427 - this webpage tells me to remove my snap version of docker and reinstall non-snap version of docker for it to work correctly.

Removed snap version of docker and installed docker according to https://docs.docker.com/engine/install/ubuntu/.

Now the problem is fixed!

Started the conversion process.

Started OpenSwathWorkflow

#!/bin/bash

for filename in *.mzML
do
  OpenSwathWorkflow -in $filename -tr ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_fixed_curated_decoy.TraML -out_tsv osw_output_$filename.tsv -readOptions cacheWorkingInMemory -batchSize 1000 -min_upper_edge_dist 1 -extra_rt_extraction_window 100 -min_rsq 0.95 -min_coverage 0.6 -rt_extraction_window 600 -mz_extraction_window 30 -threads 6 -ppm

done;

2021-01-27

Installed new harddrive at lab computer, using GParted to create partition.

https://askubuntu.com/questions/662229/what-partition-table-does-ubuntu-create-by-default

This guide suggests using gpt partition.

Tutorial for gparted:

https://linuxhint.com/gparted_ubuntu/

ToDo before bedtime:

  • convert all .wiff files to .mzml in hdd_14T.
  • perform a spectral library run of openMS.
  • check the osw_tsv.csv output data.
  • check TRIC

Added shared folder for virtualBox and started process for converting all .wiff files for .mzml.

2021-01-26

Ran openSWATHWorkFlow yesterday... I choose the wrong input (DDA input). I should use the DIA. I will try again and remake.

Also, there should be a file for irt_normalization hroest_DIA_iRT.TraML. I can't find the file in the ProteomeExchange.

OpenSwathWorkflow -in ../test_HYE124_TTOF6600_32fix_lgillet_I150211_004_test_sample.mzML -tr ecolihumanyeast_concat_mayu_IRR_cons_openswath_32w_fixed_curated_decoy.TraML -out_tsv osw_output.tsv -readOptions cacheWorkingInMemory -batchSize 1000 -min_upper_edge_dist 1 -extra_rt_extraction_window 100 -min_rsq 0.95 -min_coverage 0.6 -rt_extraction_window 600 -mz_extraction_window 30 -threads 6 -ppm

Running this command gives the following error.

Error: Unable to read file (- due to that error of type Parse Error in: /code/OpenMS/src/openms/source/FORMAT/HANDLERS/XMLHandler.cpp@105-void OpenMS::Internal::XMLHandler::fatalError(OpenMS::Internal::XMLHandler::ActionMode, const OpenMS::String&, OpenMS::UInt, OpenMS::UInt) const)

According to nf-core/mhcquant#91

This is some c++ internal error. And It could be fixed by peak_ms_level 1 or other centroiding algorithm.

I try to use msconvert to filter on "msLevel 1" with the following command:

msconvert test_HYE124_TTOF6600_32fix_lgillet_I150211_004_test_sample.mzML --filter "msLevel 1"

2021-01-25

Trying to run .splib --> .mrm --> .TraML file again. I suspect -cI command messed up last spectral library build.

Going through Tenzer paper PXD002952 to check details.

iRT normalization is not needed because same buffer (lösningmedel) is used for all samples and same gradient. I think.. this is found under sample preparation section of the paper.

(base) ptruong@planck:~/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output$ spectrast -cNpercolator.db -cM percolator.target.splib SpectraST started at Mon Jan 25 15:15:40 2021. Creating library from "/home/ptruong/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output/percolator.target.splib" Importing ions...500...1000...1500...2000...2500...3000...3500...4000...4500...5000...5500...6000...DONE!

Library file (BINARY) "percolator.db.splib" created. Library file (TEXT) "percolator.db.sptxt" created. M/Z Index file "percolator.db.spidx" created. Peptide Index file "percolator.db.pepidx" created. MRM Table file "percolator.db.mrm" created.

Total number of spectra in library: 6811 Total number of distinct peptide ions in library: 6375 Total number of distinct stripped peptides in library: 6342

CHARGE +1: 0 ; +2: 5055 ; +3: 1756 ; +4: 0 ; +5: 0 ; >+5: 0 ; Unk: 0 TERMINI Tryptic: 6811 ; Semi-tryptic: 0 ; Non-tryptic: 0 PROBABILITY >0.9999: 109 ; 0.999-0.9999: 789 ; 0.99-0.999: 2448 ; 0.9-0.99: 3464 ; <0.9: 1 NREPS 20+: 0 ; 10-19: 0 ; 4-9: 0 ; 2-3: 0 ; 1: 6811 MODIFICATIONS C,Carbamidomethyl: 938

Total Run Time = 457 seconds. SpectraST finished at Mon Jan 25 15:23:17 2021 without error.

(base) ptruong@planck:~/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output$ TargetedFileConverter -in percolator.db.mrm -out percolator.db.TraML ´Warning: SpectraST was not run in RT normalization mode but the converted list was interpreted to have iRT units. Check whether you need to adapt the parameter -algorithm:retentionTimeInterpretation. You can ignore this warning if you used a legacy SpectraST 4.0 file. Progress of 'conversion to internal data representation': -- done [took 2.00 s (CPU), 2.01 s (Wall)] -- TargetedFileConverter took 06:28 m (wall), 06:28 m (CPU), 3.04 s (system), 06:25 m (user); Peak Memory Usage: 1485 MB.

Converted file to .TraML and .tsv to check the formating.

To find out which input row is used, we check the values of the .tsv from the converted .splib and compare it to the pep.xml used as input to generate the .splib. Since spectrast spectral library creation should just be a tresholding of the pep.xml file.

EXPERIMENT ID. IF PEPTIDE-LEVEL(FDR) or PSM-LEVEL(PEP) probababilities are used for spectrast.

We run the following commands:

spectrast -cP percolator.target.pep.xml spectrast -cNpercolator.db -cM percolator.target.splib TargetedFileConverter -in percolator.db.mrm -out percolator.db.tsv

First we run the commands using -cP0.95.

We look at the .tsv file and find a peptide "AAAAAAALQAK". It has the parameters.

percolator_qvalue = 0.00071234 percolator_PEP = 0.00011322 peptideprophet_prob = 0.99988678

1-percolator_qvalue = 0.99928766 1-percolator_PEP = 0.9998867

We run the commands again with treshold 0.9995 and look for "AAAAAAALQAK". If it is there then PSM-level probabilities are used, otherwise peptide-level probabilities are used.

We check the .tsv and find "AAAAAAALQAK", which means PSM-level probabilities are used. Now to double check we find a peptide "AAAEVNQDYGLDPK" which has percolator_PEP = 0.00032544, and therefore 1-percolator_PEP = 0.99967456.

If we treshold at -cP0.9998, "AAAEVNQDYGLDPK" should be gone from the library .tsv.

We run the same commands with -cP0.9998 and the .tsv file does not contain the "AAAEVNQDYGLDPK" peptide.

Hence we are using the PEP-probabilities.

Generating data to run openSWATH

We run same command as above with -cP0.99.

spectrast -cP0.99 percolator.target.pep.xml spectrast -cNpercolator.db -cM percolator.target.splib TargetedFileConverter -in percolator.db.mrm -out percolator.db.tsv TargetedFileConverter -in percolator.db.mrm -out percolator.db.TraML

2021-01-22

Another link about running spectrast https://www.biorxiv.org/content/10.1101/2020.01.21.914788v2.full.pdf

How does spectral searching work?

How does OpenSWATH work for peptide concentrations?

Today was spent writing about the pipeline in thesis section.

Reading Griss J. 2015 - Spectral library searching in Proteomics

2021-01-21

Trying to get the whole pipeline working. The followning is the output.

root@f2d0dc8baf00:/data/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output# OpenSwathWorkflow -in comet.target.mzML -tr db_assays.TraML -sort_swath_maps -batchSize 1000 -out_tsv osw_output.tsv Since neither rt_norm nor tr_irt is set, OpenSWATH will not use RT-transformation (rather a null transformation will be applied) Progress of 'Load TraML file': -- done [took 1.99 s (CPU), 2.05 s (Wall)] -- Loaded 259 proteins, 318 compounds with 36807 transitions. Loading mzML file comet.target.mzML using readoptions normal Progress of 'Loading metadata file comet.target.mzML': Will analyze the metadata first to determine the number of SWATH windows and the window sizes. Determined there to be 34525 SWATH windows and in total 2212 MS1 spectra Determined there to be 34525 SWATH windows and in total 2212 MS1 spectra -- done [took 22.85 s (CPU), 41.60 s (Wall)] -- Progress of 'Loading data file comet.target.mzML': Killed

root@f2d0dc8baf00:/data/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.ukroot@f2d0dc8baf00:/data/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output# OpenSwathWorkflow -in comet.target.mzML -tr db_assays.TraML -sort_swath_maps -batchSize 1000 -out_tsv osw_output.tsv Since neither rt_norm nor tr_irt is set, OpenSWATH will not use RT-transformation (rather a null transformation will be applied) Progress of 'Load TraML file': -- done [took 1.90 s (CPU), 1.94 s (Wall)] -- Loaded 259 proteins, 318 compounds with 36807 transitions. Loading mzML file comet.target.mzML using readoptions normal Progress of 'Loading metadata file comet.target.mzML': Will analyze the metadata first to determine the number of SWATH windows and the window sizes. Determined there to be 34525 SWATH windows and in total 2212 MS1 spectra Determined there to be 34525 SWATH windows and in total 2212 MS1 spectra -- done [took 24.89 s (CPU), 41.30 s (Wall)] -- Progress of 'Loading data file comet.target.mzML': Read chromatogram while reading SWATH files, did not expect that!

-- done [took 03:48 m (CPU), 03:48 m (Wall)] -- Extraction will overlap between 360.521 and 359.522 This will lead to multiple extraction of the transitions in the overlapping regionwhich will lead to duplicated output. It is very unlikely that you want this. Please fix this by providing an appropriate extraction file with -swath_windows_file Extraction windows overlap. Will abort (override with -force) OpenSwathWorkflow took 04:33 m (wall), 04:16 m (CPU), 10.95 s (system), 04:05 m (user). root@f2d0dc8baf00:/data/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output#

It seems that I should be using swath_windows_file.

I will try the -force arg as well.

root@f2d0dc8baf00:/data/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output# OpenSwathWorkflow -in comet.target.mzML -tr db_assays.TraML -sort_swath_maps -batchSize 1000 -force -out_tsv osw_output.tsv

2020-01-19.

... Some .md notes lost because forgot to save .md file and closed terminal.

Have been trying to get spectraST. I can't figure out which file to use from crux-output, so have tried to run comet from tpp.

The process for generating spectral library...

Comet works but on HeLa_007.mzML pep.xml output I get the following error:

(base) ptruong@planck:~/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007$ PeptideProphetParser Hela_007_Rep1.pep.xml (Comet) init with Comet Trypsin MS Instrument info: Manufacturer: UNKNOWN, Model: UNKNOWN, Ionization: UNKNOWN, Analyzer: UNKNOWN, Detector: UNKNOWN

INFO: Processing standard MixtureModel ... PeptideProphet (TPP v5.2.0 Flammagenitus, Build 201902051127-7887 (Linux-x86_64)) AKeller@ISB read in 8 1+, 26899 2+, 24801 3+, 0 4+, 0 5+, 0 6+, and 0 7+ spectra. Initialising statistical models ... Iterations: .........10.........20......... WARNING: Mixture model quality test failed for charge (1+). model complete after 30 iterations

(base) ptruong@planck:~/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007$ InterProphetParser Hela_007_Rep1.pep.xml iProphet.pep.xml Running NSS NRS NSE NSI NSM NSP FPKM Model EM: Computing NSS values ... . done Computing NRS values ... .........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done Computing NSE values ... .........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done Computing NSI values ... .........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done Computing NSM values ... .........10%.........20%.........30%.........40%.........50%.........60%.........70%.........80%.........90%.........100% done Computing NSP values ... Creating 1 threads Wait for threads to finish ... 0--------------------------------------------------50------------------------------------------------100% .................................................................................................... done FPKM values are unavailable ... Iterations: .........10.......done

iProphet processed .mzML for probability assignment performed as recommended.

(base) ptruong@planck:~/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007$ spectrast -cP iProphet.pep.xml SpectraST started at Wed Jan 20 01:40:54 2021. Processing "iProphet.pep.xml"...500...1000...1500...2000...2500...3000...3500...4000...4500...5000...5500...6000...6500...7000...7500...8000...8500...9000...9500...10000...10500...11000...11500...12000...12500...13000...13500...14000...14500...15000...15500...16000...16500...17000...17500...18000...18500...19000...19500...20000...20500...21000...21500...22000...22500...23000...DONE! Importing all spectra with P>=0 ...10%...20%...30%...40%...50%...60%...70%...80%...90%...DONE!

Library file (BINARY) "iProphet.splib" created. Library file (TEXT) "iProphet.sptxt" created. M/Z Index file "iProphet.spidx" created. Peptide Index file "iProphet.pepidx" created.

Total number of spectra in library: 0 Total number of distinct peptide ions in library: 0 Total number of distinct stripped peptides in library: 0

CHARGE +1: 0 ; +2: 0 ; +3: 0 ; +4: 0 ; +5: 0 ; >+5: 0 ; Unk: 0 TERMINI Tryptic: 0 ; Semi-tryptic: 0 ; Non-tryptic: 0 PROBABILITY >0.9999: 0 ; 0.999-0.9999: 0 ; 0.99-0.999: 0 ; 0.9-0.99: 0 ; <0.9: 0 NREPS 20+: 0 ; 10-19: 0 ; 4-9: 0 ; 2-3: 0 ; 1: 0 MODIFICATIONS None

Total Run Time = 48 seconds. SpectraST finished at Wed Jan 20 01:41:42 2021 with 23144 error(s): ...

spectrast performed on the iProphet output... however 0 spectra ssem to have been generated. The output file is empty.

What to do??

I have tried rerunning everything many times.. without results.

I don't know how would be the smartest way to generate spectral lib from my current data set... I'm stuck this is not good.

Just noticed this snippet...

...

PEPXML IMPORT: Cannot find MS2+ scan #71199 in file "/home/ptruong/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output/comet.mzXML". Scan not imported. ...

Perhaps the error occurs because I have not converted my .mzML to mzXML, so the spectrast cannot search?

Converting .mzML to .mzXML worked.

(base) ptruong@planck:~/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output$ spectrast -cP0.9 comet.target.pep.xml SpectraST started at Wed Jan 20 02:48:55 2021. Processing "comet.target.pep.xml"...500...1000...1500...2000...2500...3000...3500...4000...4500...5000...5500...6000...6500...7000...7500...8000...8500...9000...9500...10000...10500...11000...11500...12000...12500...13000...13500...14000...14500...15000...15500...16000...16500...DONE! Importing all spectra with P>=0.9 ...10%...20%...30%...40%...50%...60%...70%...80%...90%...DONE!

Library file (BINARY) "comet.target.splib" created. Library file (TEXT) "comet.target.sptxt" created. M/Z Index file "comet.target.spidx" created. Peptide Index file "comet.target.pepidx" created.

Total number of spectra in library: 8141 Total number of distinct peptide ions in library: 7491 Total number of distinct stripped peptides in library: 7435

CHARGE +1: 0 ; +2: 6122 ; +3: 2019 ; +4: 0 ; +5: 0 ; >+5: 0 ; Unk: 0 TERMINI Tryptic: 8140 ; Semi-tryptic: 1 ; Non-tryptic: 0 PROBABILITY >0.9999: 4670 ; 0.999-0.9999: 1068 ; 0.99-0.999: 1068 ; 0.9-0.99: 1333 ; <0.9: 2 NREPS 20+: 0 ; 10-19: 0 ; 4-9: 0 ; 2-3: 0 ; 1: 8141 MODIFICATIONS C,Carbamidomethyl: 1026

Total Run Time = 380 seconds. SpectraST finished at Wed Jan 20 02:55:15 2021 with 8551 error(s): PEPXML IMPORT: Cannot find MS2+ scan #156 in file "/home/ptruong/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output/comet.mzXML". Scan not imported. PEPXML IMPORT: Cannot find MS2+ scan #185 in file "/home/ptruong/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output/comet.mzXML". Scan not imported. PEPXML IMPORT: Cannot find MS2+ scan #263 in file "/home/ptruong/git/bayesMS/data/datasets/PXD002952/ftp.pride.ebi.ac.uk/pride/data/archive/2016/09/PXD002952/convert/dda/mzml/HeLa_007/crux-output/comet.mzXML". Scan not imported. ...

The missed PEPXML IMPORT: are mslevel=1 scans in the .mzXML file.

2021-01-20 A lot of time spent tryin to get spectral lib file conversion.

Fixed all steps for the other two mzML steps.

How do we convert .splib to TraML, tsv, pqp format?

Spent a couple of hours (~3hrs) on trying to looking into ConvertTSVToTraML to convert .splib to .TraML as specified in http://openswath.org/en/latest/docs/ipf_legacy.html. It turns out that ConvertTSVToTraML has deprecated and is removed since openms2.2 (https://www.openms.de/openms220/). The current version is openms2.6.

TargetFileConverter should be used instead to convert.

The process is following..

This will generate the file db_assays.mrm

spectrast -cNdb_assays -cICID-QTOF -cM db_consensus.splib

TargetedFileConverter -in db_assays.mrm -out db_assays.TraML

OpenSwathWorkflow -in comet.target.mzML -tr db_assays.TraML -sort_swath_maps -batchSize 1000 -out_tsv osw_output.tsv

The above command worked; two thoughts - 1) iRT normalization is this needed (I should read the Tenzer paper to determine this and old notes from what is discussed with Lukas), 2) I am not using swath_windows_file at this moment. Is this needed?

2021-01-18

sudo cpan install FindBin::libs

Installed FindBin::libs

Now installation of TPP should be done.

In http://openswath.org/en/latest/docs/tpp.html.

Following line:

Please follow the tutorial until the generation of SpectraST spectral libraries. If you are using DIA data, follow the DIA-Umpire tutorial to generate pseudo-spectra first, which can then be processed using the TPP.

Does this mean that we can generate spectral library from DIA data as well... I should check this.

According to Yang et al. 2020 spectral library can be generated from DIA-data. SHould check this laters.

... Some .md notes lost because forgot to save .md file and closed terminal.

2021-01-15

Meeting with Lukas.

Looking into how to build spectra libraries again.

How to build spectral library from SPECTRAL LIBRARY GENERATION segment in this tutorial (http://openswath.org/en/latest/docs/tpp.html#id4).

Looking into Trans-Proteomic Pipeline.

Building the Trans-Proteomic Pipeline by following this tutorial (http://tools.proteomecenter.org/wiki/index.php?title=TPP_5.2.0:_Installing_on_Ubuntu_18.04_LTS).

Everything worked until

export PERL_MM_USE_DEFAULT=1
yes | sudo cpan install CGI
sudo cpan install XML::Parser
sudo cpan install FindBin::libs
yes | sudo cpan install JSON


cd /usr/local/tpp/bin
export PERL5LIB=/usr/local/tpp/lib/perl
./test_tpi.pl

(base) ptruong@planck:/usr/local/tpp/bin$ ./test_tpi.pl
Testing CGI.............................ok
Testing CGI::Carp.......................ok
Testing Cwd.............................ok
Testing Data::Dumper....................ok
Testing Digest::MD5.....................ok
Testing DirHandle.......................ok
Testing Fcntl...........................ok
Testing File::Basename..................ok
Testing File::Copy......................ok
Testing File::Path......................ok
Testing File::Spec......................ok
Testing File::Spec::Functions...........ok
Testing File::Spec::Win32...............ok
Testing File::Temp......................ok
Testing FindBin.........................ok
Testing FindBin::libs...................NOT FOUND!  Please check installation.
Testing Getopt::Long....................ok
Testing Getopt::Std.....................ok
Testing HTML::Entities..................ok
Testing IO::Compress::Zip...............ok
Testing JSON............................ok
Testing LWP::Simple.....................ok
Testing Pod::Usage......................ok
Testing POSIX...........................ok
Testing Test::More......................ok
Testing Tie::File.......................ok
Testing TPP.............................ok
Testing tpplib_perl.....................ok
Testing TPP::StatUtilities..............ok
Testing Win32...........................skipped
Testing Win32::GUI......................skipped
Testing Win32::GuiTest..................skipped
Testing Win32::Service..................skipped
Testing Win32::SysTray..................skipped
Testing XML::Parser.....................ok
29 modules installed, 5 skipped, 1 NOT installed


We don't have FindBin::libs. I will need to check how to install this tonight.


## 2021-01-14

Have gotten OpenSWATH to work on PASS00779 dataset.

Looking into BiblioSpec for spectral library build (How does this work?)
Looking into TPP Spectrast for spectra library build (Could not get TPP working correctly)
Looking into Mascot, SEQUEST for PSM (They seems to be proprietary)

Looking into the data set: I have choosen HYE124_TTOF6600_32fix data set because they seem to be the smallest data set with least dissimilar proportion between mixtures (65 % HUMAN, 20 % E.Coli, 15 % Yeast). 


Steps for building spectral library
.raw --> convert [MSConvert] to .mzML --> peptide spectral matching and protein identification (Crux-pipeline [Crux Bullseye --> Crux Comet/Tide search --> Crux Percolator]) --> BiblioSpec.

NOTE: files need to be renamed so that they do not have spaces before Crux can be run properly (I got stuck in this for a while).

I am currently running the Crux Bullseye on .mzML data. It seems to take about 2.5-3h for each .mzML file. 

Some Thoughts for meeting tomorrow:

- Tenzer data is big, Bruderer data is even bigger. Steps in pipeline take quite some time. How would be a smart way to work with the data, which data set to chose? Does it even matter?

Does it matter if TTOF6600, TTOF5600+ machine? 64 variable window size for SWATH vs 32 fixed window sized?

Does it matter which database search engine is used for peptide spectrum matches and protein identification (PSM)? 

Skyline can build spectral library (https://skyline.ms/_webdav/home/software/Skyline/%40files/tutorials/MethodEdit-20_1.pdf). But does not support tide-search format. So will use the comet formatting.

When this is done, everything else should work. 

How do i get BlibBuild working.






## 2021-01-13
http://openswath.org/en/latest/docs/docker.html#running-openswath-in-docker

To run from docker: This is the command.

docker run --name osw_tutorial --rm -v ~/:/data -i -t openswath/openswath:latest

How to run pyprophet.

Install pyprophet on python 2 environment. 

anaconda3/envs/py27/bin/pyprophet --help

## 2021-01-10

https://www.biorxiv.org/content/10.1101/044552v2.full.pdf

This website contains good tutorial on how to perform openSWATH procedure. 




### Looking at the Tenzer dataset

Questions:
- What is .wiff.scan files
- What is htrms files

.wiff.scan file seems to be to raw-files that should be converted to .mzml files according to (this)[https://www.researchgate.net/post/What_are_free_tools_to_convert_a_wiff_file_into_an_mzXML_file_in_an_MS_experiment] it should be able to be converted to .mzml with msconvert. 

Some more questions:
- What is important to focus on?
- What data set should I start with?
- What about datasets without proteinId?
- How should we approach this problem?
- Should I use the .raw files?
- What is the library (spectral library)
- What is the background file (background of encyclopeDIA?)
- When should I use PECAN? (One benchmarking paper used normal peakpicking from msconvert, is PECAN not for peak detection?)


Looking into 
Spectronaut_onlyHumanPeptidesDetectedByDIAUmpire_Report_ProtHUMAN.tsv


## 2021-01-06

https://bitbucket.org/searleb/encyclopedia/wiki/FAQs


Q. How can I use EncyclopeDIA in a non-English speaking environment?

Sorry, EncyclopeDIA is not currently internationalized for different languages. Aside from translation issues, EncyclopeDIA also assumes a decimal point separator (common to North America and most of Asia), rather than a decimal comma separator (common to mainland Europe). You can change the default settings for Java to force your computer to operate in this manner by specifying "user.language" and "user.region" to "US". For example, you can start EncyclopeDIA using the command:

java -Xmx12g -jar encyclopedia-0.9.0-executable.jar -Duser.language=en-US -Duser.region=US

Q. How do I run EncyclopeDIA or Walnut at the command line?

EncyclopeDIA can also be used at the command line. For example, to run EncyclopeDIA across all mzMLs in a directory, you can execute the shell command:

for i in directory/to/my/files/*.mzML; do 
    java -Xmx12g -jar encyclopedia-0.9.0-executable.jar -i $i -l library.dlib -f sequences.fasta; 
done

Here, Java is run with 12 GB of RAM with the flag "-Xmx12g". Additional flags are "-i" to specify the input mzML, "-l" to specify the library, and "-f" to specify the FASTA database. You can run the help function to get additional command line options:

java -Xmx12g -jar encyclopedia-0.9.0-executable.jar -h

The "-libexport" function can be used to generate global quantitative results.

java -Xmx12g -jar encyclopedia-0.9.0-executable.jar -libexport -a true -l library.dlib -f sequences.fasta -i directory/to/my/files/ -o output.elib

With the "-a true" flag, this command performs retention time alignment (match-between-runs) and global FDR estimation. After you run the -libexport function, all of the results are filtered to a 1% FDR with Percolator at the peptide (in the .peptides.txt table) and protein level (in the .proteins.txt table). With the "-a false" flag, the -libexport function can be used to generate a chromatogram library as "output.elib".


### 2020-12-18 Todo after friday meeting.

- Check Tenzer data set.
- Download ENSEMBL (Uniprot) data set.
- Download SWISS (Uniprot) data set.
- Check proteoform inference in Matthew's article about iPRG2016.
- Get EncyclopeDIA running on data sets.
- Get Triqler running on data sets.
- Think about what database Roland should use perform experiments for a fair comparison with triqler.
- Make project description more neutral. 
- Change project description from protein quantification to protein summarization (?)

Meeting notes

Project description is too loaded. It needs to be more neutral. 

Problems in benchmarking is not protein inference. We need to remove this part from any Triqler benchmarking. There are two approaches to how to treat proteoforms.

- If any proteoform is prevalent; the protein is assumed to exist (will id. too many proteins).
- If require more evidence for protein for it to be counted as identified (will be more conservative).

Triqler works on the conservative principle. Therefore, we need to have a database with less proteoforms for any comparison to be fair. 

Get encyclopeDIA working, get results and remove protein inference step from encyclopedia and run Triqler from this step.

I need to run the whole thing for my own to know the process. 

TIP on log-formatting: It will be much easier for everybody else to read if I chunk it up to experient scripts, run scripts and write about results and discuss the results in the log. Have the experiment scripts in experiment-folder.

On the FC diff. exp. comparison (2020-12-11 Cont. on diff expression). Still a good idea to do the matrix of which comparison are better for triqler and spectronaut for all cases and run, to get an overview of which methods perform better. But this is not priority at this moment.







### 2020-12-18 article for DIA data
https://www.nature.com/articles/s41467-019-13866-z#Sec16

Spectral library
https://www.ebi.ac.uk/pride/spectrumlibrary

Figure out how to run encyclopeDIA, what is background, library files etc...

File formats commonly used in proteomics. This is good to know.
https://pubmed.ncbi.nlm.nih.gov/22956731/


### 2020-12-11 Cont. on diff expression


```python
# refresh imports 

import os 
import time

import pandas as pd
import numpy as np 


os.chdir("/home/ptruong/git/bayesMS/bin")
from read_triqler_output import read_triqler_protein_output_to_df
from triqler_output_df_melter import melt_spectronaut_triqler_formatted, melt_triqler_output
from normalize_melted import normalize_within_sample, get_ratios_from_normalized_melted_df
from constants import get_sample_ratios, get_log2FC_ratio_matrix, get_log2FC_ratio_matrices
from extract_from_melt import get_protein_abundance, get_proteins
from log2fc_from_melt import get_log2FC_matrix
from log2fc_from_melt_computation import get_differentially_expressed_proteins_from_log2FC_df
os.chdir("/home/ptruong/git/bayesMS/data/log2FC_2020-12-10")

Differentially expressed per sample all for the same ratio (=0.8) of the true FC are counted as differentially expressed.

##########################################################################################
# WE DONT NEED TO LOOK AT THIS CODE IT IS JUST USED TO GENERATE SOME SAMPLE COMPARISONS. #
##########################################################################################
#S02 vs S06
triq_s02_s06_HUMAN_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S02", "S06", "HUMAN", 0.8)
spec_s02_s06_HUMAN_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S02", "S06", "HUMAN", 0.8)
triq_s02_s06_CAEEL_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S02", "S06", "CAEEL", 0.8)
spec_s02_s06_CAEEL_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S02", "S06", "CAEEL", 0.8)
triq_s02_s06_ARATH_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S02", "S06", "ARATH", 0.8)
spec_s02_s06_ARATH_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S02", "S06", "ARATH", 0.8)

triq_s02_s06_HUMAN_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S02", "S06", "HUMAN", 0.8))
spec_s02_s06_HUMAN_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S02", "S06", "HUMAN", 0.8))
triq_s02_s06_CAEEL_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S02", "S06", "CAEEL", 0.8))
spec_s02_s06_CAEEL_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S02", "S06", "CAEEL", 0.8))
triq_s02_s06_ARATH_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S02", "S06", "ARATH", 0.8))
spec_s02_s06_ARATH_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S02", "S06", "ARATH", 0.8))

#S03 vs S04
triq_s03_s04_HUMAN_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S03", "S04", "HUMAN", 0.8)
spec_s03_s04_HUMAN_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S03", "S04", "HUMAN", 0.8)
triq_s03_s04_CAEEL_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S03", "S04", "CAEEL", 0.8)
spec_s03_s04_CAEEL_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S03", "S04", "CAEEL", 0.8)
triq_s03_s04_ARATH_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S03", "S04", "ARATH", 0.8)
spec_s03_s04_ARATH_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S03", "S04", "ARATH", 0.8)

triq_s03_s04_HUMAN_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S03", "S04", "HUMAN", 0.8))
spec_s03_s04_HUMAN_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S03", "S04", "HUMAN", 0.8))
triq_s03_s04_CAEEL_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S03", "S04", "CAEEL", 0.8))
spec_s03_s04_CAEEL_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S03", "S04", "CAEEL", 0.8))
triq_s03_s04_ARATH_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S03", "S04", "ARATH", 0.8))
spec_s03_s04_ARATH_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S03", "S04", "ARATH", 0.8))

#S04 vs S09
triq_s04_s09_HUMAN_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S04", "S09", "HUMAN", 0.8)
spec_s04_s09_HUMAN_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S04", "S09", "HUMAN", 0.8)
triq_s04_s09_CAEEL_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S04", "S09", "CAEEL", 0.8)
spec_s04_s09_CAEEL_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S04", "S09", "CAEEL", 0.8)
triq_s04_s09_ARATH_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S04", "S09", "ARATH", 0.8)
spec_s04_s09_ARATH_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S04", "S09", "ARATH", 0.8)

triq_s04_s09_HUMAN_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S04", "S09", "HUMAN", 0.8))
spec_s04_s09_HUMAN_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S04", "S09", "HUMAN", 0.8))
triq_s04_s09_CAEEL_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S04", "S09", "CAEEL", 0.8))
spec_s04_s09_CAEEL_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S04", "S09", "CAEEL", 0.8))
triq_s04_s09_ARATH_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S04", "S09", "ARATH", 0.8))
spec_s04_s09_ARATH_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S04", "S09", "ARATH", 0.8))

#S05 vs S08
triq_s05_s08_HUMAN_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S05", "S08", "HUMAN", 0.8)
spec_s05_s08_HUMAN_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S05", "S08", "HUMAN", 0.8)
triq_s05_s08_CAEEL_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S05", "S08", "CAEEL", 0.8)
spec_s05_s08_CAEEL_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S05", "S08", "CAEEL", 0.8)
triq_s05_s08_ARATH_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("triq", "S05", "S08", "ARATH", 0.8)
spec_s05_s08_ARATH_diff_exp = get_differentially_expressed_proteins_from_log2FC_df("spec", "S05", "S08", "ARATH", 0.8)

triq_s05_s08_HUMAN_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S05", "S08", "HUMAN", 0.8))
spec_s05_s08_HUMAN_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S05", "S08", "HUMAN", 0.8))
triq_s05_s08_CAEEL_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S05", "S08", "CAEEL", 0.8))
spec_s05_s08_CAEEL_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S05", "S08", "CAEEL", 0.8))
triq_s05_s08_ARATH_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("triq", "S05", "S08", "ARATH", 0.8))
spec_s05_s08_ARATH_diff_exp_tot = np.sum(get_differentially_expressed_proteins_from_log2FC_df("spec", "S05", "S08", "ARATH", 0.8))
print("S02 vs S06")
# S02 vs S06
print("Triqler S02 vs S06, HUMAN")
print(triq_s02_s06_HUMAN_diff_exp) 
print("Total different expression: " + str(triq_s02_s06_HUMAN_diff_exp_tot)) 
print("")
print("Spectronaut S02 vs S06, HUMAN")    
print(spec_s02_s06_HUMAN_diff_exp)
print("Total different expression: " + str(spec_s02_s06_HUMAN_diff_exp_tot)) 
print("")
print("Triqler S02 vs S06, CAEEL")
print(triq_s02_s06_CAEEL_diff_exp) 
print("Total different expression: " + str(triq_s02_s06_CAEEL_diff_exp_tot)) 
print("")
print("Spectronaut S02 vs S06, CAEEL")    
print(spec_s02_s06_CAEEL_diff_exp) 
print("Total different expression: " + str(spec_s02_s06_CAEEL_diff_exp_tot)) 
print("")
print("Triqler S02 vs S06, ARATH")
print(triq_s02_s06_ARATH_diff_exp) 
print("Total different expression: " + str(triq_s02_s06_ARATH_diff_exp_tot)) 
print("")
print("Spectronaut S02 vs S06, ARATH")    
print(spec_s02_s06_ARATH_diff_exp) 
print("Total different expression: " + str(spec_s02_s06_ARATH_diff_exp_tot)) 
print("")

print("S03 vs S04")
#S03 vs S04
print("Triqler S03 vs S04, HUMAN")
print(triq_s03_s04_HUMAN_diff_exp) 
print("Total different expression: " + str(triq_s03_s04_HUMAN_diff_exp_tot)) 
print("")
print("Spectronaut S03 vs S04, HUMAN")    
print(spec_s03_s04_HUMAN_diff_exp)
print("Total different expression: " + str(spec_s03_s04_HUMAN_diff_exp_tot)) 
print("")
print("Triqler S03 vs S04, CAEEL")
print(triq_s03_s04_CAEEL_diff_exp) 
print("Total different expression: " + str(triq_s03_s04_CAEEL_diff_exp_tot)) 
print("")
print("Spectronaut S03 vs S04, CAEEL")    
print(spec_s03_s04_CAEEL_diff_exp) 
print("Total different expression: " + str(spec_s03_s04_CAEEL_diff_exp_tot)) 
print("")
print("Triqler S03 vs S04, ARATH")
print(triq_s03_s04_ARATH_diff_exp) 
print("Total different expression: " + str(triq_s03_s04_ARATH_diff_exp_tot)) 
print("")
print("Spectronaut S03 vs S04, ARATH")    
print(spec_s03_s04_ARATH_diff_exp) 
print("Total different expression: " + str(spec_s03_s04_ARATH_diff_exp_tot)) 
print("")


#S04 vs S09
print("S04 vs S09")
print("Triqler S04 vs S09, HUMAN")
print(triq_s04_s09_HUMAN_diff_exp) 
print("Total different expression: " + str(triq_s04_s09_HUMAN_diff_exp_tot)) 
print("")
print("Spectronaut S04 vs S09, HUMAN")    
print(spec_s04_s09_HUMAN_diff_exp)
print("Total different expression: " + str(spec_s04_s09_HUMAN_diff_exp_tot)) 
print("")
print("Triqler S04 vs S09, CAEEL")
print(triq_s04_s09_CAEEL_diff_exp) 
print("Total different expression: " + str(triq_s04_s09_CAEEL_diff_exp_tot)) 
print("")
print("Spectronaut S04 vs S09, CAEEL")    
print(spec_s04_s09_CAEEL_diff_exp) 
print("Total different expression: " + str(spec_s04_s09_CAEEL_diff_exp_tot)) 
print("")
print("Triqler S04 vs S09, ARATH")
print(triq_s04_s09_ARATH_diff_exp) 
print("Total different expression: " + str(triq_s04_s09_ARATH_diff_exp_tot)) 
print("")
print("Spectronaut S04 vs S09, ARATH")    
print(spec_s04_s09_ARATH_diff_exp) 
print("Total different expression: " + str(spec_s04_s09_ARATH_diff_exp_tot)) 
print("")

#S05 vs S08
print("S05 vs S08")
print("Triqler S05 vs S08, HUMAN")
print(triq_s05_s08_HUMAN_diff_exp) 
print("Total different expression: " + str(triq_s05_s08_HUMAN_diff_exp_tot)) 
print("")
print("Spectronaut S05 vs S08, HUMAN")    
print(spec_s05_s08_HUMAN_diff_exp)
print("Total different expression: " + str(spec_s05_s08_HUMAN_diff_exp_tot)) 
print("")
print("Triqler S05 vs S08, CAEEL")
print(triq_s05_s08_CAEEL_diff_exp) 
print("Total different expression: " + str(triq_s05_s08_CAEEL_diff_exp_tot)) 
print("")
print("Spectronaut S05 vs S08, CAEEL")    
print(spec_s05_s08_CAEEL_diff_exp) 
print("Total different expression: " + str(spec_s05_s08_CAEEL_diff_exp_tot)) 
print("")
print("Triqler S05 vs S08, ARATH")
print(triq_s05_s08_ARATH_diff_exp) 
print("Total different expression: " + str(triq_s05_s08_ARATH_diff_exp_tot)) 
print("")
print("Spectronaut S05 vs S08, ARATH")    
print(spec_s05_s08_ARATH_diff_exp) 
print("Total different expression: " + str(spec_s05_s08_ARATH_diff_exp_tot)) 
print("")



S02 vs S06
Triqler S02 vs S06, HUMAN
R01    4470
R02    4637
R03    4490
R04    4507
R05    4356
dtype: int64
Total different expression: 22460

Spectronaut S02 vs S06, HUMAN
R01    4195
R02    4372
R03    4232
R04    4170
R05    3996
dtype: int64
Total different expression: 20965

Triqler S02 vs S06, CAEEL
R01    0
R02    0
R03    0
R04    0
R05    0
dtype: int64
Total different expression: 0

Spectronaut S02 vs S06, CAEEL
R01    0
R02    0
R03    0
R04    0
R05    0
dtype: int64
Total different expression: 0

Triqler S02 vs S06, ARATH
R01    45
R02    38
R03    30
R04    34
R05    31
dtype: int64
Total different expression: 178

Spectronaut S02 vs S06, ARATH
R01    323
R02    286
R03    275
R04    264
R05    260
dtype: int64
Total different expression: 1408

S03 vs S04
Triqler S03 vs S04, HUMAN
R01    1802
R02    3201
R03    1894
R04    3556
R05    2797
dtype: int64
Total different expression: 13250

Spectronaut S03 vs S04, HUMAN
R01    2656
R02    3037
R03    2499
R04    3142
R05    2450
dtype: int64
Total different expression: 13784

Triqler S03 vs S04, CAEEL
R01    0
R02    0
R03    0
R04    0
R05    0
dtype: int64
Total different expression: 0

Spectronaut S03 vs S04, CAEEL
R01    0
R02    0
R03    0
R04    0
R05    0
dtype: int64
Total different expression: 0

Triqler S03 vs S04, ARATH
R01     8
R02    21
R03    15
R04    42
R05    19
dtype: int64
Total different expression: 105

Spectronaut S03 vs S04, ARATH
R01    119
R02    197
R03    143
R04    338
R05    196
dtype: int64
Total different expression: 993

S04 vs S09
Triqler S04 vs S09, HUMAN
R01    4086
R02    2706
R03    3451
R04    2796
R05    3109
dtype: int64
Total different expression: 16148

Spectronaut S04 vs S09, HUMAN
R01    3909
R02    3175
R03    4001
R04    3352
R05    3427
dtype: int64
Total different expression: 17864

Triqler S04 vs S09, CAEEL
R01    0
R02    0
R03    0
R04    0
R05    0
dtype: int64
Total different expression: 0

Spectronaut S04 vs S09, CAEEL
R01    0
R02    0
R03    0
R04    0
R05    0
dtype: int64
Total different expression: 0

Triqler S04 vs S09, ARATH
R01    14
R02    13
R03    23
R04    12
R05     5
dtype: int64
Total different expression: 67

Spectronaut S04 vs S09, ARATH
R01    197
R02    208
R03    246
R04    168
R05    138
dtype: int64
Total different expression: 957

S05 vs S08
Triqler S05 vs S08, HUMAN
R01    3745
R02    4341
R03    4193
R04    4592
R05    4744
dtype: int64
Total different expression: 21615

Spectronaut S05 vs S08, HUMAN
R01    3809
R02    4503
R03    3808
R04    4448
R05    4666
dtype: int64
Total different expression: 21234

Triqler S05 vs S08, CAEEL
R01    0
R02    0
R03    0
R04    0
R05    0
dtype: int64
Total different expression: 0

Spectronaut S05 vs S08, CAEEL
R01    0
R02    0
R03    0
R04    0
R05    0
dtype: int64
Total different expression: 0

Triqler S05 vs S08, ARATH
R01    19
R02    59
R03    25
R04    12
R05     9
dtype: int64
Total different expression: 124

Spectronaut S05 vs S08, ARATH
R01    157
R02    333
R03    219
R04    202
R05    130
dtype: int64
Total different expression: 1041

We need to make treshold depending on specie.

For HUMAN we want sample to be larger. For CAEEL we want sample to be smaller. For ARATH we want sample to be close to 0.

Triqler seems to get more or similar differentially expressed for HUMAN. Triqler seems to get less or similar differenatially expressed for ARATH.

which is good... but I seem to be doing something wrong with the CAEEL computations.

I could visualize this as matrix for all samples, where coloring of the matrix indicated which has more differentially expressed.

Think about if we should remove proteins that have not been discovered in different runs from spec...

2020-12-10

Continuing my work on differential expression

Just coded up from scripts for extracting protein abundances and proteins list from melted dataframes.

os.chdir("/home/ptruong/git/bayesMS/bin")
from extract_from_melt import get_protein_abundance, get_proteins
triq
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
q_value posterior_error_prob protein num_peptides protein_id_posterior_error_prob log2_fold_change diff_exp_prob_0.8 sample run value peptide
0 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R01 41.3300 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1...
1 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R02 41.0700 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1...
2 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R03 38.9800 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1...
3 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R04 40.1300 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1...
4 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R05 40.5900 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1...
... ... ... ... ... ... ... ... ... ... ... ...
543045 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S10 R01 1.0150 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH...
543046 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S10 R02 1.0360 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH...
543047 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S10 R03 1.0420 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH...
543048 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S10 R04 0.9652 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH...
543049 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S10 R05 0.9473 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH...

543050 rows × 11 columns

proteins = get_proteins(spec, "HUMAN") #FUNC TEST
proteins = get_proteins(triq, "HUMAN") #FUNC TEST
samples = ["S0"+str(i) for i in range(1,10)] + ["S10"] 
species = ["ARATH", "CAEEL", "HUMAN"]

df = df_triq #variable
df = df_spec #variable

sample = samples[0] #Variable
specie = "HUMAN" #variable
protein = "Q8TBA6_HUMAN" #VAR

protein = proteins[0] #iteration variable (if the protein is diff exp)
    
    
get_protein_abundance(triq, "S01", "HUMAN", protein) #FUNC TEST
18550    0.7940
18551    1.3410
18552    0.4786
18553    1.6060
18554    0.1356
Name: value, dtype: float64

I want to compute a pairwise log2FC between samples for each protein and run. i.e. a dataframe similar to

Protein Run1 Run2 Run3 Run4 Run5
P1 .
P2 .
... .
Pn . ...

will be generated for each pair SX vs XY. Then we can treshold this table and count the logFC differential expression. Either by

  1. Treshold by FC and setting if >3 proteins are differntially expressed than it is a differentially expressed protein or
  2. Count the total number of differentially expressed proteins (which should be better for triqler, since we cannot compare with NaN).

However, the code to produce this dataframe (below) is very slow. It takes about 15min for each pairwise matrix to be construction. For 45 comparisons it will take 675min (i.e) 11 hours. I will try to parallize the computation to save some time.

import time
start = time.time()

sample1 = "S02" #Var
sample2 = "S06" #Var
specie = "HUMAN" #Var
protein = proteins[0] #Var

log2FC_array = []
for protein in proteins:
    abundance_sample1 = get_protein_abundance(df, sample1, specie, protein)
    abundance_sample2 = get_protein_abundance(df, sample2, specie, protein)
    log2FC = np.log2(abundance_sample2) - np.log2(abundance_sample1)
    log2FC_array.append(log2FC)
    #break # ADDED JUST IN CASE WE RUN CELL - IT TAKES 15min on Planck.
end = time.time()
print(end-start)
755.7753973007202

The library joblib has a simple process for performing embarrasingly parallell jobs.

# Example script

# Sequential
from math import sqrt
test= []
for i in range(10):
    test.append(sqrt(i**2))

# Parallell
from math import sqrt
from joblib import Parallel, delayed
Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
def parallel_get_protein(df, protein, sample1, sample2, specie):
    abundance_sample1 = get_protein_abundance(df, sample1, specie, protein)
    abundance_sample2 = get_protein_abundance(df, sample2, specie, protein)
    log2FC = np.log2(abundance_sample2.values) - np.log2(abundance_sample1.values)
    return log2FC

import multiprocessing
num_cores = multiprocessing.cpu_count()

sample1 = "S02" #VAR
sample2 = "S06" #VAR
specie = "HUMAN" #VAR
df = triq #VAR

parallel_get_protein(df, protein, sample1, sample2, specie)
array([1.25103933, 1.14043778, 1.16028107, 1.14924195, 1.21299372])
start=time.time()
log2FC_array = Parallel(n_jobs=num_cores)(delayed(parallel_get_protein)(df, protein, sample1, sample2, specie) for protein in proteins)
end=time.time()
print(time.time())
exception calling callback for <Future at 0x7f420ded0f60 state=finished raised BrokenProcessPool>
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ptruong/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 404, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/home/ptruong/anaconda3/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
ModuleNotFoundError: No module named 'extract_from_melt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ptruong/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/home/ptruong/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 366, in __call__
    self.parallel.dispatch_next()
  File "/home/ptruong/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 799, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/home/ptruong/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 866, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/ptruong/anaconda3/lib/python3.6/site-packages/joblib/parallel.py", line 784, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/ptruong/anaconda3/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 531, in apply_async
    future = self._workers.submit(SafeFunction(func))
  File "/home/ptruong/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/reusable_executor.py", line 178, in submit
    fn, *args, **kwargs)
  File "/home/ptruong/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 1102, in submit
    raise self._flags.broken
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.



---------------------------------------------------------------------------

_RemoteTraceback                          Traceback (most recent call last)

_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/ptruong/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py", line 404, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/home/ptruong/anaconda3/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
ModuleNotFoundError: No module named 'extract_from_melt'
"""


The above exception was the direct cause of the following exception:


BrokenProcessPool                         Traceback (most recent call last)

<ipython-input-155-12099593d474> in <module>
      1 start=time.time()
----> 2 log2FC_array = Parallel(n_jobs=num_cores)(delayed(parallel_get_protein)(df, protein, sample1, sample2, specie) for protein in proteins)
      3 end=time.time()
      4 print(time.time())


~/anaconda3/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, iterable)
   1059 
   1060             with self._backend.retrieval_context():
-> 1061                 self.retrieve()
   1062             # Make sure that we get a last message telling us we are done
   1063             elapsed_time = time.time() - self._start_time


~/anaconda3/lib/python3.6/site-packages/joblib/parallel.py in retrieve(self)
    938             try:
    939                 if getattr(self._backend, 'supports_timeout', False):
--> 940                     self._output.extend(job.get(timeout=self.timeout))
    941                 else:
    942                     self._output.extend(job.get())


~/anaconda3/lib/python3.6/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    540         AsyncResults.get from multiprocessing."""
    541         try:
--> 542             return future.result(timeout=timeout)
    543         except CfTimeoutError as e:
    544             raise TimeoutError from e


~/anaconda3/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()


~/anaconda3/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result


~/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/_base.py in _invoke_callbacks(self)
    623         for callback in self._done_callbacks:
    624             try:
--> 625                 callback(self)
    626             except BaseException:
    627                 LOGGER.exception('exception calling callback for %r', self)


~/anaconda3/lib/python3.6/site-packages/joblib/parallel.py in __call__(self, out)
    364         with self.parallel._lock:
    365             if self.parallel._original_iterator is not None:
--> 366                 self.parallel.dispatch_next()
    367 
    368 


~/anaconda3/lib/python3.6/site-packages/joblib/parallel.py in dispatch_next(self)
    797 
    798         """
--> 799         if not self.dispatch_one_batch(self._original_iterator):
    800             self._iterating = False
    801             self._original_iterator = None


~/anaconda3/lib/python3.6/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    864                 return False
    865             else:
--> 866                 self._dispatch(tasks)
    867                 return True
    868 


~/anaconda3/lib/python3.6/site-packages/joblib/parallel.py in _dispatch(self, batch)
    782         with self._lock:
    783             job_idx = len(self._jobs)
--> 784             job = self._backend.apply_async(batch, callback=cb)
    785             # A job can complete so quickly than its callback is
    786             # called before we get here, causing self._jobs to


~/anaconda3/lib/python3.6/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    529     def apply_async(self, func, callback=None):
    530         """Schedule a func to be run"""
--> 531         future = self._workers.submit(SafeFunction(func))
    532         future.get = functools.partial(self.wrap_future_result, future)
    533         if callback is not None:


~/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/reusable_executor.py in submit(self, fn, *args, **kwargs)
    176         with self._submit_resize_lock:
    177             return super(_ReusablePoolExecutor, self).submit(
--> 178                 fn, *args, **kwargs)
    179 
    180     def _resize(self, max_workers):


~/anaconda3/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py in submit(self, fn, *args, **kwargs)
   1100         with self._flags.shutdown_lock:
   1101             if self._flags.broken is not None:
-> 1102                 raise self._flags.broken
   1103             if self._flags.shutdown:
   1104                 raise ShutdownExecutorError(


BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

This always happens when using multiprocessing in an iPython console in Spyder. A workaround is to run the script from the command line instead.

(https://stackoverflow.com/questions/56154654/a-task-failed-to-un-serialize)

Write about the wierd parallel processing timing...

from math import sqrt
test= []
for i in range(10):
    test.append(sqrt(i**2))

from math import sqrt
from joblib import Parallel, delayed
test2 = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))

g = "g"
def get_val(x,g):
    return str(x)+str("_tetst")+str(g)
# write about this timing thingy here... wierd...
start=time.time()
test3 = Parallel(n_jobs=1)(delayed(get_val)(i,g) for i in ["a","b","c","d","e","b","c","d","e","b","c","d","e"])
end=time.time()
print(end-start)
0.0007114410400390625
# write about this timing thingy here... wierd...
start=time.time()
test3 = Parallel(n_jobs=8)(delayed(get_val)(i,g) for i in ["a","b","c","d","e","b","c","d","e","b","c","d","e"])
end=time.time()
print(end-start)
0.2426769733428955

I wonder why it is so much slower (joblid)... the computations for log2FC is also much slower, it never finished before 15min... so I just ran some samples while in gym.

def get_log2FC_matrix(df, sample1, sample2, specie):
    proteins = get_proteins(df, specie) 
    runs = ["R0"+str(i) for i in range(1,6)]
    log2FC_array = []
    start=time.time()
    i = 0
    for protein in proteins:
        if i%200 == 0:
            print(str(i) + "/" + str(len(proteins)))
        log2FC_array.append(parallel_get_protein(df, protein, sample1, sample2, specie))
        i+=1
    end=time.time()
    print(end-start)
    df_log2FC = pd.DataFrame(log2FC_array, index=proteins, columns=runs)
    return df_log2FC

    df_log2FC = pd.DataFrame(log2FC_array, index=proteins, columns=runs).to_csv("spec_log2FC_S02_S06_HUMAN.csv", sep = "\t", index=False)

Hardcoding since the joblib did not work... let it run when im at the gym

print(1)
df_log2FC = get_log2FC_matrix(triq, "S02", "S06", "HUMAN")
df_log2FC.to_csv("triq_log2FC_S02_S06_HUMAN.csv", sep = "\t", index=False)
print(2)
df_log2FC = get_log2FC_matrix(triq, "S02", "S06", "CAEEL")
df_log2FC.to_csv("troq_log2FC_S02_S06_CAEEL.csv", sep = "\t", index=False)
print(3)
df_log2FC = get_log2FC_matrix(triq, "S02", "S06", "ARATH")
df_log2FC.to_csv("troq_log2FC_S02_S06_ARATH.csv", sep = "\t", index=False)
print(4)
df_log2FC = get_log2FC_matrix(spec, "S02", "S06", "CAEEL")
df_log2FC.to_csv("spec_log2FC_S02_S06_CAEEL.csv", sep = "\t", index=False)
print(5)
df_log2FC = get_log2FC_matrix(spec, "S02", "S06", "ARATH")
df_log2FC.to_csv("spec_log2FC_S02_S06_ARATH.csv", sep = "\t", index=False)

print(6)
df_log2FC = get_log2FC_matrix(triq, "S03", "S04", "HUMAN")
df_log2FC.to_csv("triq_log2FC_S03_S04_HUMAN.csv", sep = "\t", index=False)
print(7)
df_log2FC = get_log2FC_matrix(triq, "S03", "S04", "CAEEL")
df_log2FC.to_csv("triq_log2FC_S03_S04_CAEEL.csv", sep = "\t", index=False)
print(8)
df_log2FC = get_log2FC_matrix(triq, "S03", "S04", "ARATH")
df_log2FC.to_csv("triq_log2FC_S03_S04_ARATH.csv", sep = "\t", index=False)
print(9)
df_log2FC = get_log2FC_matrix(spec, "S03", "S04", "HUMAN")
df_log2FC.to_csv("spec_log2FC_S03_S04_HUMAN.csv", sep = "\t", index=False)
print(10)
df_log2FC = get_log2FC_matrix(spec, "S03", "S04", "CAEEL")
df_log2FC.to_csv("spec_log2FC_S03_S04_CAEEL.csv", sep = "\t", index=False)
print(11)
df_log2FC = get_log2FC_matrix(spec, "S03", "S04", "ARATH")
df_log2FC.to_csv("spec_log2FC_S03_S04_ARATH.csv", sep = "\t", index=False)

print(12)
df_log2FC = get_log2FC_matrix(triq, "S05", "S08", "HUMAN")
df_log2FC.to_csv("triq_log2FC_S05_S08_HUMAN.csv", sep = "\t", index=False)
print(13)
df_log2FC = get_log2FC_matrix(triq, "S05", "S08", "CAEEL")
df_log2FC.to_csv("triq_log2FC_S05_S08_CAEEL.csv", sep = "\t", index=False)
print(14)
df_log2FC = get_log2FC_matrix(triq, "S05", "S08", "ARATH")
df_log2FC.to_csv("triq_log2FC_S05_S08_ARATH.csv", sep = "\t", index=False)
print(15)
df_log2FC = get_log2FC_matrix(spec, "S05", "S08", "HUMAN")
df_log2FC.to_csv("spec_log2FC_S05_S08_HUMAN.csv", sep = "\t", index=False)
print(16)
df_log2FC = get_log2FC_matrix(spec, "S05", "S08", "CAEEL")
df_log2FC.to_csv("spec_log2FC_S05_S08_CAEEL.csv", sep = "\t", index=False)
print(17)
df_log2FC = get_log2FC_matrix(spec, "S05", "S08", "ARATH")
df_log2FC.to_csv("spec_log2FC_S05_S08_ARATH.csv", sep = "\t", index=False)
print(18)
df_log2FC = get_log2FC_matrix(triq, "S04", "S09", "HUMAN")
df_log2FC.to_csv("triq_log2FC_S04_S09_HUMAN.csv", sep = "\t", index=False)
print(19)
df_log2FC = get_log2FC_matrix(triq, "S04", "S09", "CAEEL")
df_log2FC.to_csv("triq_log2FC_S04_S09_CAEEL.csv", sep = "\t", index=False)
print(20)
df_log2FC = get_log2FC_matrix(triq, "S04", "S09", "ARATH")
df_log2FC.to_csv("triq_log2FC_S04_S09_ARATH.csv", sep = "\t", index=False)
print(21)
df_log2FC = get_log2FC_matrix(spec, "S04", "S09", "HUMAN")
df_log2FC.to_csv("spec_log2FC_S04_S09_HUMAN.csv", sep = "\t", index=False)
print(22)
df_log2FC = get_log2FC_matrix(spec, "S04", "S09", "CAEEL")
df_log2FC.to_csv("spec_log2FC_S04_S09_CAEEL.csv", sep = "\t", index=False)
print(23)
df_log2FC = get_log2FC_matrix(spec, "S04", "S09", "ARATH")
df_log2FC.to_csv("spec_log2FC_S04_S09_ARATH.csv", sep = "\t", index=False)
print("DONE!")
1
0/5593
200/5593
400/5593
600/5593
800/5593
1000/5593
1200/5593
1400/5593
1600/5593
1800/5593
2000/5593
2200/5593
2400/5593
2600/5593
2800/5593
3000/5593
3200/5593
3400/5593
3600/5593
3800/5593
4000/5593
4200/5593
4400/5593
4600/5593
4800/5593
5000/5593
5200/5593
5400/5593
422.6912865638733
2
0/1472
200/1472
400/1472
600/1472
800/1472
1000/1472
1200/1472
1400/1472
100.4037938117981
3
0/3796
200/3796
400/3796
600/3796
800/3796
1000/3796
1200/3796
1400/3796
1600/3796
1800/3796
2000/3796
2200/3796
2400/3796
2600/3796
2800/3796
3000/3796
3200/3796
3400/3796
3600/3796
273.03745770454407
4
0/4211
200/4211
400/4211
600/4211
800/4211
1000/4211
1200/4211
1400/4211
1600/4211
1800/4211
2000/4211
2200/4211
2400/4211
2600/4211
2800/4211
3000/4211
3200/4211
3400/4211
3600/4211
3800/4211
4000/4211
4200/4211
360.48885440826416
5
0/4620
200/4620
400/4620
600/4620
800/4620
1000/4620
1200/4620
1400/4620
1600/4620
1800/4620
2000/4620
2200/4620
2400/4620
2600/4620
2800/4620
3000/4620
3200/4620
3400/4620
3600/4620
3800/4620
4000/4620
4200/4620
4400/4620
4600/4620
398.10421776771545
6
0/5593
200/5593
400/5593
600/5593
800/5593
1000/5593
1200/5593
1400/5593
1600/5593
1800/5593
2000/5593
2200/5593
2400/5593
2600/5593
2800/5593
3000/5593
3200/5593
3400/5593
3600/5593
3800/5593
4000/5593
4200/5593
4400/5593
4600/5593
4800/5593
5000/5593
5200/5593
5400/5593
421.7850704193115
7
0/1472
200/1472
400/1472
600/1472
800/1472
1000/1472
1200/1472
1400/1472
100.94081354141235
8
0/3796
200/3796
400/3796
600/3796
800/3796
1000/3796
1200/3796
1400/3796
1600/3796
1800/3796
2000/3796
2200/3796
2400/3796
2600/3796
2800/3796
3000/3796
3200/3796
3400/3796
3600/3796
274.5622580051422
9
0/6183
200/6183
400/6183
600/6183
800/6183
1000/6183
1200/6183
1400/6183
1600/6183
1800/6183
2000/6183
2200/6183
2400/6183
2600/6183
2800/6183
3000/6183
3200/6183
3400/6183
3600/6183
3800/6183
4000/6183
4200/6183
4400/6183
4600/6183
4800/6183
5000/6183
5200/6183
5400/6183
5600/6183
5800/6183
6000/6183
545.6345722675323
10
0/4211
200/4211
400/4211
600/4211
800/4211
1000/4211
1200/4211
1400/4211
1600/4211
1800/4211
2000/4211
2200/4211
2400/4211
2600/4211
2800/4211
3000/4211
3200/4211
3400/4211
3600/4211
3800/4211
4000/4211
4200/4211
362.08672857284546
11
0/4620
200/4620
400/4620
600/4620
800/4620
1000/4620
1200/4620
1400/4620
1600/4620
1800/4620
2000/4620
2200/4620
2400/4620
2600/4620
2800/4620
3000/4620
3200/4620
3400/4620
3600/4620
3800/4620
4000/4620
4200/4620
4400/4620
4600/4620
399.01437854766846
12
0/5593
200/5593
400/5593
600/5593
800/5593
1000/5593
1200/5593
1400/5593
1600/5593
1800/5593
2000/5593
2200/5593
2400/5593
2600/5593
2800/5593
3000/5593
3200/5593
3400/5593
3600/5593
3800/5593
4000/5593
4200/5593
4400/5593
4600/5593
4800/5593
5000/5593
5200/5593
5400/5593
419.50996351242065
13
0/1472
200/1472
400/1472
600/1472
800/1472
1000/1472
1200/1472
1400/1472
100.6195158958435
14
0/3796
200/3796
400/3796
600/3796
800/3796
1000/3796
1200/3796
1400/3796
1600/3796
1800/3796
2000/3796
2200/3796
2400/3796
2600/3796
2800/3796
3000/3796
3200/3796
3400/3796
3600/3796
273.7214152812958
15
0/6183
200/6183
400/6183
600/6183
800/6183
1000/6183
1200/6183
1400/6183
1600/6183
1800/6183
2000/6183
2200/6183
2400/6183
2600/6183
2800/6183
3000/6183
3200/6183
3400/6183
3600/6183
3800/6183
4000/6183
4200/6183
4400/6183
4600/6183
4800/6183
5000/6183
5200/6183
5400/6183
5600/6183
5800/6183
6000/6183
546.1721587181091
16
0/4211
200/4211
400/4211
600/4211
800/4211
1000/4211
1200/4211
1400/4211
1600/4211
1800/4211
2000/4211
2200/4211
2400/4211
2600/4211
2800/4211
3000/4211
3200/4211
3400/4211
3600/4211
3800/4211
4000/4211
4200/4211
360.6661376953125
17
0/4620
200/4620
400/4620
600/4620
800/4620
1000/4620
1200/4620
1400/4620
1600/4620
1800/4620
2000/4620
2200/4620
2400/4620
2600/4620
2800/4620
3000/4620
3200/4620
3400/4620
3600/4620
3800/4620
4000/4620
4200/4620
4400/4620
4600/4620
399.0610158443451
18
0/5593
200/5593
400/5593
600/5593
800/5593
1000/5593
1200/5593
1400/5593
1600/5593
1800/5593
2000/5593
2200/5593
2400/5593
2600/5593
2800/5593
3000/5593
3200/5593
3400/5593
3600/5593
3800/5593
4000/5593
4200/5593
4400/5593
4600/5593
4800/5593
5000/5593
5200/5593
5400/5593
421.1472773551941
19
0/1472
200/1472
400/1472
600/1472
800/1472
1000/1472
1200/1472
1400/1472
101.06039953231812
20
0/3796
200/3796
400/3796
600/3796
800/3796
1000/3796
1200/3796
1400/3796
1600/3796
1800/3796
2000/3796
2200/3796
2400/3796
2600/3796
2800/3796
3000/3796
3200/3796
3400/3796
3600/3796
274.2331681251526
21
0/6183
200/6183
400/6183
600/6183
800/6183
1000/6183
1200/6183
1400/6183
1600/6183
1800/6183
2000/6183
2200/6183
2400/6183
2600/6183
2800/6183
3000/6183
3200/6183
3400/6183
3600/6183
3800/6183
4000/6183
4200/6183
4400/6183
4600/6183
4800/6183
5000/6183
5200/6183
5400/6183
5600/6183
5800/6183
6000/6183
543.7764277458191
22
0/4211
200/4211
400/4211
600/4211
800/4211
1000/4211
1200/4211
1400/4211
1600/4211
1800/4211
2000/4211
2200/4211
2400/4211
2600/4211
2800/4211
3000/4211
3200/4211
3400/4211
3600/4211
3800/4211
4000/4211
4200/4211
360.48128175735474
23
0/4620
200/4620
400/4620
600/4620
800/4620
1000/4620
1200/4620
1400/4620
1600/4620
1800/4620
2000/4620
2200/4620
2400/4620
2600/4620
2800/4620
3000/4620
3200/4620
3400/4620
3600/4620
3800/4620
4000/4620
4200/4620
4400/4620
4600/4620
397.95293641090393
DONE!

We see that some samples have less protein and take way less than 15min to run. So it might be feasible to get the runs quite quickly after all. Lets do a code for this later, now I want to complete the analysis.

df_log2FC
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
R01 R02 R03 R04 R05
A0A1P8AUY4 0.190271 -0.172924 0.992776 -0.710468 0.295473
A0MFS5 0.314738 -0.304560 -0.046011 -0.142394 0.123230
A1A6M1 0.045143 -0.039569 -0.031675 0.082436 -0.056626
A1L4X7 0.279982 -6.314680 0.403772 -0.886979 0.037891
A1L4Y2 -0.573467 -0.547050 -0.521499 -0.338061 NaN
... ... ... ... ... ...
Q9ZWJ3 0.243150 0.130662 0.133022 -0.244746 0.116473
Q9ZWT1 NaN 0.161794 0.542331 NaN NaN
Q9ZWT2 0.689017 0.528014 0.418026 0.011134 0.008915
X5JA13 0.592807 -0.553747 -0.015328 0.040232 0.044947
X5JB51 -0.161275 -0.138268 2.972913 -0.063825 -0.023935

4620 rows × 5 columns

2020-12-08

In the triqler manual. It is specified we should work with original input file (example case: iPRG2016 and we need the peptide quantification file iPRG2016.tsv.pqr.tsv).

In the example case. The following works:

python -m triqler iPRG2016.tsv

To generate the .tsv.pqr.tsv file. Then we do the protein_id plot.

python -m triqler.distribution.plot_posteriors --protein_id HPRR3730445_poolB iPRG2016.tsv

OUTPUT log:

(py36) ptruong@planck:~/git/triqler/example$ python -m triqler.distribution.plot_posteriors --protein_id HPRR3730445_poolB iPRG2016.tsv 
Triqler.distribution.plot_posteriors version 0.6.0
Copyright (c) 2018-2020 Matthew The. All rights reserved.
Written by Matthew The (matthew.the@scilifelab.se) in the
School of Engineering Sciences in Chemistry, Biotechnology and Health at the 
Royal Institute of Technology in Stockholm.
Issued command: plot_posteriors.py --protein_id HPRR3730445_poolB iPRG2016.tsv
Could not locate peptide quantification file iPRG2016.tsv.pqr.tsv. Run triqler to generate this file.
(py36) ptruong@planck:~/git/triqler/example$ python -m triqler iPRG2016.tsv 
Triqler version 0.6.0
Copyright (c) 2018-2020 Matthew The. All rights reserved.
Written by Matthew The (matthew.the@scilifelab.se) in the
School of Engineering Sciences in Chemistry, Biotechnology and Health at the 
Royal Institute of Technology in Stockholm.
Issued command: triqler.py iPRG2016.tsv
Parsing triqler input file
  Reading row 0
Calculating identification PEPs
  Identified 12113 PSMs at 1% FDR
Selecting best feature per run and spectrum
  featureGroupIdx: 0
Dividing intensities by 100000 for increased readability
Calculating peptide-level identification PEPs
  Identified 1988 peptides at 1% FDR
Writing peptide quant rows to file: iPRG2016.tsv.pqr.tsv
Calculating protein-level identification PEPs
  Identified 349 proteins at 1% FDR
Fitting hyperparameters
  params["muDetect"], params["sigmaDetect"] = 1.056334, 0.372395
  params["muXIC"], params["sigmaXIC"] = 3.276315, 0.953023
  params["muProtein"], params["sigmaProtein"] = 0.066437, 0.239524
  params["muFeatureDiff"], params["sigmaFeatureDiff"] = -0.013907, 0.149265
  params["shapeInGroupStdevs"], params["scaleInGroupStdevs"] = 1.027176, 0.089433
Calculating protein posteriors
  50 / 422 11.85%
  100 / 422 23.70%
  150 / 422 35.55%
  200 / 422 47.39%
  250 / 422 59.24%
  300 / 422 71.09%
  350 / 422 82.94%
  400 / 422 94.79%
Comparing 1:A+B to 2:B
  output file: proteins.1vs2.tsv
  Found 204 target proteins as differentially abundant at 5% FDR
Comparing 1:A+B to 3:A
  output file: proteins.1vs3.tsv
  Found 216 target proteins as differentially abundant at 5% FDR
Comparing 2:B to 3:A
  output file: proteins.2vs3.tsv
  Found 352 target proteins as differentially abundant at 5% FDR
Triqler execution took 28.871479630994145 seconds wall clock time
(py36) ptruong@planck:~/git/triqler/example$ python -m triqler.distribution.plot_posteriors --protein_id HPRR3730445_poolB iPRG2016.tsv 
Triqler.distribution.plot_posteriors version 0.6.0
Copyright (c) 2018-2020 Matthew The. All rights reserved.
Written by Matthew The (matthew.the@scilifelab.se) in the
School of Engineering Sciences in Chemistry, Biotechnology and Health at the 
Royal Institute of Technology in Stockholm.
Issued command: plot_posteriors.py --protein_id HPRR3730445_poolB iPRG2016.tsv
Could not locate peptide quantification file iPRG2016.tsv.pqr.tsv. Run triqler to generate this file.
(py36) ptruong@planck:~/git/triqler/example$ python -m triqler iPRG2016.tsv 
Triqler version 0.6.0
Copyright (c) 2018-2020 Matthew The. All rights reserved.
Written by Matthew The (matthew.the@scilifelab.se) in the
School of Engineering Sciences in Chemistry, Biotechnology and Health at the 
Royal Institute of Technology in Stockholm.
Issued command: triqler.py iPRG2016.tsv
Parsing triqler input file
  Reading row 0
Calculating identification PEPs
  Identified 12113 PSMs at 1% FDR
Selecting best feature per run and spectrum
  featureGroupIdx: 0
Dividing intensities by 100000 for increased readability
Calculating peptide-level identification PEPs
  Identified 1988 peptides at 1% FDR
Writing peptide quant rows to file: iPRG2016.tsv.pqr.tsv
Calculating protein-level identification PEPs
  Identified 349 proteins at 1% FDR
Fitting hyperparameters
  params["muDetect"], params["sigmaDetect"] = 1.056334, 0.372395
  params["muXIC"], params["sigmaXIC"] = 3.276315, 0.953023
  params["muProtein"], params["sigmaProtein"] = 0.066437, 0.239524
  params["muFeatureDiff"], params["sigmaFeatureDiff"] = -0.013907, 0.149265
  params["shapeInGroupStdevs"], params["scaleInGroupStdevs"] = 1.027176, 0.089433
Calculating protein posteriors
  50 / 422 11.85%
  100 / 422 23.70%
  150 / 422 35.55%
  200 / 422 47.39%
  250 / 422 59.24%
  300 / 422 71.09%
  350 / 422 82.94%
  400 / 422 94.79%
Comparing 1:A+B to 2:B
  output file: proteins.1vs2.tsv
  Found 204 target proteins as differentially abundant at 5% FDR
Comparing 1:A+B to 3:A
  output file: proteins.1vs3.tsv
  Found 216 target proteins as differentially abundant at 5% FDR
Comparing 2:B to 3:A
  output file: proteins.2vs3.tsv
  Found 352 target proteins as differentially abundant at 5% FDR
Triqler execution took 28.871479630994145 seconds wall clock time
(py36) ptruong@planck:~/git/triqler/example$ ls
iPRG2016_ref.proteins.1vs2.tsv	iPRG2016_ref.tsv.pqr.tsv  proteins.1vs2.tsv
iPRG2016_ref.proteins.1vs3.tsv	iPRG2016.tsv		  proteins.1vs3.tsv
iPRG2016_ref.proteins.2vs3.tsv	iPRG2016.tsv.pqr.tsv	  proteins.2vs3.tsv
(py36) ptruong@planck:~/git/triqler/example$ python -m triqler.distribution.plot_posteriors --protein_id HPRR3730445_poolB iPRG2016.tsv 
Triqler.distribution.plot_posteriors version 0.6.0
Copyright (c) 2018-2020 Matthew The. All rights reserved.
Written by Matthew The (matthew.the@scilifelab.se) in the
School of Engineering Sciences in Chemistry, Biotechnology and Health at the 
Royal Institute of Technology in Stockholm.
Issued command: plot_posteriors.py --protein_id HPRR3730445_poolB iPRG2016.tsv
Fitting hyperparameters
  params["muDetect"], params["sigmaDetect"] = 1.056315, 0.372438
  params["muXIC"], params["sigmaXIC"] = 3.276315, 0.953023
  params["muProtein"], params["sigmaProtein"] = 0.066425, 0.239521
  params["muFeatureDiff"], params["sigmaFeatureDiff"] = -0.013907, 0.149265
  params["shapeInGroupStdevs"], params["scaleInGroupStdevs"] = 1.027176, 0.089432

Protein ID: HPRR3730445_poolB

Peptide absolute abundances
760.43	509.03	1028.25	842.80	1610.55	1289.44	nan	nan	nan	combinedPEP=3.4e-06	peptide=R.WTAQGHANHGFVVEVAHLEEK.Q
99.93	166.59	3184.98	1868.59	6260.46	5909.35	nan	nan	59.71	combinedPEP=2.2e-05	peptide=R.LVNQNASRWESFDVTPAVMR.W
10064.52	12531.44	nan	27429.83	26226.20	23061.53nan	242.17	19.53	combinedPEP=0.0023	peptide=R.WESFDVTPAVMR.W

Peptide relative abundances
0.81	0.54	1.09	0.90	1.71	1.37	nan	nan	nan	combinedPEP=3.4e-06	peptide=R.WTAQGHANHGFVVEVAHLEEK.Q
0.12	0.21	3.96	2.32	7.78	7.34	nan	nan	0.07	combinedPEP=2.2e-05	peptide=R.LVNQNASRWESFDVTPAVMR.W
2.70	3.37	nan	7.37	7.05	6.20	nan	0.07	0.01	combinedPEP=0.0023	peptide=R.WESFDVTPAVMR.W

Protein abundance (expected value) and p-value
3.27	2.97	5.48	7.66	13.64	11.55	0.01	0.07	0.02
p-value: 3.2822538226303985e-05

Posterior probability |log2 fold change| < 1.00
  Group A+B vs Group B: 0.130072
  Group A+B vs Group A: 0.000000
  Group B vs Group A: 0.000000

Normal distribution fits for posterior distributions of treatment group relative abundances:
  Group A+B: mu, sigma = 0.472918, 0.114328
  Group B: mu, sigma = 0.918317, 0.071100
  Group A: mu, sigma = -1.782820, 0.269253
(py36) ptruong@planck:~/git/triqler/example$ 

Test attempt on PSSS3

and the following set of proteins.

A0A023T4K3_CAEEL

A0A061ACH4_CAEEL

A0A061ACK4_CAEEL

A0A061ACL3_CAEEL

A0A061ACR1_CAEEL

A0A061ACS5_CAEEL

A0A061ACU2_CAEEL

A0A061ACU6_CAEEL

A0A061ACY0_CAEEL

A0A061AD21_CAEEL

A0A061AD39_CAEEL

A0A061AD47_CAEEL

A0A061AE05_CAEEL

A0A061AJ42_CAEEL

A0A061AJK8_CAEEL

A0A061AKY5_CAEEL

A0A061AL58_CAEEL

A0A078BPG0_CAEEL

A0A078BPH9_CAEEL

A0A078BPJ4_CAEEL

There proteins are saved in protein_list.csv

Command run:

python -m triqler.distribution.plot_posteriors --protein_id_list protein_list.csv PSSS3_triqlerFormatted_nonShared.csv

output plots are in results/2020-12-08

Working with melted data

import os
os.chdir("/home/ptruong/git/bayesMS/bin")
import os 

import pandas as pd
import numpy as np 

from read_triqler_output import read_triqler_protein_output_to_df
from triqler_output_df_melter import melt_spectronaut_triqler_formatted, melt_triqler_output
from count_melted import count_melted_for_all_samples

os.chdir("/home/ptruong/git/bayesMS/data/old_data_pickled")
os.listdir()
['spectronaut.pkl',
 'spectronautParams.txt',
 'triqlerParams.txt',
 'triqler.pkl']
 pd.set_option('display.max_rows', 10)
spec = pd.read_pickle(r'spectronaut.pkl')
triq = pd.read_pickle(r'triqler.pkl')
os.chdir("/home/ptruong/git/bayesMS/data/triqlerResults_largeScale_minSamp20_FC0_8_adjInt")
os.listdir()
['proteins.6vs8.tsv',
 'proteins.5vs9.tsv',
 'proteins.3vs5.tsv',
 'proteins.4vs10.tsv',
 'proteins.3vs10.tsv',
 'proteins.1vs4.tsv',
 'proteins.5vs6.tsv',
 'proteins.2vs8.tsv',
 'proteins.4vs6.tsv',
 'proteins.2vs6.tsv',
 'proteins.9vs10.tsv',
 'proteins.2vs3.tsv',
 'proteins.7vs8.tsv',
 'proteins.1vs2.tsv',
 'proteins.3vs8.tsv',
 'proteins.3vs6.tsv',
 'proteins.1vs8.tsv',
 'proteins.6vs10.tsv',
 'proteins.5vs10.tsv',
 'proteins.3vs4.tsv',
 'proteins.7vs9.tsv',
 'proteins.1vs7.tsv',
 'proteins.2vs10.tsv',
 'proteins.3vs7.tsv',
 'proteins.8vs9.tsv',
 'proteins.4vs7.tsv',
 'proteins.7vs10.tsv',
 'proteins.2vs7.tsv',
 'proteins.6vs9.tsv',
 'proteins.1vs10.tsv',
 'proteins.4vs9.tsv',
 'proteins.5vs8.tsv',
 'proteins.2vs5.tsv',
 'proteins.1vs9.tsv',
 'proteins.1vs3.tsv',
 'proteins.5vs7.tsv',
 'proteins.1vs6.tsv',
 'proteins.2vs9.tsv',
 'proteins.3vs9.tsv',
 'proteins.6vs7.tsv',
 'proteins.8vs10.tsv',
 'proteins.4vs5.tsv',
 'proteins.4vs8.tsv',
 'proteins.1vs5.tsv',
 'proteins.2vs4.tsv']
triq = read_triqler_protein_output_to_df('proteins.3vs8.tsv')
triq2 = read_triqler_protein_output_to_df('proteins.2vs6.tsv')
spec = spec.rename(columns={'S03:S04_R05': 'S03:S03_R04'})
triq = melt_triqler_output(triq)
spec = melt_spectronaut_triqler_formatted(spec)
triq2=melt_triqler_output(triq2)
triq
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
q_value posterior_error_prob protein num_peptides protein_id_posterior_error_prob log2_fold_change diff_exp_prob_0.8 sample run value peptide species specie
0 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R01 41.3300 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1... CAEEL CAEEL
1 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R02 41.0700 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1... CAEEL CAEEL
2 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R03 38.9800 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1... CAEEL CAEEL
3 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R04 40.1300 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1... CAEEL CAEEL
4 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R05 40.5900 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1... CAEEL CAEEL
... ... ... ... ... ... ... ... ... ... ... ... ... ...
543045 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S10 R01 1.0150 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH... ARATH ARATH
543046 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S10 R02 1.0360 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH... ARATH ARATH
543047 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S10 R03 1.0420 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH... ARATH ARATH
543048 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S10 R04 0.9652 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH... ARATH ARATH
543049 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S10 R05 0.9473 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH... ARATH ARATH

543050 rows × 13 columns

triq2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
q_value posterior_error_prob protein num_peptides protein_id_posterior_error_prob log2_fold_change diff_exp_prob_0.8 sample run value peptide
0 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 4.139000 1.967000e-13 S01 R01 41.3300 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1...
1 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 4.139000 1.967000e-13 S01 R02 41.0700 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1...
2 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 4.139000 1.967000e-13 S01 R03 38.9800 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1...
3 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 4.139000 1.967000e-13 S01 R04 40.1300 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1...
4 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 4.139000 1.967000e-13 S01 R05 40.5900 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1...
... ... ... ... ... ... ... ... ... ... ... ...
543045 6.674000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.007193 1.000000e+00 S10 R01 1.0150 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH...
543046 6.674000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.007193 1.000000e+00 S10 R02 1.0360 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH...
543047 6.674000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.007193 1.000000e+00 S10 R03 1.0420 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH...
543048 6.674000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.007193 1.000000e+00 S10 R04 0.9652 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH...
543049 6.674000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.007193 1.000000e+00 S10 R05 0.9473 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH...

543050 rows × 11 columns

spec
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
id specie protein sample run value
0 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R01 203877.890625
1 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R02 190713.625000
2 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R03 195925.703125
3 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R04 150170.421875
4 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R05 163190.718750
... ... ... ... ... ... ...
750695 X5M5N0_CAEEL CAEEL X5M5N0 S10 R01 NaN
750696 X5M5N0_CAEEL CAEEL X5M5N0 S10 R02 NaN
750697 X5M5N0_CAEEL CAEEL X5M5N0 S10 R03 NaN
750698 X5M5N0_CAEEL CAEEL X5M5N0 S10 R04 NaN
750699 X5M5N0_CAEEL CAEEL X5M5N0 S10 R05 NaN

750700 rows × 6 columns

Lets work with triq...

samples = ["S0"+str(i) for i in range(1,10)] + ["S10"]
samples
['S01', 'S02', 'S03', 'S04', 'S05', 'S06', 'S07', 'S08', 'S09', 'S10']
triq["specie"] = triq.protein.apply(lambda x: x.split('_')[-1])
triq[triq["sample"] == "S01"]
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
q_value posterior_error_prob protein num_peptides protein_id_posterior_error_prob log2_fold_change diff_exp_prob_0.8 sample run value peptide species specie
0 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R01 41.3300 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1... CAEEL CAEEL
1 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R02 41.0700 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1... CAEEL CAEEL
2 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R03 38.9800 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1... CAEEL CAEEL
3 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R04 40.1300 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1... CAEEL CAEEL
4 1.168000e-09 1.168000e-09 P02566_CAEEL 152 1.167000e-09 5.84600 9.828000e-14 S01 R05 40.5900 ELLLDLPIK;YKQLTHQLEDAEER;HPNFEKPKPPK;DLEEANM[1... CAEEL CAEEL
... ... ... ... ... ... ... ... ... ... ... ... ... ...
543000 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S01 R01 1.0040 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH... ARATH ARATH
543001 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S01 R02 1.0300 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH... ARATH ARATH
543002 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S01 R03 1.0300 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH... ARATH ARATH
543003 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S01 R04 1.0330 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH... ARATH ARATH
543004 9.537000e-01 1.000000e+00 Q0WLB5_ARATH 66 1.214000e-08 -0.01544 1.000000e+00 S01 R05 0.9655 DLLLVNLR;VAAYIYK;NFLM[16]EAK;EATSFLLDVLKPNLPEH... ARATH ARATH

54305 rows × 13 columns

# Identified protein through all samples
print("%s : %i" % ("triqler number of protein ids", int(sum(triq.protein_id_posterior_error_prob < 0.01))))
print("%s : %i" % ("spectronaut number of protein ids", int(len(spec.dropna()))))

# Identified protein for species across all samples
triq_count = len(triq[triq.specie == "ARATH"])
spec_count = len(spec[spec.specie == "ARATH"].dropna())

print("%s : %i" % ("triqler ARATH number of protein ids", triq_count))
print("%s : %i" % ("spectronaut ARATH number of protein ids", spec_count))

triq_count = len(triq[triq.specie == "CAEEL"])
spec_count = len(spec[spec.specie == "CAEEL"].dropna())

print("%s : %i" % ("triqler CAEEL number of protein ids", triq_count))
print("%s : %i" % ("spectronaut CAEEL number of protein ids", spec_count))

triq_count = len(triq[triq.specie == "HUMAN"])
spec_count = len(spec[spec.specie == "HUMAN"].dropna())

print("%s : %i" % ("triqler HUMAN number of protein ids", triq_count))
print("%s : %i" % ("spectronaut HUMAN number of protein ids", spec_count))
triqler number of protein ids : 543050
spectronaut number of protein ids : 543601
triqler ARATH number of protein ids : 189800
spectronaut ARATH number of protein ids : 218754
triqler CAEEL number of protein ids : 73600
spectronaut CAEEL number of protein ids : 58732
triqler HUMAN number of protein ids : 279650
spectronaut HUMAN number of protein ids : 266115
samples = ["S0"+str(i) for i in range(1,10)] + ["S10"]
samples
['S01', 'S02', 'S03', 'S04', 'S05', 'S06', 'S07', 'S08', 'S09', 'S10']
count_melted_for_all_samples(triq, specie = None)
[54305, 54305, 54305, 54305, 54305, 54305, 54305, 54305, 54305, 54305]
count_melted_for_all_samples(triq, specie = "HUMAN")
[27965, 27965, 27965, 27965, 27965, 27965, 27965, 27965, 27965, 27965]
count_melted_for_all_samples(triq, specie = "CAEEL")
[7360, 7360, 7360, 7360, 7360, 7360, 7360, 7360, 7360, 7360]
count_melted_for_all_samples(triq, specie = "ARATH")
[18980, 18980, 18980, 18980, 18980, 18980, 18980, 18980, 18980, 18980]
count_melted_for_all_samples(spec, specie = None)
[43530, 66501, 61173, 57595, 54205, 52857, 52299, 51731, 52031, 51679]
count_melted_for_all_samples(spec, specie = "HUMAN")
[150, 26634, 29423, 29849, 29887, 29977, 30086, 29976, 30193, 29940]
count_melted_for_all_samples(spec, specie = "CAEEL")
[20808, 17464, 9863, 5896, 2566, 1155, 490, 229, 137, 124]
count_melted_for_all_samples(spec, specie = "ARATH")
[22572, 22403, 21887, 21850, 21752, 21725, 21723, 21526, 21701, 21615]

In the above counting we are counting the number identified proteins, without imputation for spec for each sample (i.e. with all 5 runs for each sample, triq is tresholded at 1%PEP and spec is tresholded at 1%FRD).

2020-12-09

Fold change matrix

We can compute a couple of fold change matrices for exploration.

  1. triqler posterior distribution differential expression for each proteinXvsY file.
  2. triqler fold change differential expression.
  3. spectronaut fold change differential expression.

How about the imputations?

  • I want to skip imputation for as long as possible.

Triqler input should have 1 intensities set to NaN

Coded up this part in bin/script_convert_PSSS3_1toNaN.py

Normalization

Normalization withing sample and run is performed. The logic is that all proteins from each sample and run should sum to 100%, but I do not know if this is correct after FDR tresholding.

os.chdir("/home/ptruong/git/bayesMS/bin")
from normalize_melted import normalize_within_sample
from normalize_melted import get_ratios_from_normalized_melted_df #Why no work??
---------------------------------------------------------------------------

ImportError                               Traceback (most recent call last)

<ipython-input-103-c583aa259998> in <module>
----> 1 from normalize_melted import get_ratios_from_normalized_melted_df #Why no work??


ImportError: cannot import name 'get_ratios_from_normalized_melted_df'
def get_ratios_from_normalized_melted_df(df, specie):
    """
    input - normalized df (triq, spec)
    output - ratios dataframe
    """
    samples = ["S0"+str(i) for i in range(1,10)] + ["S10"] 
    runs = ["R0" + str(i) for i in range(1,6)]
    
    ratios = []
    for sample in samples:
        ratios_for_sample = []
        for run in runs:
            df_sample = df[df["sample"] == sample]
            df_sample_run = df_sample[df_sample["run"] == run]
            df_sample_run_spec = df_sample_run[df_sample_run["specie"] == specie]
            ratios_for_sample.append(df_sample_run_spec.value.sum())
        ratios.append(ratios_for_sample)
    ratios_df = pd.DataFrame(ratios, index = samples, columns = runs)
    return ratios_df
df_spec = normalize_within_sample(spec)
df_triq  = normalize_within_sample(triq)
/home/ptruong/.local/lib/python3.6/site-packages/pandas/core/generic.py:5096: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value

Mixture ratios

Below is a presentation of the mixture ratios of the different species.

True ratios as a reminder

Species S01 S02 S03 S04 S05 S06 S07 S08 S09 S10
A. thaliana 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
C. elegans 0.5 0.25 0.125 0.0625 0.031 0.0155 0.008 0.004 0.002 0
H. sapiens 0 0.25 0.375 0.4375 0.469 0.4845 0.492 0.496 0.498 0.5
Table 1 . Protein ratios of the ten samples {S01, S02, ..., S10}.

Triqler normalized within sample ratios

get_ratios_from_normalized_melted_df(df_triq, "HUMAN")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
R01 R02 R03 R04 R05
S01 0.128203 0.126626 0.120877 0.127743 0.116869
S02 0.289482 0.291946 0.290659 0.287219 0.290460
S03 0.491754 0.487170 0.486278 0.483526 0.484137
S04 0.560859 0.560926 0.560348 0.558908 0.562115
S05 0.601416 0.598230 0.592589 0.593048 0.590547
S06 0.612168 0.626370 0.606038 0.604525 0.601781
S07 0.623440 0.639131 0.609387 0.615723 0.612576
S08 0.615171 0.708059 0.620531 0.620357 0.613928
S09 0.625644 0.620596 0.613507 0.624937 0.613674
S10 0.610641 0.606414 0.606816 0.605386 0.603594
get_ratios_from_normalized_melted_df(df_triq, "ARATH")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
R01 R02 R03 R04 R05
S01 0.202221 0.207353 0.206096 0.203861 0.210821
S02 0.263496 0.264762 0.266483 0.266022 0.270651
S03 0.273122 0.276097 0.279398 0.281625 0.281548
S04 0.282105 0.285016 0.288407 0.286770 0.290583
S05 0.290369 0.291970 0.297159 0.295287 0.298664
S06 0.284978 0.278644 0.289323 0.288021 0.292126
S07 0.285500 0.269952 0.284272 0.283905 0.285864
S08 0.277279 0.225521 0.279139 0.277611 0.278581
S09 0.276741 0.279461 0.281202 0.284556 0.282891
S10 0.286438 0.283075 0.288762 0.289589 0.289557
get_ratios_from_normalized_melted_df(df_triq, "CAEEL")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
R01 R02 R03 R04 R05
S01 0.669577 0.666021 0.673027 0.668396 0.672311
S02 0.447023 0.443292 0.442858 0.446759 0.438889
S03 0.235124 0.236733 0.234325 0.234849 0.234315
S04 0.157036 0.154057 0.151245 0.154322 0.147303
S05 0.108215 0.109800 0.110252 0.111665 0.110790
S06 0.102855 0.094986 0.104640 0.107454 0.106093
S07 0.091061 0.090917 0.106341 0.100372 0.101560
S08 0.107549 0.066420 0.100330 0.102032 0.107491
S09 0.097615 0.099942 0.105291 0.090508 0.103435
S10 0.102920 0.110511 0.104422 0.105024 0.106849

Spectronaut normalized within sample ratios

get_ratios_from_normalized_melted_df(df_spec, "HUMAN")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
R01 R02 R03 R04 R05
S01 0.000142 0.006533 0.000000 0.000340 0.000000
S02 0.219433 0.230812 0.215881 0.270386 0.268880
S03 0.353454 0.346912 0.338908 0.342514 0.342131
S04 0.378023 0.380670 0.387151 0.385364 0.387890
S05 0.399998 0.406819 0.393940 0.398728 0.396520
S06 0.426578 0.427706 0.410134 0.415170 0.414603
S07 0.440774 0.442464 0.420743 0.419045 0.454594
S08 0.423673 0.532117 0.432001 0.439144 0.429417
S09 0.443085 0.430571 0.427718 0.431624 0.425632
S10 0.412168 0.416974 0.411104 0.419243 0.414248
get_ratios_from_normalized_melted_df(df_spec, "ARATH")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
R01 R02 R03 R04 R05
S01 0.718387 0.709902 0.717234 0.719183 0.715033
S02 0.650569 0.635745 0.650215 0.609311 0.605776
S03 0.596266 0.592043 0.597642 0.595323 0.602168
S04 0.591857 0.587589 0.584385 0.578559 0.581851
S05 0.588080 0.580873 0.591711 0.586733 0.585161
S06 0.565183 0.564956 0.583932 0.576295 0.577756
S07 0.555158 0.552160 0.578245 0.573948 0.542448
S08 0.575006 0.466419 0.565983 0.558071 0.568368
S09 0.553483 0.568904 0.569775 0.568369 0.573979
S10 0.587524 0.579042 0.587372 0.580757 0.583747
get_ratios_from_normalized_melted_df(df_spec, "CAEEL")
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
R01 R02 R03 R04 R05
S01 0.281471 0.283565 0.282766 0.280477 0.284967
S02 0.129998 0.133443 0.133904 0.120304 0.125345
S03 0.050280 0.061045 0.063450 0.062163 0.055701
S04 0.030119 0.031741 0.028464 0.036078 0.030258
S05 0.011922 0.012308 0.014349 0.014539 0.018319
S06 0.008239 0.007338 0.005934 0.008535 0.007641
S07 0.004068 0.005375 0.001012 0.007007 0.002958
S08 0.001320 0.001464 0.002016 0.002785 0.002215
S09 0.003432 0.000525 0.002507 0.000007 0.000389
S10 0.000308 0.003984 0.001524 0.000000 0.002005
from constants import get_sample_ratios, get_log2FC_ratio_matrix, get_log2FC_ratio_matrices
ARATH_FC_matrix, CAEEL_FC_matrix, HUMAN_FC_matrix = get_log2FC_ratio_matrices()
ARATH_FC_matrix
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
S01 S02 S03 S04 S05 S06 S07 S08 S09 S10
S01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
S02 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
S03 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
S04 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
S05 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
S06 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
S07 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
S08 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
S09 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
S10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
CAEEL_FC_matrix
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
S01 S02 S03 S04 S05 S06 S07 S08 S09 S10
S01 0.000000 1.000000 2.000000 3.000000 4.011588 5.011588 5.965784 6.965784 7.965784 18.931569
S02 -1.000000 0.000000 1.000000 2.000000 3.011588 4.011588 4.965784 5.965784 6.965784 17.931569
S03 -2.000000 -1.000000 0.000000 1.000000 2.011588 3.011588 3.965784 4.965784 5.965784 16.931569
S04 -3.000000 -2.000000 -1.000000 0.000000 1.011588 2.011588 2.965784 3.965784 4.965784 15.931569
S05 -4.011588 -3.011588 -2.011588 -1.011588 0.000000 1.000000 1.954196 2.954196 3.954196 14.919981
S06 -5.011588 -4.011588 -3.011588 -2.011588 -1.000000 0.000000 0.954196 1.954196 2.954196 13.919981
S07 -5.965784 -4.965784 -3.965784 -2.965784 -1.954196 -0.954196 0.000000 1.000000 2.000000 12.965784
S08 -6.965784 -5.965784 -4.965784 -3.965784 -2.954196 -1.954196 -1.000000 0.000000 1.000000 11.965784
S09 -7.965784 -6.965784 -5.965784 -4.965784 -3.954196 -2.954196 -2.000000 -1.000000 0.000000 10.965784
S10 -18.931569 -17.931569 -16.931569 -15.931569 -14.919981 -13.919981 -12.965784 -11.965784 -10.965784 0.000000
HUMAN_FC_matrix
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
S01 S02 S03 S04 S05 S06 S07 S08 S09 S10
S01 0.000000 -17.931569 -18.516531 -18.738923 -18.839228 -18.886137 -18.908299 -18.919981 -18.925786 -18.931569
S02 17.931569 0.000000 -0.584963 -0.807355 -0.907660 -0.954569 -0.976730 -0.988412 -0.994218 -1.000000
S03 18.516531 0.584963 0.000000 -0.222392 -0.322697 -0.369606 -0.391768 -0.403450 -0.409255 -0.415037
S04 18.738923 0.807355 0.222392 0.000000 -0.100305 -0.147214 -0.169375 -0.181057 -0.186863 -0.192645
S05 18.839228 0.907660 0.322697 0.100305 0.000000 -0.046909 -0.069070 -0.080752 -0.086558 -0.092340
S06 18.886137 0.954569 0.369606 0.147214 0.046909 0.000000 -0.022162 -0.033843 -0.039649 -0.045431
S07 18.908299 0.976730 0.391768 0.169375 0.069070 0.022162 0.000000 -0.011682 -0.017487 -0.023270
S08 18.919981 0.988412 0.403450 0.181057 0.080752 0.033843 0.011682 0.000000 -0.005806 -0.011588
S09 18.925786 0.994218 0.409255 0.186863 0.086558 0.039649 0.017487 0.005806 0.000000 -0.005782
S10 18.931569 1.000000 0.415037 0.192645 0.092340 0.045431 0.023270 0.011588 0.005782 0.000000

Thoughs on how to compute the fold change differential expression

For FC diff expression could we take the FC_matrix x ratio of the FC to get a treshold for what is considered FC differentially expressed?.

This fold change could for example be 0.8 of the true FC. Where we choose the ratio such that it allows for a small margin, write it as a formula so that we can change this treshold...

2020-12-03

Job run on kebnekaise failed because due to exceeding storage. I did wrong in using the storage project. I've contacted the SNIC service to ask for guidance in this.

Day has been spent writing on Project description.

Raw data

os.chdir("/home/ptruong/git/bayesMS/data")
os.listdir()
['triqlerResults_largeScale_minSamp20_FC0_8_adjInt',
 'triqler_default_params',
 'PSSS3_raw_sample_top_100000.csv',
 'Headers.xlsx',
 'old_data_pickled']
df = pd.read_csv("PSSS3_raw_sample_top_100000.csv", sep = "\t")
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Unnamed: 0 R.Condition R.FileName PG.Organisms PG.ProteinAccessions PG.Cscore PG.NrOfStrippedSequencesIdentified PG.Qvalue PG.Quantity EG.StrippedSequence EG.IsDecoy EG.PrecursorId EG.PEP EG.Qvalue EG.Cscore FG.NormalizedMS2PeakArea
0 0 S500-PSSS3-S04 G_D180330_S500-PSSS3-S04_MHRM_R03_T0 Caenorhabditis elegans OX=6239 A0A023T4K3 0.182511 2 0.000646 377452.156250 SFYYLVQDLK False _SFYYLVQDLK_.2 0.014793 0.003878 2.355977 731908.023851
1 1 S500-PSSS3-S04 G_D180330_S500-PSSS3-S04_MHRM_R03_T0 Caenorhabditis elegans OX=6239 A0A023T4K3 0.182511 2 0.000646 377452.156250 EATVSESVLSELKR False _EATVSESVLSELKR_.3 0.022104 0.005114 2.073540 22996.253835
2 2 S500-PSSS3-S04 G_D180330_S500-PSSS3-S04_MHRM_R03_T0 Caenorhabditis elegans OX=6239 A0A023T4K3 0.182511 2 0.000646 377452.156250 EATVSESVLSELKR False _EATVSESVLSELKR_.2 1.000000 0.762912 -2.174942 42741.205204
3 3 S500-PSSS3-S04 G_D180330_S500-PSSS3-S04_MHRM_R03_T0 Caenorhabditis elegans OX=6239 A0A023T4K3 0.182511 2 0.000646 377452.156250 AADFYVR False _AADFYVR_.2 0.009466 0.015673 1.283862 36317.379819
4 4 S500-PSSS3-S04 G_D180330_S500-PSSS3-S04_MHRM_R03_T0 Caenorhabditis elegans OX=6239 A0A023T4K3 0.182511 2 0.000646 377452.156250 IGALADVNNSKDPDGLR False _IGALADVNNSKDPDGLR_.3 0.422581 0.102376 0.036027 22531.007700
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
99995 99995 S500-PSSS3-S04 G_D180330_S500-PSSS3-S04_MHRM_R03_T0 Arabidopsis thaliana O23044 0.304250 7 0.000407 48658.785156 GLFQSDSALTTNPTTLSNINR False _GLFQSDSALTTNPTTLSNINR_.2 0.377402 0.064631 -0.746827 9815.095549
99996 99996 S500-PSSS3-S04 G_D180330_S500-PSSS3-S04_MHRM_R03_T0 Arabidopsis thaliana O23044 0.304250 7 0.000407 48658.785156 KTFDLSYYQLVLK False _KTFDLSYYQLVLK_.2 0.862989 0.246905 -1.849602 70103.139684
99997 99997 S500-PSSS3-S04 G_D180330_S500-PSSS3-S04_MHRM_R03_T0 Arabidopsis thaliana O23044 NaN 11 NaN NaN TGAVGVSR True _TGAVGVSR_.2 NaN NaN -2.336857 4658.170465
99998 99998 S500-PSSS3-S04 G_D180330_S500-PSSS3-S04_MHRM_R03_T0 Arabidopsis thaliana O23044 NaN 11 NaN NaN ALTNPDTVR True _ALTNPDTVR_.2 NaN NaN -1.404710 8258.498615
99999 99999 S500-PSSS3-S04 G_D180330_S500-PSSS3-S04_MHRM_R03_T0 Arabidopsis thaliana O23044 NaN 11 NaN NaN FDGGIFAIK True _FDGGIFAIK_.2 NaN NaN -1.539223 3918.578812

100000 rows × 16 columns

2020-12-07 Exploring posteriors data

import os

os.chdir("/home/ptruong/git/bayesMS/data/triqler_default_params_posterios")
os.listdir()
['proteins.6vs8.tsv',
 'proteins.5vs9.tsv',
 'proteins.3vs5.tsv',
 'proteins.4vs10.tsv',
 'proteins.3vs10.tsv',
 'proteins.1vs4.tsv',
 'proteins.5vs6.tsv',
 'proteins.2vs8.tsv',
 'proteins.4vs6.tsv',
 'proteins.2vs6.tsv',
 'proteins.9vs10.tsv',
 'proteins.2vs3.tsv',
 'proteins.7vs8.tsv',
 'proteins.1vs2.tsv',
 'proteins.3vs8.tsv',
 'proteins.3vs6.tsv',
 'proteins.1vs8.tsv',
 'proteins.6vs10.tsv',
 'proteins.5vs10.tsv',
 'proteins.3vs4.tsv',
 'proteins.7vs9.tsv',
 'proteins.1vs7.tsv',
 'F_OUT',
 'proteins.2vs10.tsv',
 'proteins.3vs7.tsv',
 'proteins.8vs9.tsv',
 'proteins.4vs7.tsv',
 'proteins.7vs10.tsv',
 'proteins.2vs7.tsv',
 'proteins.6vs9.tsv',
 'proteins.1vs10.tsv',
 'proteins.4vs9.tsv',
 'proteins.5vs8.tsv',
 'proteins.2vs5.tsv',
 'G_OUT',
 'proteins.1vs9.tsv',
 'proteins.1vs3.tsv',
 'proteins.5vs7.tsv',
 'proteins.1vs6.tsv',
 'proteins.2vs9.tsv',
 'proteins.3vs9.tsv',
 'proteins.6vs7.tsv',
 'proteins.8vs10.tsv',
 'proteins.4vs5.tsv',
 'P_OUT',
 'proteins.4vs8.tsv',
 'proteins.1vs5.tsv',
 'proteins.2vs4.tsv']
import pandas as pd
import numpy as np
df = pd.read_csv("P_OUT", sep = "\t")
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
protein group:run -5 -4.99 -4.98 -4.97 -4.96 -4.95 -4.94 -4.93 ... 4.91 4.92 4.93 4.94 4.95 4.96 4.97 4.98 4.99 5
0 A0A023T4K3_CAEEL S01:S01_R01 7.645000e-40 8.081000e-40 8.542000e-40 9.029000e-40 9.544000e-40 1.009000e-39 1.066000e-39 1.127000e-39 ... 2.770000e-36 2.482000e-36 2.225000e-36 1.995000e-36 1.789000e-36 1.605000e-36 1.440000e-36 1.292000e-36 1.160000e-36 1.042000e-36
1 A0A023T4K3_CAEEL S01:S01_R02 3.036000e-40 3.209000e-40 3.392000e-40 3.586000e-40 3.790000e-40 4.006000e-40 4.235000e-40 4.476000e-40 ... 7.363000e-37 6.611000e-37 5.938000e-37 5.335000e-37 4.793000e-37 4.308000e-37 3.872000e-37 3.481000e-37 3.130000e-37 2.815000e-37
2 A0A023T4K3_CAEEL S01:S01_R03 7.332000e-40 7.750000e-40 8.192000e-40 8.659000e-40 9.153000e-40 9.675000e-40 1.023000e-39 1.081000e-39 ... 2.311000e-36 2.060000e-36 1.836000e-36 1.638000e-36 1.461000e-36 1.304000e-36 1.164000e-36 1.040000e-36 9.296000e-37 8.310000e-37
3 A0A023T4K3_CAEEL S01:S01_R04 9.449000e-44 9.988000e-44 1.056000e-43 1.116000e-43 1.180000e-43 1.247000e-43 1.318000e-43 1.393000e-43 ... 2.771000e-40 2.479000e-40 2.219000e-40 1.987000e-40 1.779000e-40 1.594000e-40 1.428000e-40 1.280000e-40 1.148000e-40 1.030000e-40
4 A0A023T4K3_CAEEL S01:S01_R05 1.423000e-35 1.504000e-35 1.590000e-35 1.681000e-35 1.777000e-35 1.878000e-35 1.985000e-35 2.098000e-35 ... 9.235000e-34 8.348000e-34 7.547000e-34 6.825000e-34 6.172000e-34 5.582000e-34 5.050000e-34 4.569000e-34 4.135000e-34 3.742000e-34
5 A0A023T4K3_CAEEL S02:S02_R01 1.345000e-30 1.421000e-30 1.502000e-30 1.588000e-30 1.679000e-30 1.774000e-30 1.876000e-30 1.982000e-30 ... 2.041000e-28 1.842000e-28 1.664000e-28 1.502000e-28 1.357000e-28 1.225000e-28 1.106000e-28 9.995000e-29 9.028000e-29 8.156000e-29
6 A0A023T4K3_CAEEL S02:S02_R02 5.609000e-34 5.929000e-34 6.267000e-34 6.624000e-34 7.002000e-34 7.401000e-34 7.823000e-34 8.269000e-34 ... 9.408000e-32 8.481000e-32 7.647000e-32 6.896000e-32 6.219000e-32 5.609000e-32 5.060000e-32 4.565000e-32 4.119000e-32 3.717000e-32
7 A0A023T4K3_CAEEL S02:S02_R03 5.444000e-35 5.755000e-35 6.083000e-35 6.430000e-35 6.796000e-35 7.184000e-35 7.594000e-35 8.026000e-35 ... 6.297000e-33 5.672000e-33 5.111000e-33 4.606000e-33 4.152000e-33 3.743000e-33 3.375000e-33 3.044000e-33 2.746000e-33 2.477000e-33
8 A0A023T4K3_CAEEL S02:S02_R04 2.411000e-35 2.548000e-35 2.693000e-35 2.847000e-35 3.009000e-35 3.181000e-35 3.362000e-35 3.554000e-35 ... 2.225000e-34 2.037000e-34 1.866000e-34 1.711000e-34 1.569000e-34 1.441000e-34 1.323000e-34 1.216000e-34 1.118000e-34 1.029000e-34
9 A0A023T4K3_CAEEL S02:S02_R05 1.096000e-29 1.158000e-29 1.224000e-29 1.294000e-29 1.368000e-29 1.446000e-29 1.528000e-29 1.615000e-29 ... 5.117000e-29 4.766000e-29 4.441000e-29 4.141000e-29 3.862000e-29 3.604000e-29 3.365000e-29 3.143000e-29 2.937000e-29 2.746000e-29
10 A0A023T4K3_CAEEL S03:S03_R01 7.973000e-21 8.427000e-21 8.908000e-21 9.415000e-21 9.952000e-21 1.052000e-20 1.112000e-20 1.175000e-20 ... 2.547000e-20 2.406000e-20 2.274000e-20 2.149000e-20 2.031000e-20 1.919000e-20 1.814000e-20 1.715000e-20 1.621000e-20 1.532000e-20
11 A0A023T4K3_CAEEL S03:S03_R02 1.251000e-21 1.323000e-21 1.398000e-21 1.478000e-21 1.562000e-21 1.651000e-21 1.745000e-21 1.845000e-21 ... 3.976000e-21 3.758000e-21 3.552000e-21 3.357000e-21 3.174000e-21 3.000000e-21 2.836000e-21 2.681000e-21 2.535000e-21 2.396000e-21
12 A0A023T4K3_CAEEL S03:S03_R03 1.081000e-21 1.143000e-21 1.208000e-21 1.277000e-21 1.350000e-21 1.427000e-21 1.508000e-21 1.594000e-21 ... 3.377000e-21 3.195000e-21 3.022000e-21 2.858000e-21 2.703000e-21 2.557000e-21 2.419000e-21 2.288000e-21 2.164000e-21 2.047000e-21
13 A0A023T4K3_CAEEL S03:S03_R04 3.798000e-21 4.015000e-21 4.243000e-21 4.485000e-21 4.741000e-21 5.011000e-21 5.297000e-21 5.599000e-21 ... 1.206000e-20 1.140000e-20 1.078000e-20 1.019000e-20 9.628000e-21 9.102000e-21 8.604000e-21 8.134000e-21 7.690000e-21 7.270000e-21
14 A0A023T4K3_CAEEL S03:S03_R05 1.110000e-20 1.174000e-20 1.240000e-20 1.311000e-20 1.386000e-20 1.465000e-20 1.548000e-20 1.637000e-20 ... 3.493000e-20 3.302000e-20 3.123000e-20 2.953000e-20 2.792000e-20 2.641000e-20 2.497000e-20 2.361000e-20 2.233000e-20 2.112000e-20
15 A0A023T4K3_CAEEL S04:S04_R01 1.058000e-16 1.118000e-16 1.182000e-16 1.249000e-16 1.321000e-16 1.396000e-16 1.476000e-16 1.560000e-16 ... 3.293000e-16 3.115000e-16 2.947000e-16 2.788000e-16 2.637000e-16 2.495000e-16 2.360000e-16 2.233000e-16 2.112000e-16 1.998000e-16
16 A0A023T4K3_CAEEL S04:S04_R02 2.719000e-18 2.875000e-18 3.038000e-18 3.212000e-18 3.395000e-18 3.588000e-18 3.793000e-18 4.009000e-18 ... 8.460000e-18 8.004000e-18 7.572000e-18 7.163000e-18 6.777000e-18 6.411000e-18 6.065000e-18 5.738000e-18 5.428000e-18 5.135000e-18
17 A0A023T4K3_CAEEL S04:S04_R03 5.098000e-17 5.389000e-17 5.696000e-17 6.021000e-17 6.364000e-17 6.727000e-17 7.111000e-17 7.516000e-17 ... 1.585000e-16 1.500000e-16 1.419000e-16 1.342000e-16 1.270000e-16 1.202000e-16 1.137000e-16 1.075000e-16 1.017000e-16 9.625000e-17
18 A0A023T4K3_CAEEL S04:S04_R04 2.126000e-17 2.247000e-17 2.376000e-17 2.511000e-17 2.654000e-17 2.806000e-17 2.965000e-17 3.135000e-17 ... 6.612000e-17 6.255000e-17 5.918000e-17 5.599000e-17 5.297000e-17 5.011000e-17 4.741000e-17 4.485000e-17 4.243000e-17 4.014000e-17
19 A0A023T4K3_CAEEL S04:S04_R05 6.677000e-16 7.058000e-16 7.460000e-16 7.885000e-16 8.335000e-16 8.810000e-16 9.312000e-16 9.843000e-16 ... 2.076000e-15 1.964000e-15 1.858000e-15 1.758000e-15 1.663000e-15 1.574000e-15 1.489000e-15 1.408000e-15 1.332000e-15 1.261000e-15
20 A0A023T4K3_CAEEL S05:S05_R01 4.752000e-15 5.023000e-15 5.309000e-15 5.612000e-15 5.932000e-15 6.270000e-15 6.627000e-15 7.005000e-15 ... 1.478000e-14 1.398000e-14 1.322000e-14 1.251000e-14 1.184000e-14 1.120000e-14 1.059000e-14 1.002000e-14 9.482000e-15 8.971000e-15
21 A0A023T4K3_CAEEL S05:S05_R02 1.389000e-15 1.468000e-15 1.552000e-15 1.640000e-15 1.734000e-15 1.833000e-15 1.937000e-15 2.047000e-15 ... 4.319000e-15 4.086000e-15 3.865000e-15 3.657000e-15 3.460000e-15 3.273000e-15 3.096000e-15 2.929000e-15 2.771000e-15 2.622000e-15
22 A0A023T4K3_CAEEL S05:S05_R03 4.499000e-15 4.755000e-15 5.027000e-15 5.313000e-15 5.616000e-15 5.936000e-15 6.275000e-15 6.632000e-15 ... 1.399000e-14 1.324000e-14 1.252000e-14 1.185000e-14 1.121000e-14 1.060000e-14 1.003000e-14 9.490000e-15 8.978000e-15 8.494000e-15
23 A0A023T4K3_CAEEL S05:S05_R04 2.276000e-15 2.406000e-15 2.543000e-15 2.688000e-15 2.841000e-15 3.003000e-15 3.174000e-15 3.355000e-15 ... 7.077000e-15 6.695000e-15 6.334000e-15 5.993000e-15 5.669000e-15 5.364000e-15 5.074000e-15 4.801000e-15 4.542000e-15 4.297000e-15
24 A0A023T4K3_CAEEL S05:S05_R05 2.830000e-15 2.992000e-15 3.162000e-15 3.343000e-15 3.533000e-15 3.735000e-15 3.948000e-15 4.173000e-15 ... 8.802000e-15 8.327000e-15 7.878000e-15 7.453000e-15 7.051000e-15 6.671000e-15 6.311000e-15 5.970000e-15 5.648000e-15 5.344000e-15
25 A0A023T4K3_CAEEL S06:S06_R01 4.436000e-15 4.689000e-15 4.956000e-15 5.238000e-15 5.537000e-15 5.853000e-15 6.186000e-15 6.539000e-15 ... 1.379000e-14 1.305000e-14 1.235000e-14 1.168000e-14 1.105000e-14 1.045000e-14 9.890000e-15 9.356000e-15 8.852000e-15 8.374000e-15
26 A0A023T4K3_CAEEL S06:S06_R02 2.629000e-15 2.778000e-15 2.937000e-15 3.104000e-15 3.281000e-15 3.468000e-15 3.666000e-15 3.875000e-15 ... 8.174000e-15 7.733000e-15 7.316000e-15 6.921000e-15 6.548000e-15 6.195000e-15 5.861000e-15 5.545000e-15 5.246000e-15 4.963000e-15
27 A0A023T4K3_CAEEL S06:S06_R03 4.414000e-15 4.665000e-15 4.931000e-15 5.212000e-15 5.509000e-15 5.824000e-15 6.156000e-15 6.507000e-15 ... 1.372000e-14 1.298000e-14 1.228000e-14 1.162000e-14 1.099000e-14 1.040000e-14 9.840000e-15 9.310000e-15 8.807000e-15 8.332000e-15
28 A0A023T4K3_CAEEL S06:S06_R04 4.286000e-15 4.530000e-15 4.788000e-15 5.061000e-15 5.350000e-15 5.655000e-15 5.977000e-15 6.318000e-15 ... 1.333000e-14 1.261000e-14 1.193000e-14 1.128000e-14 1.068000e-14 1.010000e-14 9.555000e-15 9.040000e-15 8.552000e-15 8.091000e-15
29 A0A023T4K3_CAEEL S06:S06_R05 3.043000e-16 3.216000e-16 3.400000e-16 3.594000e-16 3.799000e-16 4.015000e-16 4.244000e-16 4.486000e-16 ... 9.463000e-16 8.952000e-16 8.469000e-16 8.012000e-16 7.580000e-16 7.171000e-16 6.785000e-16 6.419000e-16 6.072000e-16 5.745000e-16
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
933870 decoy_V6CJA9_CAEEL S05:S05_R01 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933871 decoy_V6CJA9_CAEEL S05:S05_R02 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933872 decoy_V6CJA9_CAEEL S05:S05_R03 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933873 decoy_V6CJA9_CAEEL S05:S05_R04 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933874 decoy_V6CJA9_CAEEL S05:S05_R05 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933875 decoy_V6CJA9_CAEEL S06:S06_R01 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933876 decoy_V6CJA9_CAEEL S06:S06_R02 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933877 decoy_V6CJA9_CAEEL S06:S06_R03 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933878 decoy_V6CJA9_CAEEL S06:S06_R04 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933879 decoy_V6CJA9_CAEEL S06:S06_R05 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933880 decoy_V6CJA9_CAEEL S07:S07_R01 2.196000e-14 2.321000e-14 2.453000e-14 2.593000e-14 2.741000e-14 2.897000e-14 3.062000e-14 3.237000e-14 ... 6.828000e-14 6.460000e-14 6.111000e-14 5.782000e-14 5.470000e-14 5.175000e-14 4.896000e-14 4.632000e-14 4.382000e-14 4.145000e-14
933881 decoy_V6CJA9_CAEEL S07:S07_R02 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933882 decoy_V6CJA9_CAEEL S07:S07_R03 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933883 decoy_V6CJA9_CAEEL S07:S07_R04 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933884 decoy_V6CJA9_CAEEL S07:S07_R05 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933885 decoy_V6CJA9_CAEEL S08:S08_R01 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933886 decoy_V6CJA9_CAEEL S08:S08_R02 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933887 decoy_V6CJA9_CAEEL S08:S08_R03 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933888 decoy_V6CJA9_CAEEL S08:S08_R04 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933889 decoy_V6CJA9_CAEEL S08:S08_R05 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933890 decoy_V6CJA9_CAEEL S09:S09_R01 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933891 decoy_V6CJA9_CAEEL S09:S09_R02 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933892 decoy_V6CJA9_CAEEL S09:S09_R03 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933893 decoy_V6CJA9_CAEEL S09:S09_R04 1.585000e-14 1.675000e-14 1.770000e-14 1.871000e-14 1.978000e-14 2.091000e-14 2.210000e-14 2.336000e-14 ... 4.928000e-14 4.662000e-14 4.410000e-14 4.173000e-14 3.947000e-14 3.735000e-14 3.533000e-14 3.343000e-14 3.162000e-14 2.992000e-14
933894 decoy_V6CJA9_CAEEL S09:S09_R05 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933895 decoy_V6CJA9_CAEEL S10:S10_R01 1.485000e-14 1.569000e-14 1.659000e-14 1.753000e-14 1.853000e-14 1.959000e-14 2.071000e-14 2.189000e-14 ... 4.617000e-14 4.368000e-14 4.132000e-14 3.909000e-14 3.698000e-14 3.499000e-14 3.310000e-14 3.132000e-14 2.963000e-14 2.803000e-14
933896 decoy_V6CJA9_CAEEL S10:S10_R02 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933897 decoy_V6CJA9_CAEEL S10:S10_R03 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933898 decoy_V6CJA9_CAEEL S10:S10_R04 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14
933899 decoy_V6CJA9_CAEEL S10:S10_R05 2.340000e-14 2.473000e-14 2.614000e-14 2.763000e-14 2.921000e-14 3.087000e-14 3.263000e-14 3.449000e-14 ... 7.275000e-14 6.883000e-14 6.512000e-14 6.161000e-14 5.828000e-14 5.514000e-14 5.216000e-14 4.935000e-14 4.669000e-14 4.417000e-14

933900 rows × 1003 columns

Trying to run the code /home/ptruong/git/triqler/triqler/distribution/python plot_posteriors.py. The following relative imports:

from ..triqler import version, copyright from .. import parsers from .. import hyperparameters from .. import pgm from .. import diff_exp

Gives this error

Traceback (most recent call last):

File "plot_posteriors.py", line 18, in

from ..triqler import __version__, __copyright__

ValueError: attempted relative import beyond top-level package

How does this relative import work?

Plotting worked with:

python -m triqler.distribution.plot_posteriors --protein_id_list protein_list.csv P_OUT

OUTPUT:

Triqler.distribution.plot_posteriors version 0.6.0 Copyright (c) 2018-2020 Matthew The. All rights reserved. Written by Matthew The ([email protected]) in the School of Engineering Sciences in Chemistry, Biotechnology and Health at the Royal Institute of Technology in Stockholm. Issued command: plot_posteriors.py --protein_id_list protein_list.csv P_OUT Protein list posterior plotting not yet supported for protein posteriors

The input F_OUT and G_OUT seem to give the similar response. I need to check if the heatmap plot needs to be run manually.

2020-12-02

os.chdir("/home/ptruong/git/bayesMS/bin")
import os 

import pandas as pd
import numpy as np 
from read_triqler_output import read_triqler_protein_output_to_df
from triqler_output_df_melter import melt_spectronaut_triqler_formatted, melt_triqler_output
# data dir /home/ptruong/git/bayesMS/data/old_data_pickled
os.chdir("/home/ptruong/git/bayesMS/data/old_data_pickled")
os.listdir()
['triqler.pkl',
 'spectronaut.pkl',
 'triqlerParams.txt',
 'spectronautParams.txt']
spec = pd.read_pickle(r'spectronaut.pkl')
triq = pd.read_pickle(r'triqler.pkl')

Note: To read triqler output data we need to use

from read_triqler_output import read_triqler_protein_output_to_df

because the peptide seperation is the same as tab-seperation.

os.chdir("/home/ptruong/git/bayesMS/data/triqlerResults_largeScale_minSamp20_FC0_8_adjInt")
os.listdir()
['proteins.6vs8.tsv',
 'proteins.5vs9.tsv',
 'proteins.3vs5.tsv',
 'proteins.4vs10.tsv',
 'proteins.3vs10.tsv',
 'proteins.1vs4.tsv',
 'proteins.5vs6.tsv',
 'proteins.2vs8.tsv',
 'proteins.4vs6.tsv',
 'proteins.2vs6.tsv',
 'proteins.9vs10.tsv',
 'proteins.2vs3.tsv',
 'proteins.7vs8.tsv',
 'proteins.1vs2.tsv',
 'proteins.3vs8.tsv',
 'proteins.3vs6.tsv',
 'proteins.1vs8.tsv',
 'proteins.6vs10.tsv',
 'proteins.5vs10.tsv',
 'proteins.3vs4.tsv',
 'proteins.7vs9.tsv',
 'proteins.1vs7.tsv',
 'proteins.2vs10.tsv',
 'proteins.3vs7.tsv',
 'proteins.8vs9.tsv',
 'proteins.4vs7.tsv',
 'proteins.7vs10.tsv',
 'proteins.2vs7.tsv',
 'proteins.6vs9.tsv',
 'proteins.1vs10.tsv',
 'proteins.4vs9.tsv',
 'proteins.5vs8.tsv',
 'proteins.2vs5.tsv',
 'proteins.1vs9.tsv',
 'proteins.1vs3.tsv',
 'proteins.5vs7.tsv',
 'proteins.1vs6.tsv',
 'proteins.2vs9.tsv',
 'proteins.3vs9.tsv',
 'proteins.6vs7.tsv',
 'proteins.8vs10.tsv',
 'proteins.4vs5.tsv',
 'proteins.4vs8.tsv',
 'proteins.1vs5.tsv',
 'proteins.2vs4.tsv']
triq = read_triqler_protein_output_to_df('proteins.3vs8.tsv')
spec.columns
Index(['id', 'specie', 'protein', 'S01:S01_R01', 'S01:S01_R02', 'S01:S01_R03',
       'S01:S01_R04', 'S01:S01_R05', 'S02:S02_R01', 'S02:S02_R02',
       'S02:S02_R03', 'S02:S02_R04', 'S02:S02_R05', 'S03:S03_R01',
       'S03:S03_R02', 'S03:S03_R03', 'S03:S03_R05', 'S03:S04_R05',
       'S04:S04_R01', 'S04:S04_R02', 'S04:S04_R03', 'S04:S04_R04',
       'S04:S04_R05', 'S05:S05_R01', 'S05:S05_R02', 'S05:S05_R03',
       'S05:S05_R04', 'S05:S05_R05', 'S06:S06_R01', 'S06:S06_R02',
       'S06:S06_R03', 'S06:S06_R04', 'S06:S06_R05', 'S07:S07_R01',
       'S07:S07_R02', 'S07:S07_R03', 'S07:S07_R04', 'S07:S07_R05',
       'S08:S08_R01', 'S08:S08_R02', 'S08:S08_R03', 'S08:S08_R04',
       'S08:S08_R05', 'S09:S09_R01', 'S09:S09_R02', 'S09:S09_R03',
       'S09:S09_R04', 'S09:S09_R05', 'S10:S10_R01', 'S10:S10_R02',
       'S10:S10_R03', 'S10:S10_R04', 'S10:S10_R05'],
      dtype='object')

We notice naming error and rename the column S03:S04_R05 to S03:S03_R05

spec = spec.rename(columns={'S03:S04_R05': 'S03:S03_R04'})

triqler and spectroanut raw data is presented below. Both data set with 0.01 FDR treshold.

spec
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
id specie protein S01:S01_R01 S01:S01_R02 S01:S01_R03 S01:S01_R04 S01:S01_R05 S02:S02_R01 S02:S02_R02 ... S09:S09_R01 S09:S09_R02 S09:S09_R03 S09:S09_R04 S09:S09_R05 S10:S10_R01 S10:S10_R02 S10:S10_R03 S10:S10_R04 S10:S10_R05
0 A0A023T4K3_CAEEL CAEEL A0A023T4K3 203877.890625 190713.625000 195925.703125 150170.421875 163190.718750 97127.828125 195944.765625 ... NaN NaN NaN NaN NaN NaN 767819.750000 NaN NaN NaN
1 decoy_A0A023T4K3_CAEEL CAEEL decoy_A0A023T4K3 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 A0A061ACH4_CAEEL CAEEL A0A061ACH4 8781.009766 10325.387695 10948.747070 8789.944336 11853.339844 72.113838 3990.177002 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 A0A061ACL3_CAEEL CAEEL A0A061ACL3 298404.781250 296928.375000 309051.406250 325567.562500 277911.718750 144550.265625 142709.437500 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 A0A061ACR1_CAEEL CAEEL A0A061ACR1 336096.625000 354560.000000 378152.875000 359206.312500 387415.562500 187493.328125 200906.875000 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15009 V6CLV0_CAEEL CAEEL V6CLV0 803615.500000 860085.062500 897633.250000 971245.750000 834293.187500 400180.968750 399113.750000 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15010 V6CM07_CAEEL CAEEL V6CM07 18498.939453 21163.210938 19706.724609 28860.738281 16341.950195 13379.542969 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
15011 X5JA13_ARATH ARATH X5JA13 141788.171875 102453.429688 97403.984375 92313.593750 101041.617188 125743.546875 114827.593750 ... 164211.046875 85307.000000 88487.984375 95901.960938 81865.812500 127238.65625 110625.953125 98996.132812 98382.304688 81635.281250
15012 X5JB51_ARATH ARATH X5JB51 509657.687500 625327.125000 9751.568359 4141.071289 25130.304688 55028.699219 140951.250000 ... 164315.000000 51476.402344 46677.218750 41652.757812 7216.791992 36751.68750 59819.433594 46432.910156 3826.284180 48352.703125
15013 X5M5N0_CAEEL CAEEL X5M5N0 NaN 27413.113281 21796.166016 8100.251465 64.185448 NaN NaN ... NaN NaN 370496.218750 NaN NaN NaN NaN NaN NaN NaN

15014 rows × 53 columns

triq
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
q_value posterior_error_prob num_peptides protein_id_posterior_error_prob log2_fold_change diff_exp_prob_1.0 S01:S01_R01 S01:S01_R02 S01:S01_R03 S01:S01_R04 ... S10:S10_R02 S10:S10_R03 S10:S10_R04 S10:S10_R05 peptides specie decoy id protein FDR
0 1.284000e-10 1.284000e-10 465 1.284000e-10 -11.47000 2.924000e-15 0.000238 0.000244 0.000480 0.000195 ... 2.4480 2.4250 2.5010 2.4420 VKMPDVDISVPK|FKMPFLSISSPK|FKMPEINIK|ISM[16]SEV... HUMAN False Q09666_HUMAN Q09666 1.284000e-10
1 1.420000e-10 1.556000e-10 421 1.556000e-10 -10.43000 8.442000e-14 0.000804 0.000450 0.000397 0.000422 ... 2.4030 2.2490 2.2650 2.3440 SLQEEHVAVAQLR|EAEQEAARR|KQEELQQLEQQR|TISLVIR|I... HUMAN False Q15149_HUMAN Q15149 1.420000e-10
2 2.176000e-10 3.686000e-10 311 3.686000e-10 -10.21000 2.266000e-14 0.000641 0.000293 0.000492 0.001091 ... 2.4610 2.2850 2.3040 2.2670 EM[16]KPVIFLDVFLPR|LLNFLMK|NLLIFENLIDLKR|YKEVY... HUMAN False P78527-2_HUMAN P78527-2 2.175333e-10
3 2.622000e-10 3.960000e-10 314 3.936000e-10 -9.47600 2.345000e-12 0.001133 0.000662 0.000671 0.001489 ... 2.2670 2.1890 2.1260 2.0830 LKVNFLPEIITLSK|LSLSNAISTALPLTQLR|INMLVIELK|WAI... HUMAN False Q14204_HUMAN Q14204 2.615500e-10
5 5.753000e-10 1.204000e-09 238 1.183000e-09 -8.84800 2.112000e-11 0.001145 0.002007 0.001241 0.001354 ... 2.2320 2.0110 1.9700 2.0390 GRSEADSDKNATILELR|NGVGTSSSMGSGVSDDVFSSSR|NLPLA... HUMAN False P15924_HUMAN P15924 4.458400e-10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9425 5.113000e-01 9.990000e-01 3 1.116000e-01 0.21620 9.989000e-01 1.135000 1.156000 1.031000 1.021000 ... 1.0090 1.0090 1.0090 1.0090 ESGDQLSVSR|GIGLGLVQQLVK|NAALTVEQSTAELISSFNK CAEEL False P90780_CAEEL P90780 9.970788e-03
6074 2.684000e-01 9.112000e-01 1 1.118000e-01 0.27480 9.000000e-01 1.920000 2.049000 1.938000 2.441000 ... 0.9519 0.9519 0.9519 0.9519 RIGYGIK CAEEL False Q9TVW5_CAEEL Q9TVW5 9.977931e-03
11844 5.984000e-01 9.999000e-01 3 1.119000e-01 0.02839 9.999000e-01 0.988800 0.981800 0.988800 0.988800 ... 1.0370 1.0370 0.8964 1.1400 VACGEGATTEKRPAEAAVEL|DLKIPSEDLSNLVEQCLEK|SFEYC... HUMAN True decoy_O96017_HUMAN O96017 9.985081e-03
9835 5.295000e-01 9.994000e-01 3 1.121000e-01 -0.06439 9.994000e-01 0.904700 0.831100 0.747000 0.765600 ... 1.0830 1.0660 1.0940 1.0090 TGAALLIVPR|VVDTETGISLPR|SQGSQLTEDDVK ARATH False Q9LU36_ARATH Q9LU36 9.992243e-03
14934 6.686000e-01 1.000000e+00 1 1.122000e-01 -0.04344 1.000000e+00 1.076000 1.089000 1.083000 0.895800 ... 1.0540 1.0590 1.0180 0.9144 AVLTLSK CAEEL False P34479_CAEEL P34479 9.999411e-03

14258 rows × 62 columns

The reported protein expressions per run are the expected value of the protein's expression in that run. They represent relative values (not log transformed) to the protein's mean expression across all runs, which itself would correspond to the value 1.0. For example, a value of 1.5 means that the expression in this sample is 50% higher than the mean across all runs. A second example comparing values across samples: if sample1 has a value of 2.0 and sample2 a value of 1.5, it means that the expression in sample1 is 33% higher than in sample2 (2.0/1.5=1.33). We don't necessarily recommend using these values for downstream analysis, as the idea is that the actual value of interest is the fold change between treatment groups rather than between samples.

The triqler proteins are relative values, so we need to transform spectroanut values to relative values.

We should also treshold the triqler results on protein_id_posterior_error_prob

print(sum(triq.protein_id_posterior_error_prob < 0.01))
11162
print("%s : %i" % ("triqler number of protein ids", int(sum(triq.protein_id_posterior_error_prob < 0.01))))
print("%s : %i" % ("spectronaut number of protein ids", int(len(spec))))
triqler number of protein ids : 11162
spectronaut number of protein ids : 15014

Spectronaut has higher than protein count, but looking at the data above we can see that there are plenty of NaN among samples in the spectronaut. We need to fix this.

We could do this by saying that proteins with more than 2 or 3 NaN in a samples if left out.

We wrangle triq and spec data to melted df for easier data manipulation. Scripts for this can be found in

"from triqler_output_df_melter import melt_spectronaut_triqler_formatted, melt_triqler_output"

triq = melt_triqler_output(triq)
spec = melt_spectronaut_triqler_formatted(spec)
spec
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
id specie protein sample run value
0 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R01 203877.890625
1 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R02 190713.625000
2 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R03 195925.703125
3 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R04 150170.421875
4 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R05 163190.718750
... ... ... ... ... ... ...
750695 X5M5N0_CAEEL CAEEL X5M5N0 S10 R01 NaN
750696 X5M5N0_CAEEL CAEEL X5M5N0 S10 R02 NaN
750697 X5M5N0_CAEEL CAEEL X5M5N0 S10 R03 NaN
750698 X5M5N0_CAEEL CAEEL X5M5N0 S10 R04 NaN
750699 X5M5N0_CAEEL CAEEL X5M5N0 S10 R05 NaN

750700 rows × 6 columns

spec
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
id specie protein sample run value
0 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R01 203877.890625
1 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R02 190713.625000
2 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R03 195925.703125
3 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R04 150170.421875
4 A0A023T4K3_CAEEL CAEEL A0A023T4K3 S01 R05 163190.718750
... ... ... ... ... ... ...
750695 X5M5N0_CAEEL CAEEL X5M5N0 S10 R01 NaN
750696 X5M5N0_CAEEL CAEEL X5M5N0 S10 R02 NaN
750697 X5M5N0_CAEEL CAEEL X5M5N0 S10 R03 NaN
750698 X5M5N0_CAEEL CAEEL X5M5N0 S10 R04 NaN
750699 X5M5N0_CAEEL CAEEL X5M5N0 S10 R05 NaN

750700 rows × 6 columns

Now, we can work with the melted dataframes.

Summary :

  • made script for reading in triqler output.
  • melting script implemented.
  • run jobs on kebnekaise with posterior distributions output and default parameters.
  • run jobs on kebnekaise with posterior distributions output and min_samples = 10.
  • run jobs on kebnekaise with posterior distributions output and min_samples = 1 (Failed: min_samples must be >= 2).

To do:

  • check posterior distribution output format.
  • figure out how posterior distribution heatmap plotting works in triqler.
  • think about normalization for melted dataframes spec and triq.

2020-11-09 Saturday

spectronautFile: ../data/500-PSSS3-raw-reformatted_dropna_dropdup_decoy_nonShared_again.csv FDR_treshold: 0.01 Impute: None Global_impute: False

triqlerFile: ../data/triqlerResults_largeScale_minSamp20_FC0_8_adjInt/proteins.XvsY.tsv FDR_treshold: 0.01

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
---------------------------------------------------------------------------

ModuleNotFoundError                       Traceback (most recent call last)

<ipython-input-2-11ddc2afe7eb> in <module>
      1 import pandas as pd
      2 import numpy as np
----> 3 import matplotlib.pyplot as plt


ModuleNotFoundError: No module named 'matplotlib'
def read_in_triqler_x_vs_y_data(filename):
    #filename = "proteins.1vs4.tsv"
    f = open(filename, "r")
    cols = f.readline().split("\n")[0].split("\t")
    n_cols = len(cols)
    vals = []
    for i in f:
        val = i.split("\n")[0].split("\t")[0:n_cols-1]
        peptides = i.split("\n")[0].split("\t")[n_cols-1:]
        val.append(";".join(peptides))
        vals.append(val)
    
    return pd.DataFrame(vals, columns = cols)  
# Readin spectronau
file_dir = "~/git/bayesMS/data/old_data_pickled/"
spectronaut = pd.read_pickle(file_dir + "spectronaut.pkl")
# Remove decoy
spectronaut["decoy"] = spectronaut.protein.str.strip().str[:5]
spectronaut = spectronaut[spectronaut["decoy"] != "decoy"]
# Readin triqler 

#triqler = pd.read_pickle("triqler.pkl")
file_dir = "/home/ptruong/git/bayesMS/data/triqlerResults_largeScale_minSamp20_FC0_8_adjInt/"
filename = "proteins.2vs6.tsv" 
triqler = read_in_triqler_x_vs_y_data(file_dir + filename)#Choose the sample
triqler["specie"] = triqler.protein.str.strip().str[-5:]
# Remove decoy
triqler["decoy"] = triqler.protein.str.strip().str[:5]
triqler = triqler[triqler["decoy"] != "decoy"]
species = ["ARATH", "HUMAN", "CAEEL"]
cols = spectronaut.columns[3:]
cols = pd.DataFrame(cols)

#Sample S02 
specie = 2
sample = 1 # +1 on the sample

max_nan = 0 # set the number of nan
species_spec = spectronaut[spectronaut["specie"] == species[specie]]
species_spec = species_spec.set_index("protein")
start_of_sample_spectronaut = range(2,52,5) # start of samples
sample_s = species_spec.ix[:, start_of_sample_spectronaut[sample]:start_of_sample_spectronaut[sample]+5]
sample_s # 4210
sample_s = sample_s[sample_s.isnull().sum(axis=1) <= max_nan]

sample_s["mean"] =  sample_s.mean(axis=1)
sample_s["median"] = sample_s.median(axis=1)
sample_s["min"] = sample_s.min(axis=1)
sample_s["max"] = sample_s.max(axis=1)



start_of_sample_triqler = range(6,52,5) # start of samples
species_triq = triqler[triqler["specie"] == species[specie]]
species_triq = species_triq.set_index("protein")
sample_t = species_triq.ix[:, start_of_sample_triqler[sample]:start_of_sample_triqler[sample]+5]
sample_t = sample_t.astype(float)

sample_t["mean"] =  sample_t.mean(axis=1)
sample_t["median"] = sample_t.median(axis=1)
sample_t["min"] = sample_t.min(axis=1)
sample_t["max"] = sample_t.max(axis=1)

#reindex sample_t
rename_idx = dict(zip(sample_t.index, sample_t.index.str.strip().str[:-6]))
sample_t = sample_t.rename(index=rename_idx)

# Find overlapping proteins
overlap_set = sample_s.index.intersection(sample_t.index)



/home/ptruong/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:13: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  del sys.path[0]
/home/ptruong/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:27: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
# Random sample of n
n = 20
protein_sample = pd.DataFrame(overlap_set).sample(n)
protein_sample

sub_sample_t = sample_t.ix[protein_sample.protein]
sub_sample_s = sample_s.ix[protein_sample.protein]


#sample_s["protein"] = sample_s.index
#sample_s 
#sample_t["protein"] = sample_t.index
#sample_t

sub_sample_t["protein"] = sub_sample_t.index
sub_sample_s["protein"] = sub_sample_s.index
/home/ptruong/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:6: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  
/home/ptruong/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:7: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  import sys
sub_sample_t
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
S02:S02_R01 S02:S02_R02 S02:S02_R03 S02:S02_R04 S02:S02_R05 mean median min max protein
protein
Q23205 0.9859 1.2550 1.2840 1.2040 0.7995 1.10568 1.15484 0.7995 1.2840 Q23205
Q2V0X4 0.9297 0.9930 0.9796 1.0570 0.8322 0.95830 0.96895 0.8322 1.0570 Q2V0X4
P34703 0.7671 0.7828 0.8199 0.7757 0.7439 0.77788 0.77679 0.7439 0.8199 P34703
O61793 1.6680 1.6140 1.5020 1.3250 1.4030 1.50240 1.50220 1.3250 1.6680 O61793
Q9U2V9 2.4630 2.2590 2.0090 2.1940 1.6910 2.12320 2.15860 1.6910 2.4630 Q9U2V9
G5ECL3 1.1710 0.8818 1.0340 0.9105 0.7678 0.95302 0.93176 0.7678 1.1710 G5ECL3
Q9N588 0.7893 1.7660 1.3230 1.0530 0.3292 1.05210 1.05255 0.3292 1.7660 Q9N588
Q20636 1.0430 1.0030 1.0030 1.1770 0.8978 1.02476 1.01388 0.8978 1.1770 Q20636
Q9TZC4 1.0920 1.2220 1.1830 1.1560 1.2520 1.18100 1.18200 1.0920 1.2520 Q9TZC4
O62146 10.1200 8.2050 7.9570 7.8600 9.6830 8.76500 8.48500 7.8600 10.1200 O62146
Q4PIU9 3.3510 3.4500 3.2490 3.4240 3.0200 3.29880 3.32490 3.0200 3.4500 Q4PIU9
Q9TZD9 0.9299 0.8928 0.9383 0.9931 0.9040 0.93162 0.93076 0.8928 0.9931 Q9TZD9
A0A1X7RC97 1.1410 1.0730 1.0850 0.9977 0.9331 1.04596 1.05948 0.9331 1.1410 A0A1X7RC97
O02286 18.3500 19.1700 19.4100 19.6700 18.8400 19.08800 19.12900 18.3500 19.6700 O02286
P19626 11.8000 11.2600 11.6900 12.4900 11.4700 11.74200 11.71600 11.2600 12.4900 P19626
Q9NAN1 0.8104 0.7525 0.6352 0.7612 0.7624 0.74434 0.75685 0.6352 0.8104 Q9NAN1
Q9N585 0.5011 0.8069 0.4862 0.5113 0.6124 0.58358 0.54744 0.4862 0.8069 Q9N585
O17268 1.0740 0.9896 0.9896 1.0120 0.9896 1.01096 1.00028 0.9896 1.0740 O17268
P90795 0.7129 1.0050 0.9680 0.4897 0.3489 0.70490 0.70890 0.3489 1.0050 P90795
Q20228 10.8100 11.1400 11.3000 11.2300 12.2800 11.35200 11.26500 10.8100 12.2800 Q20228
#Triqler is already run with a FRD treshold

#sns ts plot to to visualize diff
ax = sub_sample_s.plot(x="protein", y = ["mean", "median"])
plt.fill_between(x = "protein", y1 = "min", y2 = "max", data = sub_sample_s, facecolor='green')
plt.title("Spectronaut - random samples")

ax = sub_sample_t.plot(x="protein", y = ["mean", "median"])
plt.fill_between(x = "protein", y1 = "min", y2 = "max", data = sub_sample_t, facecolor='green')
plt.title("Triqler - random samples")
Text(0.5,1,'Triqler - random samples')

png

png

Code reads in triqler and spectronaut data. Removed decoy proteins. Compute mean, median, min, max and code plot function for mean and median (as lines) and fill-in with min-max borders.

The code samples n random overlapping proteins from triqler and spectroanut and plots mean and median with min-max borders.

Things to check in the code:

  • Did i forget to treshold triqler data on q-values?
  • What is the unit of triqler protein quantification? (Why is there a magnitude of difference)
  • What was the normalization used in previous attempt? (within sample normalization, and why did that make sense?)
  • Is there a better way to compare than the previous normalization?
  • I guess it is between sample relationship that matter?
  • Read up on spike-in proteomics?

2020-11-06 Friday

Data-dependent Acquisition and Data Independent Acquisition (DIA).

In data-dependent acquisition (DDA), a protein sample is digested into peptides, ionized and analyzed by mass spectrometry. Peptide signals that rise above a certain treshold (noise level) in a mass spectrum are selected for fragmentation, producing tandem mass spectra (MS/MS) that can be matched to spectra in a database. The mass spectrometer randomly samples peptides for fragmentation and is biased to pick those with strongest signal, which makes it problematic to reproducibly quantify low-abundance peptides (which could hold biological value).

Data-independent acquisition is an approach to acquisition in Mass Spectrometry which fragments all peptides within a defined mass-to-charge (m/z) windows (as opposed to a select narrow window around "the strongest signal"). The analysis is repeated as the mass spectrometer marches up the full m/z range, which results in accurate peptide quantificaiton without being limited to profiling predefined peptied of interest.

A potential drawback of the DIA is the existence of multiple peptides in an m/z window. The fragmentation of multiple peptides results in chimeric (multiplexed) spectra which are more complex than single peptide spectra. Procedures for deconvoluting these spectra are required. Two methods for DIA are (OpenSWATH)[https://pubmed.ncbi.nlm.nih.gov/24727770/] and (ISOQUANT)[https://www.nature.com/articles/nmeth.2767].

2020-10-20 Tuesday

Thoughts about how to approach the problem.

The problem was that different imputation methods could severly skew the results. How I could go about this problem is to compare different imputation methods with the result of triqler and argue why the results are bad for spectronaut and good/ok for triqler in each case. E.g. with protein counts, with boxplots etc.

One problem was that triqler had "bad" results for samples where there where low or no samples, because then the protein intensity got close to the prior, which is based on empirical means (meaning that the intensities are much higher than they should be). We could truncate the results, but in a real world case we would not know the samples so we would not be able to remove the results.

Things to do:

  • Think about how the data would be generated in real case, and perform analysis based on this.
  • Think about how to handle zero samples.

2020-10-19 Monday

Starting up again.

I just started checking into this project again. Let's start from scratch to get this correct from start. I will check through all the mails related to this.

Annotations that is good to know:

  • PG - Protein Group.
  • EG - Elution Group (modified peptide, including charge state).
  • FG - Fragment Group (modified peptide, including charge state).

/data/Headers.xlsx - contains column info.

R.FileName is the columns for MS-measurement

Old mail containing relevant information.

All the data is on bose:/media/hdd/matthew/mergespec/data/spectronaut/ The files from the latest try are named PSSS3_triqler_input_renormalized.tsv (triqler input file) and PSSS3_triqler_output_proteins.vs.tsv

The script I used for converting the spectronaut files (they have an xls extension, but they're actually just tab separated files): https://github.com/statisticalbiotechnology/mergespec/blob/master/bin/bayesquant/convert_spectronaut_to_triqler_input.py

The version of triqler I used to generate the data is on a branch called large_scale_optimizations: https://github.com/statisticalbiotechnology/triqler/tree/large_scale_optimizations However, I'm not entirely sure if this is the same version that I used to generate the files, since I made some changes while working on this. So, if you run triqler yourself with this branch the results might be different.

Some issues I had to generate a report comparing Triqler to Spectronaut were:

  1. we either have their original results (500-PSSS3-precursor_Report.xls), which has a column "PG.Quantity" that contain the protein concentrations, but has missing values as it has been filtered on some FDR. Alternatively, we have their results with decoys ('S500-PSSS3-equ decoy_Report.xls'), which is not filtered on FDR and does not contain missing values, but does not contain a column with the protein quantity and I don't really know how they summarized peptide quantities to protein quantities. The first option seems more reasonable, as this is what they would normally report.
  2. with these original results, we have to choose a missing value strategy and this will most likely give them either an unfair disadvantage (impute row average) or an unfair advantage (impute lowest observed value). Imputing the lowest observed value seems to be most in line with the DIA approach and I already created a file with protein concentrations using this strategy (500-PSSS3-precursor_Report.proteins.tsv) which could be useful. For the Triqler results, you can use any of the PSSS3_triqler_output_proteins.vs.tsv (the columns we'll use don't change for different and ) and take the columns starting with "S01:S01_R01".

Future steps:

  • Sort the proteins by pearson correlation(?) between the true concentrations and predicted concentrations and create graph with the correlation on the x-axis and the number of proteins on the y-axis. Note that the protein concentrations for spectronaut are not log2 transformed, whereas the Triqler protein concentrations are log2 transformed. Also, note that the Triqler results include proteins with high identification PEP, we thus might want to filter the list of proteins on e.g. 1% protein-identification FDR. This should all be relatively easy to do.
  • Check if the value of the lowest observed imputed value for the spectronaut data (currently 139.428100585938) matters, since it will most likely affect the pearson correlation, especially if we do a log2 transformation first.
  • Check the influence of the number of allowed missing values. The spectronaut data does not seem to include a limit for the number of missing values, while I think I allowed up to 25 (out of 50 samples) missing values per peptide for Triqler. It's a bit hard to compare though, since the missing values for spectronaut would only work on protein level, whereas the missing values for triqler are on peptide level.

Correction to mail:

Small correction to the previous mail, I actually allowed up to 35 missing values for Triqler. The concentrations for the C Elegans were dropping off so fast that such a large number was necessary (https://github.com/statisticalbiotechnology/mergespec/blob/735ed743924fa705cb2c1f64509e5a1540add574/bin/bayesquant/calibration.py#L63).

Information about the normalization of the data: It should have been normalized (locally over the RT gradient) based on the constant Arabidopsis background.

I have generated a new report with now global median normalized data (based on the identified peptides of the constant Arabidopsis background), but also added the unnormalized quantities.

Additionally, the data is FDR unfiltered. You can filter the data by PG.Qvalue (protein group FDR) and EG.Qvalue (Precursor FDR) to get only the filtered data.

And an equal number of decoy are present, can be selected by EG.IsDecoy column.

File: 500-PSSS3-equ decoy_Report-V2-raw-and-normal.zip

According to a exploratory heatmap on this mail the normalization of the data is fine. Although, I do not recall how a heatmap indicates fine normalization.

Some info about the data

The PG.Quantity is based on a mix of Top3 intense peptides and reproducibility of identification.

Answer too my question about FG.NormalizedMS2PeakArea = 1.0

My Question: "How should we interpret non-decoy peptides with FG.NormalizedMS2PeakArea = 1.0? The PG.Quantities for some of these that I have seen seems to be NaNs. For example like these FG.NormalizedMS2PeakArea."

alues of 1 can come from two sources:

In Spectronaut, small values for quantities (<1) are set to one, these arise from small noise peaks or from local normalization effects.

In both cases were the signals noise or close to noise.

It mostly arises due to the fact that the dynamic range of MS1 and MS2 are not necessarily the same.

In Spectronaut MS1 and MS2 information is used for identification and it can be that one layer is enough. So the quantitative information of the other layer can be very low.

How to approach this problem from my current position.

I have lot of triqler output, but it is probabily better that i redo and generate new results since I am have forgotten how these was generated, which could cause further problems down the line.

Also, I just noted that the triqler has been updated has new output options for posterior distribution.

One remaining question is still, how to we report NaNs for the Spectronaut results.

I should also ask my PI if the PSSS3 results from last year are still relevant.

2020-10-18 Sunday

Setting up a Rmarkdown log for this project.

Historical log

https://patruong.github.io/bayesProtQuant/