---
title: "workshop"
output: html_document
date: "2022-08-24"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, warning=FALSE}
library(tidyverse)
library(ggplot2)
library(QFeatures)
library(msqrob2)
```
# Introduction
This part of the workshop focuses on the data analysis pipeline that comes after the search engine, from quantification to differential expression analysis. The pipeline is designed to work with ionbot output, but it is of course only one example: different parts of the pipeline can be swapped for other pieces of software as desired.
We will work with the same dataset as in the first part of the workshop. In the paper, the authors worked with archaeological samples and with modern samples (dental calculus and plaque). Here, we will try to find out whether any proteins are differentially expressed between those groups.
# Quantification with FlashLFQ
FlashLFQ is a high-speed label-free quantification algorithm developed by Millikin et al. at the University of Wisconsin. In future releases of ionbot, the user will be able to request FlashLFQ input directly from the ionbot search engine, making it easier to go straight to the quantification step. For now, we provide you with the FlashLFQ input file based on the ionbot search results. For details on how this was done, see: https://github.com/sdgroeve/ionbot.quant
FlashLFQ has a GUI, can be run from the command line, has a Docker image, and can be installed via Bioconda (which we have done here). You can run the FlashLFQ command line tool from within an R Markdown file by adding a bash chunk. To run it on Linux, you need .NET Core. To see all the arguments, run it with the --help parameter.
```{bash}
FlashLFQ --help
```
Let's keep it simple for now and try to quantify two files from the project. You can find the file containing the identification input data under /Data_part2 and the RAW files under /Data_part2/RAW/.
There are multiple parameters whose arguments you can change (see above). You need to specify the identification file(s) and the spectrum files. If you work with Thermo RAW files, you also need to accept their license (parameter --ath). You can also add an output directory and choose whether you want match-between-runs enabled.
Let's try this out now!
```{bash}
FlashLFQ --idt ~/Data_part2/flashlfq_input_.tsv --rep ~/Data_part2/RAW/ --out ~/FlashLFQ_out/ --ppm 5 --mbr --ath
```
While this is running, you can find a short explanation of how FlashLFQ works below. See also their GitHub wiki page: https://github.com/smith-chem-wisc/FlashLFQ/wiki/ and their paper: https://doi.org/10.1021/acs.jproteome.7b00608
In MS-based proteomics experiments, there are two distinct ways of quantification: label-based and label-free. In label-based quantification, stable isotope labels are incorporated into peptides. The mass spectrometer can recognise the mass difference between the labelled and unlabelled versions of a peptide, which is used to compare their respective signal intensities. In label-free quantification, the MS1 ion intensity is used as a proxy for the peptide concentration in the sample.
Here, we focus on label-free quantification.
To be able to use the MS1 ion intensity as a proxy for peptide abundance, we need software that is able to detect the MS1 peaks and read out their intensity.
FlashLFQ detects and quantifies chromatographic peaks and reports either apex or integrated intensity of each peak.
It tries to improve on existing LFQ algorithms by making use of indexing, a method to categorize information into lookup-tables based on its properties, to drastically increase the speed of such algorithms. The indexing approach results in rapid chromatographic peak detection by storing each MS1 peak in the lookup-table according to its m/z if it matches an identification's m/z. The latter of course means that the MS2 spectra must have already been searched by a search engine.
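As a toy illustration of this indexing idea (our own sketch, not FlashLFQ's actual code; the 0.01 bin width is an arbitrary assumption), MS1 peaks can be binned by m/z so that candidate peaks for an identification are found with a direct lookup instead of a scan over all peaks:
```{r}
# Toy sketch of m/z indexing (not FlashLFQ code): bin MS1 peaks by rounded m/z
# so that peaks near an identified peptide's m/z can be looked up directly.
mz_bin <- function(mz, bin_width = 0.01) as.character(round(mz / bin_width))
peaks <- data.frame(mz = c(500.261, 500.262, 734.118),
                    intensity = c(1e6, 2e6, 5e5))
index <- split(peaks, mz_bin(peaks$mz))
# All peaks in the bin of an identification at m/z 500.26:
index[[mz_bin(500.26)]]
```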
Go to the output directory you specified. Which outputs do you find?
In this workflow, we choose to work with the peptides file, because the robust model-based summarisation that we will use (see below) often outperforms other summarisation methods.
We have already done the quantification for the entire project for you.
# Lowest Common Ancestor calculation with Unipept (optional step)
https://unipept.ugent.be/
https://doi.org/10.1021/acs.jproteome.8b00716
At the end of the data analysis pipeline you might want to assess from which bacterial species the differential proteins (if any) originate.
To identify which peptides belong to which taxa, we will use Unipept.
Unipept is an open-source web application developed at Ghent University for metaproteomics data analysis, with a focus on interactive data visualisations. Unipept is powered by an index containing all UniProt entries, a tweaked version of the NCBI taxonomy, and a custom lowest common ancestor algorithm. This combination enables a blazingly fast biodiversity analysis of large and complex metaproteome samples. The Unipept functionality can be accessed via an API, a desktop app, the website, and a set of command line tools.
To get you acquainted with Unipept, we will go to the website and analyse a small set of peptides.
You can find the small set of peptides in the following file: /Data_part2/peptidesUniPept.txt
Go to the website (https://unipept.ugent.be/), click on the tab "Metaproteomics Analysis" and copy-paste the peptides under Peptide list.
Add the dataset to selected datasets and click search.
You can explore the output after a few seconds.
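If you prefer to stay in R, the same lowest common ancestor analysis can also be run programmatically against the Unipept API. A minimal sketch (assuming the jsonlite package is installed; see https://unipept.ugent.be/apidocs for the full API documentation):
```{r, eval=FALSE}
# Query the Unipept pept2lca endpoint for the lowest common ancestor
# of a single tryptic peptide (requires an internet connection).
library(jsonlite)
lca <- fromJSON(paste0(
  "https://api.unipept.ugent.be/api/v1/pept2lca.json",
  "?input[]=AALTER&equate_il=true"
))
lca
```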
# Data analysis with QFeatures and msqrob2
We will now start with the data analysis.
First we will import the data into R, then perform some preprocessing, summarise to the protein level, and finally do the actual modelling, inference, and differential abundance analysis.
For this pipeline we will make use of (amongst others) two dedicated R packages: QFeatures and msqrob2.
You can find more information via the links below:
msqrob papers:
- https://doi.org/10.1074/mcp.M115.055897
- https://doi.org/10.1016/j.jprot.2017.04.004
- https://doi.org/10.1074/mcp.RA119.001624
- https://doi.org/10.1021/acs.analchem.9b04375
Note that there is also a GUI for the msqrob2 package in which you can do the same analysis as is written in R code here.
QFeatures package:
https://github.com/rformassspectrometry/QFeatures
## Import the data in R
### Data infrastructure
QFeatures: http://bioconductor.org/packages/release/bioc/html/QFeatures.html
We use the `QFeatures` package that provides the infrastructure to
- store,
- process,
- manipulate and
- analyse quantitative data/features from mass spectrometry experiments.
It is based on the `SummarizedExperiment` and `MultiAssayExperiment` classes.
Assays in a QFeatures object have a hierarchical relation:
- proteins are composed of peptides,
- themselves produced by spectra,
- relations between assays are tracked and recorded throughout data processing.
You can find a visual representation of a QFeatures object and the aggregative relation between different assays below.
```{r pressure, echo=FALSE, fig.cap="Representation of a QFeatures object and the aggregative relation between different assays", out.width = '80%'}
knitr::include_graphics("~/Data_part2/QFeaturesRepresentation.png")
```
1. We use the QuantifiedPeptides.tsv file from the MS data quantified with FlashLFQ, which contains MS1 intensities summarised at the peptide level. This file covers the entire dataset and already contains the Unipept LCA annotations.
```{r}
peptidesFile <- "~/Data_part2/QuantifiedPeptidesUniPept.tsv"
```
2. The QFeatures object needs to know where the intensity columns reside within the peptidesFile. Many LFQ tools store the intensity data for the different samples in columns whose names start with or contain "Intensity". We can use this to retrieve the column names of the intensity data with the code below:
```{r}
ecols <- grep("Intensity", names(read.delim(peptidesFile)))
```
3. Read the data and store it in a QFeatures object
```{r}
pe <- readQFeatures(
  table = peptidesFile,
  # column to be used as feature names
  fnames = 1,
  # here we give the information about the intensity columns
  ecol = ecols,
  name = "peptideRaw",
  sep = "\t"
)
```
### Explore object
The rowData contains information on the features (peptidoforms) in the assay, e.g. sequence, protein, ...
```{r}
head(rowData(pe[["peptideRaw"]]))
```
The colData contains information on the samples
```{r}
colData(pe)
```
No information is stored yet on the design.
We have seen in the paper that they used both archaeological and modern samples. We can add this information into the colData based on the colnames of the data object.
In the following analysis, we will investigate whether there are any proteins differentially expressed between these groups.
```{r}
pe[["peptideRaw"]] %>% colnames
```
We update the colData with information on the design
```{r}
colData(pe)$group <- sapply(colnames(pe[["peptideRaw"]]), function(x) {
  case_when(
    grepl("CALC", x) ~ "ModCalc",
    grepl("GeoG", x) ~ "Tjærby",
    grepl("Plaq", x) ~ "ModPlaque"
  )
})
```
We explore the colData again
```{r}
colData(pe)
```
## Preprocessing
Before data analysis, the data need to be preprocessed.
Here we will do the following steps:
- Log-transform the data
- Remove contaminants
- Remove peptides with fewer than two intensity values
For this, we can make use of some functions present in the QFeatures package that make it easy to manipulate the entire dataset.
We calculate how many non-zero intensities we have for each peptide, as this is useful for filtering.
```{r}
rowData(pe[["peptideRaw"]])$nNonZero <- rowSums(assay(pe[["peptideRaw"]]) > 0)
```
Peptides with zero intensities are missing peptides and should be represented with an `NA` value rather than `0`.
```{r}
pe <- zeroIsNA(pe, "peptideRaw") # convert 0 to NA
```
1. *Log-transform the data with base 2*
```{r}
pe <- logTransform(pe, base = 2, i = "peptideRaw", name = "peptideLog")
```
Note that the peptide intensities are now to be interpreted on the log scale.
2. *Remove reverse sequences (decoys) and contaminants*
Reverse sequences were already filtered out in the step between the ionbot output and the FlashLFQ input.
We still need to remove the contaminants. As the contaminants database we will use the cRAP database:
contaminants.txt contains the UniProt IDs of the proteins in the cRAP database.
```{r}
contaminants <- read.delim("~/Data_part2/contaminants.txt")
```
```{r}
#We collapse the contaminants list into a single regular expression for pattern matching in R
contaminants_list <- paste0(contaminants$uniprotID, collapse = "|")
#We keep only the features whose protein group does not match any contaminant
pe <- filterFeatures(pe, ~ !grepl(contaminants_list, Protein.Groups))
```
3. *Drop peptides that were only identified in one sample*
We keep peptides that were observed at least twice.
```{r}
pe <- filterFeatures(pe, ~ nNonZero >= 2)
nrow(pe[["peptideLog"]])
```
## Visualisation
We use a multidimensional scaling (MDS) plot to see which samples group together. An MDS plot places the samples on a two-dimensional scatterplot such that the distances on the plot approximate the typical log2 fold changes between the samples.
```{r}
limma::plotMDS(assay(pe[["peptideLog"]], withDimnames = FALSE), col = (colData(pe)$group %>% as.factor %>% as.numeric()))
```
What can you see on this plot?
```{r fig1, fig.height = 6, fig.width = 6}
limma::plotDensities(assay(pe[["peptideLog"]]), legend = FALSE)
```
The intensity distributions of the different samples do not coincide. This is due to technical variability: differences between mass spec runs, machines, labs, etc.
## Normalization of the data by median centering
```{r}
pe <- normalize(pe,
  i = "peptideLog",
  name = "peptideNorm",
  method = "center.median")
```
Now visualise the effect of the normalisation. Why is the normalisation necessary?
```{r fig2, fig.height = 6, fig.width = 6}
limma::plotDensities(assay(pe[["peptideNorm"]]), legend = FALSE)
```
The intensity distributions need to be brought onto the same footing by shifting the intensity peaks to the same location, so that the different samples are comparable.
Upon normalisation, the marginal distributions of the peptide intensities across samples are much more comparable. We have done this by median centering: the sample median is subtracted from each intensity that was measured in that sample.
However, we still see some deviations. This could be a reason to try more sophisticated normalisation methods, but that is very dataset-dependent and challenging because of the missingness in the data.
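For intuition, here is a minimal sketch (our own illustration, not part of the pipeline) of what median centering does to the assay matrix:
```{r, eval=FALSE}
# Median centering by hand: subtract each sample's (column's) median
# log2 intensity from all intensities measured in that sample.
x <- assay(pe[["peptideLog"]])
sample_medians <- apply(x, 2, median, na.rm = TRUE)
x_centered <- sweep(x, 2, sample_medians, FUN = "-")
```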
## Summarisation to protein level
Now we want to summarise all peptide intensities from the same protein in a sample into a single protein expression value.
These protein expression values will be used in the differential abundance analysis later on.
However, there are some difficulties we have to take into account. For example, each peptide (even from the same protein) has different characteristics (length, mass, ...) and will thus fly differently through the mass spectrometer, resulting in strong differences in intensity between different peptides.
Indeed, peptides that do not fly well through the mass spectrometer will not be picked up as well as peptides that do, leading to more missingness for the former.
For all the different peptides belonging to the same protein, we now try to find one protein expression value. You could do this by taking the mean or median of all the peptide intensities, but that can introduce biases (e.g. because of missingness in some peptides).
To correct for this, we will instead use a linear model that estimates a sample mean while also correcting for the different peptide characteristics (see below).
How do we do that? By modelling the intensity $y_{ip}$ of sample $i$ and peptide $p$ with both a sample-specific effect and a peptide-specific effect, we obtain a sample average ($\beta_i^\text{samp}$) that is corrected for the peptide-specific effect ($\beta_p^\text{pep}$):
- Mean summarization
$$
y_{ip}=\beta_i^\text{samp} + \epsilon_{ip}
$$
- Model based summarization:
$$
y_{ip}=\beta_i^\text{samp} + \beta_p^\text{pep} + \epsilon_{ip}
$$
When we compare a mean summarisation method to our model based summarisation, we can clearly see that the model based method has an additional parameter $\beta_p^\text{pep}$, which corrects the sample average ($\beta_i^\text{samp}$) for the peptide specific effect.
We use the default summarisation method of aggregateFeatures, which is robust model-based summarisation.
In short, "robust" statistics try to make the model less sensitive to (minor) deviations in the data.
For more information on robust based summarisation, see the following paper:
- https://doi.org/10.1074/mcp.RA119.001624
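To make the model concrete, here is a toy sketch (our own illustration, not the pipeline code) of model-based summarisation for a single protein with three peptides measured in two samples, using a robust fit (this assumes the MASS package is installed; the toy intensities are made up):
```{r}
# Toy example: estimate per-sample protein summaries while correcting for
# peptide-specific effects, using robust regression (MASS::rlm).
toy <- data.frame(
  intensity = c(10.1, 10.4, 12.0, 12.3, 9.0, 9.2),
  sample    = factor(rep(c("s1", "s2"), 3)),
  peptide   = factor(rep(c("p1", "p2", "p3"), each = 2))
)
fit <- MASS::rlm(intensity ~ -1 + sample + peptide, data = toy)
# The sample coefficients are the protein-level summaries,
# corrected for the peptide effects:
coef(fit)[c("samples1", "samples2")]
```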
Some peptides do not have an associated protein; these need to be filtered out for the summarisation to work properly.
```{r}
pe <- filterFeatures(pe, ~Protein.Groups!="")
```
```{r,warning=FALSE}
pe <- aggregateFeatures(pe,
i = "peptideNorm",
fcol = "Protein.Groups",
na.rm = TRUE,
name = "protein")
```
```{r}
limma::plotMDS(assay(pe[["protein"]], withDimnames = FALSE), col = (colData(pe)$group %>% as.factor %>% as.numeric()))
```
What do you see on this MDS plot?
## Data Analysis
We model the protein level expression values using `msqrob`.
By default `msqrob2` estimates the model parameters using robust regression.
In short, "robust" statistics try to make the model less sensitive to (minor) deviations in the data.
If you want to know more, see: https://doi.org/10.1074/mcp.M115.055897
We will model the data with a different group mean for each of the groups (Tjærby, ModPlaque and ModCalc).
The group is encoded in the variable `group` of the colData.
We can specify this model by using a formula with the factor group as its predictor:
`formula = ~group`.
Note that a formula always starts with the symbol '~'.
We can now choose a reference group using the function relevel. If we choose "Tjærby" here, we will be able to compare the other group levels ("ModPlaque" and "ModCalc") directly to the "Tjærby" group (see also below).
```{r}
colData(pe)$group <- relevel(as.factor(colData(pe)$group), ref = "Tjærby")
```
```{r, warning=FALSE}
pe <- msqrob(object = pe, i = "protein", formula = ~group)
```
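If you want to compare with ordinary least squares estimation, the robust behaviour can be switched off via the `robust` argument of msqrob (a hedged sketch, assuming this argument as documented in msqrob2; the model column name "olsModels" is our own choice):
```{r, eval=FALSE}
# Hypothetical comparison run: the same model fitted without robust regression
peOLS <- msqrob(object = pe, i = "protein", formula = ~group,
                modelColumnName = "olsModels", robust = FALSE)
```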
### Inference
First, we extract the parameter names of the model by looking at one of the fitted models.
The models are stored in the rowData of the assay under the default name msqrobModels.
```{r}
getCoef(rowData(pe[["protein"]])$msqrobModels[[2]])
```
We can also explore the design of the model that we specified using the package `ExploreModelMatrix`:
```{r}
library(ExploreModelMatrix)
VisualizeDesign(colData(pe),~group)$plotlist[[1]]
```
In this example, this is quite simple because we have only one predictor, "group".
As chosen above, Tjærby is the reference class, so the mean log2 expression for samples from Tjærby is '(Intercept)'.
The mean log2 expression for samples from ModCalc is '(Intercept) + groupModCalc'.
Hence, the average log2 fold change between ModCalc and Tjærby is modelled by the parameter 'groupModCalc' ((Intercept + groupModCalc) - (Intercept) = groupModCalc).
Thus, we assess the contrast 'groupModCalc = 0' (our null hypothesis) with our statistical test.
In the same way, the average log2 fold change between ModPlaque and Tjærby is modelled by the parameter 'groupModPlaque'.
Moreover, the average log2 fold change between ModCalc and ModPlaque is modelled by 'groupModCalc - groupModPlaque'.
Let's first investigate whether there are proteins differentially expressed between the Tjærby group and the Modern Calculus group.
As explained above, to do this, we will test the null hypothesis: groupModCalc=0.
```{r}
#The first argument gives the contrast(s) we want to test; the second gives the parameter names present in the model
L <- makeContrast("groupModCalc=0", parameterNames = c("groupModCalc", "groupModPlaque"))
#statistical inference, multiple testing correction defaults to the Benjamini-Hochberg procedure
pe <- hypothesisTest(object = pe, i = "protein", contrast = L)
```
### Significance table
We can now assess the proteins that are found as differentially expressed.
```{r}
rowData(pe[["protein"]])$groupModCalc %>%
#We only look at the proteins that have an adjusted pvalue smaller than 0.05
filter(adjPval<0.05) %>%
head(20)
```
### Plots
#### Volcano-plot
```{r,warning=FALSE}
volcano <- ggplot(rowData(pe[["protein"]])$groupModCalc,
aes(x = logFC, y = -log10(pval), color = adjPval < 0.05)) +
geom_point(cex = 2.5) +
scale_color_manual(values = alpha(c("black", "red"), 0.5)) + theme_minimal()
volcano
```
Note that `r sum(rowData(pe[["protein"]])$groupModCalc$adjPval < 0.05, na.rm = TRUE)` proteins are found to be differentially abundant.
#### Heatmap
We first select the names of the proteins that were declared significant.
```{r, fig.height=4, fig.width=10}
sigNames <- rowData(pe[["protein"]])$groupModCalc %>%
  rownames_to_column("protein") %>%
  filter(adjPval < 0.05) %>%
  head(100) %>%
  pull(protein)
heatmap(MSnbase::filterNA(assay(pe[["protein"]])[sigNames, ]), cexCol = 0.7, cexRow = 0.7)
```
### Exercise: inference
If you have completed all of the above and want an extra challenge, try to find differential proteins between ModPlaque and Tjærby, and between ModPlaque and ModCalc.
# Acknowledgements
Parts of this workshop were inspired by the QFeatures tutorial by Laurent Gatto (http://www.bioconductor.org/packages/release/bioc/vignettes/QFeatures/inst/doc/Processing.html) and by the proteomics data analysis workshops by Lieven Clement (https://statomics.github.io/PDA/).
Special thanks to Kevin Velghe for setting up the Binder for this workshop.