-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathXCMS3_Preprocessing_bacterial_samples.Rmd
286 lines (222 loc) · 10.8 KB
/
XCMS3_Preprocessing_bacterial_samples.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
## Preprocess your data using XCMS3 and export data files for feature-based molecular networking through GNPS
To follow this example tutorial, download the *MSV000079204* data set from: <br>
https://gnps.ucsd.edu/ProteoSAFe/result.jsp?task=d74ca92d9dec4e2883f28506c670e3ca&view=advanced_view
Note that the settings for `xcms` used in this tutorial were not optimized,
specifically the alignment based on the default *obiwarp* parameters might
perform a little to strong retention time adjustment.
For more information on optimization of the parameters see the [xcms vignette](https://bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html)
or the [preprocessing-untargeted-metabolomics](https://github.com/Bioconductor/CSAMA/tree/2019/lab/4-thursday/lab-05-metabolomics) workshop.
Load required libraries and utility functions for GNPS export.
```{r, message = FALSE}
library(xcms)
source("https://raw.githubusercontent.com/jorainer/xcms-gnps-tools/master/customFunctions.R")
```
Use socket based parallel processing on Windows systems. The number (`4`)
defines the number of parallel tasks. Adapt this setting to the number of CPUs
available on your system. Also note that it is usually better to not use all
CPUs of a system as a) during the analysis the MS data has to be imported from
the original mzML/mzXML/CDF files and it will thus be limited by the I/O of the
hard disks and b) the computer needs to have enough memory to load the complete
MS data of as many raw data files than there are parallel jobs.
```{r}
if (.Platform$OS.type == "unix") {
register(bpstart(MulticoreParam(4)))
} else {
register(bpstart(SnowParam(4)))
}
register(SerialParam())
```
### Load data
Load all .mzXML files and define sample grouping. Sample grouping should be
adjusted to the actual experimental setup. For the present example analysis we
put all files into the same sample group.
```{r}
mzXMLfiles <- paste0('MSV000079204/',
list.files('MSV000079204/', pattern = ".mzXML$",
recursive = TRUE))
s_groups <- rep("sample", length(mzXMLfiles))
pheno <- data.frame(sample_name = basename(mzXMLfiles),
sample_group = s_groups, stringsAsFactors = FALSE)
```
```{r}
head(pheno)
```
Read all raw data, including MS2 level.
```{r}
rawData <- readMSData(mzXMLfiles, centroided. = TRUE, mode = "onDisk",
pdata = new("NAnnotatedDataFrame", pheno))
```
Create a base peak chromatogram (BPC) of your data for visual inspection.
```{r, fig.width = 12, fig.height - 6, fig.cap = "Base peak chromatogram"}
bpis <- chromatogram(rawData, aggregationFun = "max")
plot(bpis)
```
### Peak picking
Define settings for the centWave peak detection.
```{r}
cwp <- CentWaveParam(snthresh = 3, noise = 5000,
peakwidth = c(5, 30), ppm = 10)
```
Perform the chromatographic peak detection.
```{r, message = FALSE, warning = FALSE}
processedData <- findChromPeaks(rawData, param = cwp)
```
Get an overview of the detected peaks, using a heatmap which represents the
number of peaks detected for each file along the retention time range.
```{r, fig.width = 10, fig.height = 6}
plotChromPeakImage(processedData, binSize = 10)
```
### Retention time alignment
We skip the retention time adjustment, because the different files have
considerable differences in retention time ranges (ranging from 300 to 5000
seconds).
### Peak grouping
Define the parameters for the *peak density*-based peak grouping (correspondence
analysis).
```{r, message = FALSE, warning = FALSE}
pdp <- PeakDensityParam(sampleGroups = processedData$sample_group,
minFraction = 0.10)
processedData <- groupChromPeaks(processedData, param = pdp)
```
### Gap filling
Fill-in missing peaks. Peak detection might have failed for some features in
some samples. The `fillChromPeaks` function allows to integrate for such cases
all signal in the respective m/z - retention time range. Below we first define
the median width of identified chromatographic peaks in retention time dimension
and use this as parameter `fixedRt` for the `fillChromPeaks`.
```{r, message = FALSE, warning = FALSE}
medWidth <- median(chromPeaks(processedData)[, "rtmax"] -
chromPeaks(processedData)[, "rtmin"])
processed_Data <- fillChromPeaks(processedData,
param = FillChromPeaksParam(fixedRt = medWidth))
```
### Export data
#### export MS1 and MS2 features
Below we use the `featureSpectra` function to extract all MS2 spectra with their
precursor m/z being within the m/z range of a feature/peak and their retention
time within the rt range of the same feature/peak. Note that for older `xcms`
versions (i.e. before version 3.12) `return.type = "Spectra"` has to be used
instead of `return.type = "MSpectra"` as in the example below. Zero-intensity
values are removed from each spectrum with the `clean` function, and
subsequently processed into the expected format using the `formatSpectraForGNPS`
function.
```{r}
## export the individual spectra into a .mgf file
filteredMs2Spectra <- featureSpectra(processedData, return.type = "MSpectra")
filteredMs2Spectra <- clean(filteredMs2Spectra, all = TRUE)
filteredMs2Spectra <- formatSpectraForGNPS(filteredMs2Spectra)
```
The extracted MS2 spectra are saved as *ms2spectra_all.mgf* file. This file can
for example be used to do *in silico* structure prediction through
[SIRIUS+CSI:FingerID](https://bio.informatik.uni-jena.de/software/sirius/).
```{r}
writeMgfData(filteredMs2Spectra, "ms2spectra_all.mgf")
```
Export peak area quantification table. To this end we first extract the *feature
definitions* (i.e. the m/z and retention time ranges and other metadata for all
defined features in the data set) and then the integrated peak areas (with the
`featureValues` function). This peak area quantification table contains features
and respective per sample peak areas in columns. The combined data is then saved
to the file *xcms_all.txt*. Note that it is now also possible to use the entire
feature table in the FBMN workflow.
```{r}
## get data
featuresDef <- featureDefinitions(processedData)
featuresIntensities <- featureValues(processedData, value = "into")
## generate data table
dataTable <- merge(featuresDef, featuresIntensities, by=0, all=TRUE)
dataTable <- dataTable[, !(names(dataTable) %in% c("peakidx"))]
```
```{r}
head(dataTable)
```
```{r}
write.table(dataTable, "xcms_all.txt", sep = "\t", quote = FALSE, row.names = FALSE)
```
#### export MS2 features only
The `filteredMs2Spectra` contains all MS2 spectra with their precursor m/z
within the feature's m/z range and a retention time that is within the retention
time of the chromatographic peak/feature. We thus have multiple MS2 spectra for
each feature (also from each sample). Metadata column `"feature_id"` indicates
to which feature a MS2 spectrum belongs:
```{r}
filteredMs2Spectra
```
We next select a single MS2 spectrum for each feature and export this reduced
set also as an .mgf file. We use the `combineSpectra` function on the list of
spectra and specify with `fcol = "feature_id"` how the spectra are grouped
(i.e. all spectra with the same feature id are processed together). On the set
of spectra of the same feature we apply the `maxTic` function that simply
returns the spectrum with the largest sum of intensities. We thus select with
the code below the spectrum with the largest total signal as the
*representative* MS2 spectrum for each feature.
```{r}
## Select for each feature the Spectrum2 with the largest TIC.
filteredMs2Spectra_maxTic <- combineSpectra(filteredMs2Spectra,
fcol = "feature_id",
method = maxTic)
```
Next we export the data to a file which can then be submitted to GNPS [feature-based
molecular
networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/).
```{r}
writeMgfData(filteredMs2Spectra_maxTic, "ms2spectra_maxTic.mgf")
```
At last we subset the peak area quantification table to features for which we
have also an MS2 spectrum and export this to the *xcms_onlyMS2.txt* file. This
file can be submitted to GNPS [feature-based molecular
networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/):
```{r}
## filter data table to contain only peaks with MSMS DF[ , !(names(DF) %in% drops)]
filteredDataTable <- dataTable[which(
dataTable$Row.names %in% filteredMs2Spectra@elementMetadata$feature_id),]
```
```{r}
head(filteredDataTable)
```
```{r}
write.table(filteredDataTable, "xcms_onlyMS2.txt", sep = "\t", quote = FALSE, row.names = FALSE)
```
#### Export MS2 consensus spectra
Alternatively, instead of selecting the spectrum with the largest total signal
as representative MS2 spectrum for each feature, we can create a *consensus MS2
spectrum*. A consensus MS2 spectrum can for example be created by combining all
MS2 spectra for a feature into a single spectrum that contains peaks present in
the majority of spectra. Note however that this feature is experimental at
present.
To this end we can use the `consensusSpectrum` function in combination with the
`combineSpectra` function. The parameter `minProp` defines the mimimal
proportion of spectra in which a peak has to be present in order for it to be
added to the consensus spectrum (0.8 -> 80% of spectra). The parameters `mzd`
and `ppm` allow to define how to group peaks between spectra with `mzd` being a
fixed, constant value and all peaks between spectra with a difference in their
m/z < `mzd` are combined into the final mass peak in the consensus
spectrum. Finally, the parameter `ppm` allows to perform an m/z dependent
grouping of mass peaks, i.e. mass peaks with a difference in their m/z smaller
than `ppm` are combined.
For more details see the documentation of the
[consensusSpectrum](https://rdrr.io/bioc/MSnbase/man/consensusSpectrum.html)
function in the MSnbase R package.
```{r, message = FALSE, warning = FALSE}
filteredMs2Spectra_consensus <- combineSpectra(
filteredMs2Spectra, fcol = "feature_id", method = consensusSpectrum,
mzd = 0, minProp = 0.8, ppm = 10)
writeMgfData(filteredMs2Spectra_consensus, "ms2spectra_consensus_bacterial.mgf")
```
Analogously we subset the peak area quantification table to features for which
we have an MS2 consensus spectrum and export this to the *xcms_consensusMS2.txt*
file. This file can be submitted to GNPS [feature-based molecular
networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/):
```{r}
consensusDataTable <- dataTable[which(dataTable$Row.names %in%
filteredMs2Spectra_consensus@elementMetadata$feature_id),]
head(consensusDataTable)
```
```{r}
write.table(consensusDataTable, "xcms_consensusMS2_bacterial.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
```
### Session information
```{r}
sessionInfo()
```