XCMS3_Preprocessing_bacterial_samples.Rmd

## Preprocess your data using XCMS3 and export data files for feature-based molecular networking through GNPS

To follow this example tutorial, download the *MSV000079204* data set from: <br>
https://gnps.ucsd.edu/ProteoSAFe/result.jsp?task=d74ca92d9dec4e2883f28506c670e3ca&view=advanced_view

Note that the settings for `xcms` used in this tutorial were not optimized,
specifically the alignment based on the default *obiwarp* parameters might
perform a little to strong retention time adjustment.
For more information on optimization of the parameters see the [xcms vignette](https://bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html)
or the [preprocessing-untargeted-metabolomics](https://github.com/Bioconductor/CSAMA/tree/2019/lab/4-thursday/lab-05-metabolomics) workshop.

Load required libraries and utility functions for GNPS export.

```{r, message = FALSE}
library(xcms)
source("https://raw.githubusercontent.com/jorainer/xcms-gnps-tools/master/customFunctions.R")
```

Use socket based parallel processing on Windows systems. The number (`4`)
defines the number of parallel tasks. Adapt this setting to the number of CPUs
available on your system. Also note that it is usually better to not use all
CPUs of a system as a) during the analysis the MS data has to be imported from
the original mzML/mzXML/CDF files and it will thus be limited by the I/O of the
hard disks and b) the computer needs to have enough memory to load the complete
MS data of as many raw data files than there are parallel jobs.

```{r}
if (.Platform$OS.type == "unix") {
  register(bpstart(MulticoreParam(4)))
} else {
  register(bpstart(SnowParam(4)))
}
register(SerialParam())
```

### Load data

Load all .mzXML files and define sample grouping. Sample grouping should be
adjusted to the actual experimental setup. For the present example analysis we
put all files into the same sample group.

```{r}
mzXMLfiles <- paste0('MSV000079204/',
                     list.files('MSV000079204/', pattern = ".mzXML$",
                                recursive = TRUE))
s_groups <- rep("sample", length(mzXMLfiles))
pheno <- data.frame(sample_name = basename(mzXMLfiles), 
                    sample_group = s_groups, stringsAsFactors = FALSE)
```

```{r}
head(pheno)
```

Read all raw data, including MS2 level.

```{r}
rawData <- readMSData(mzXMLfiles, centroided. = TRUE, mode = "onDisk",
                      pdata = new("NAnnotatedDataFrame", pheno))
```

Create a base peak chromatogram (BPC) of your data for visual inspection.

```{r, fig.width = 12, fig.height - 6, fig.cap = "Base peak chromatogram"}
bpis <- chromatogram(rawData, aggregationFun = "max")
plot(bpis)
```

### Peak picking

Define settings for the centWave peak detection.

```{r}
cwp <- CentWaveParam(snthresh = 3, noise = 5000,
                     peakwidth = c(5, 30), ppm = 10)
```

Perform the chromatographic peak detection.

```{r, message = FALSE, warning = FALSE}
processedData <- findChromPeaks(rawData, param = cwp)
```

Get an overview of the detected peaks, using a heatmap which represents the
number of peaks detected for each file along the retention time range.

```{r, fig.width = 10, fig.height = 6}
plotChromPeakImage(processedData, binSize = 10) 
```

### Retention time alignment

We skip the retention time adjustment, because the different files have
considerable differences in retention time ranges (ranging from 300 to 5000
seconds).


### Peak grouping

Define the parameters for the *peak density*-based peak grouping (correspondence
analysis).

```{r, message = FALSE, warning = FALSE}
pdp <- PeakDensityParam(sampleGroups = processedData$sample_group,
                        minFraction = 0.10)
processedData <- groupChromPeaks(processedData, param = pdp) 
```

### Gap filling

Fill-in missing peaks. Peak detection might have failed for some features in
some samples. The `fillChromPeaks` function allows to integrate for such cases
all signal in the respective m/z - retention time range. Below we first define
the median width of identified chromatographic peaks in retention time dimension
and use this as parameter `fixedRt` for the `fillChromPeaks`.

```{r, message = FALSE, warning = FALSE}
medWidth <- median(chromPeaks(processedData)[, "rtmax"] -
                   chromPeaks(processedData)[, "rtmin"])
processed_Data <- fillChromPeaks(processedData,
                                 param = FillChromPeaksParam(fixedRt = medWidth))
```

### Export data

#### export MS1 and MS2 features

Below we use the `featureSpectra` function to extract all MS2 spectra with their
precursor m/z being within the m/z range of a feature/peak and their retention
time within the rt range of the same feature/peak. Note that for older `xcms`
versions (i.e. before version 3.12) `return.type = "Spectra"` has to be used
instead of `return.type = "MSpectra"` as in the example below. Zero-intensity
values are removed from each spectrum with the `clean` function, and
subsequently processed into the expected format using the `formatSpectraForGNPS`
function.

```{r}
## export the individual spectra into a .mgf file
filteredMs2Spectra <- featureSpectra(processedData, return.type = "MSpectra")
filteredMs2Spectra <- clean(filteredMs2Spectra, all = TRUE)
filteredMs2Spectra <- formatSpectraForGNPS(filteredMs2Spectra)
```

The extracted MS2 spectra are saved as *ms2spectra_all.mgf* file. This file can
for example be used to do *in silico* structure prediction through
[SIRIUS+CSI:FingerID](https://bio.informatik.uni-jena.de/software/sirius/).

```{r}
writeMgfData(filteredMs2Spectra, "ms2spectra_all.mgf")
```

Export peak area quantification table. To this end we first extract the *feature
definitions* (i.e. the m/z and retention time ranges and other metadata for all
defined features in the data set) and then the integrated peak areas (with the
`featureValues` function). This peak area quantification table contains features
and respective per sample peak areas in columns. The combined data is then saved
to the file *xcms_all.txt*. Note that it is now also possible to use the entire
feature table in the FBMN workflow.

```{r}
## get data
featuresDef <- featureDefinitions(processedData)
featuresIntensities <- featureValues(processedData, value = "into")

## generate data table
dataTable <- merge(featuresDef, featuresIntensities, by=0, all=TRUE)
dataTable <- dataTable[, !(names(dataTable) %in% c("peakidx"))]
```

```{r}
head(dataTable)
```

```{r}
write.table(dataTable, "xcms_all.txt", sep = "\t", quote = FALSE, row.names = FALSE)
```

#### export MS2 features only

The `filteredMs2Spectra` contains all MS2 spectra with their precursor m/z
within the feature's m/z range and a retention time that is within the retention
time of the chromatographic peak/feature. We thus have multiple MS2 spectra for
each feature (also from each sample). Metadata column `"feature_id"` indicates
to which feature a MS2 spectrum belongs:

```{r}
filteredMs2Spectra
```

We next select a single MS2 spectrum for each feature and export this reduced
set also as an .mgf file. We use the `combineSpectra` function on the list of
spectra and specify with `fcol = "feature_id"` how the spectra are grouped
(i.e. all spectra with the same feature id are processed together). On the set
of spectra of the same feature we apply the `maxTic` function that simply
returns the spectrum with the largest sum of intensities. We thus select with
the code below the spectrum with the largest total signal as the
*representative* MS2 spectrum for each feature.


```{r}
## Select for each feature the Spectrum2 with the largest TIC.
filteredMs2Spectra_maxTic <- combineSpectra(filteredMs2Spectra,
                                            fcol = "feature_id",
                                            method = maxTic)
```

Next we export the data to a file which can then be submitted to GNPS [feature-based
molecular
networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/).

```{r}
writeMgfData(filteredMs2Spectra_maxTic, "ms2spectra_maxTic.mgf")
```

At last we subset the peak area quantification table to features for which we
have also an MS2 spectrum and export this to the *xcms_onlyMS2.txt* file. This
file can be submitted to GNPS [feature-based molecular
networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/):

```{r}
## filter data table to contain only peaks with MSMS DF[ , !(names(DF) %in% drops)]
filteredDataTable <- dataTable[which(
    dataTable$Row.names %in% filteredMs2Spectra@elementMetadata$feature_id),]
```

```{r}
head(filteredDataTable)
```

```{r}
write.table(filteredDataTable, "xcms_onlyMS2.txt", sep = "\t", quote = FALSE, row.names = FALSE)
```

#### Export MS2 consensus spectra

Alternatively, instead of selecting the spectrum with the largest total signal
as representative MS2 spectrum for each feature, we can create a *consensus MS2
spectrum*. A consensus MS2 spectrum can for example be created by combining all
MS2 spectra for a feature into a single spectrum that contains peaks present in
the majority of spectra. Note however that this feature is experimental at
present.

To this end we can use the `consensusSpectrum` function in combination with the
`combineSpectra` function. The parameter `minProp` defines the mimimal
proportion of spectra in which a peak has to be present in order for it to be
added to the consensus spectrum (0.8 -> 80% of spectra). The parameters `mzd`
and `ppm` allow to define how to group peaks between spectra with `mzd` being a
fixed, constant value and all peaks between spectra with a difference in their
m/z < `mzd` are combined into the final mass peak in the consensus
spectrum. Finally, the parameter `ppm` allows to perform an m/z dependent
grouping of mass peaks, i.e. mass peaks with a difference in their m/z smaller
than `ppm` are combined.

For more details see the documentation of the
[consensusSpectrum](https://rdrr.io/bioc/MSnbase/man/consensusSpectrum.html)
function in the MSnbase R package.

```{r, message = FALSE, warning = FALSE}
filteredMs2Spectra_consensus <- combineSpectra(
    filteredMs2Spectra, fcol = "feature_id", method = consensusSpectrum,
    mzd = 0, minProp = 0.8, ppm = 10)

writeMgfData(filteredMs2Spectra_consensus, "ms2spectra_consensus_bacterial.mgf")
```

Analogously we subset the peak area quantification table to features for which
we have an MS2 consensus spectrum and export this to the *xcms_consensusMS2.txt*
file. This file can be submitted to GNPS [feature-based molecular
networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/):

```{r}
consensusDataTable <- dataTable[which(dataTable$Row.names %in%
                                      filteredMs2Spectra_consensus@elementMetadata$feature_id),]
head(consensusDataTable)
```

```{r}
write.table(consensusDataTable, "xcms_consensusMS2_bacterial.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)
```

### Session information

```{r}
sessionInfo()
```