-
Notifications
You must be signed in to change notification settings - Fork 2
/
publication--05--16s_human_microbiome.Rmd
291 lines (224 loc) · 12.5 KB
/
publication--05--16s_human_microbiome.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
---
title: "Human microbiome example"
bibliography: bibliography.bibtex
---
```{r init, echo = FALSE, message = FALSE}
library(knitr)
if (! is.null(current_input())) { # if knitr is being used
knitr::read_chunk("settings.R")
} else {
source("settings.R")
}
```
```{r rendering_settings, include=FALSE}
```
## Requirements
**NOTE:** This analysis requires at least 10Gb of RAM to run.
It uses large files not included in the repository and many steps can take a few minutes to run.
## Parameters
### Analysis input/output
```{r io_settings}
```
### Analysis parameters
```{r hmp_parameters}
matrix_plot_depth <- 6 # The maximum number of ranks to display
color_interval <- c(-4, 4) # The range of values (log 2 ratio of median proportion) to display
min_read_count <- 100 # The minium number of reads needed for a taxon to be graphed
otu_file_url = "http://downloads.hmpdacc.org/data/HMQCP/otu_table_psn_v35.txt.gz"
mapping_file_url = "http://downloads.hmpdacc.org/data/HMQCP/v35_map_uniquebyPSN.txt.bz2"
```
## Downloading the data
This analysis used OTU abundance data from the Human Microbiome Project (HMP) [@human2012structure], which used 16S metabarcoding to explore the bacterial microbiome of various human body parts.
First we will download the files from the HMP website.
```{r hmp_download, warning=FALSE, message=FALSE, cache = TRUE}
download_file <- function(path) {
temp_file_path <- file.path(tempdir(), basename(path))
utils::download.file(url = path, destfile = temp_file_path, quiet = TRUE)
path <- temp_file_path
return(path)
}
otu_file <- download_file(otu_file_url)
mapping_file <- download_file(mapping_file_url)
```
## Parsing the data
Then we will read these files into R and tidy up the data a bit.
We will use the `readr` package to read these `gloss$add("Tab-delimited text file", shown = "TSV")` files.
```{r hmp_parse_raw_otu, cache = TRUE}
library(readr)
otu_data <- read_tsv(otu_file, skip = 1) # Skip the first line, which is a comment
colnames(otu_data)[1] <- "otu_id" # make column name more R friendly
print(otu_data)
```
The last column "Consensus Lineage" contains the `gloss$add("taxonomic classification")`, so we will use that to convert the rest of the data into a `taxmap` object format.
```{r}
library(metacoder)
```
Since this is tabular data with an embedded taxonomic classification, we use `parse_tax_data`:
```{r hmp_parse_otu_table, cache = TRUE}
hmp_data <- parse_tax_data(otu_data,
class_cols = "Consensus Lineage",
class_regex = "([a-z]{0,1})_{0,2}(.*)$",
class_key = c(hmp_rank = "taxon_rank", hmp_tax = "taxon_name"),
class_sep = ";")
hmp_data$data$class_data <- NULL # Delete uneeded regex match table
names(hmp_data$data) <- "otu_count" # Rename abundnace matrix to something more understandable
print(hmp_data)
```
I will save the information on sample characteristics (e.g. body site) in a seperate table, since it is not directly assocaited with taxonomic information.
```{r hmp_sample_data, cache = TRUE}
sample_data <- read_tsv(mapping_file,
col_types = "ccccccccc") # Force all columns to be character
colnames(sample_data) <- c("sample_id", "rsid", "visit_no", "sex", "run_center",
"body_site", "mislabeled", "contaminated", "description")
print(sample_data)
```
In this analysis, I will be looking at just a subset of the body sites, so I will subset the sample data to just those.
```{r}
library(dplyr)
```
```{r}
sites_to_compare <- c("Saliva", "Tongue_dorsum", "Buccal_mucosa", "Anterior_nares", "Stool")
sample_data <- filter(sample_data, body_site %in% sites_to_compare)
```
Not all the samples in the sample data table appear in the abundance matrix and there are some in the abundance matrix that dont appear in the sample data and they are in a different order.
Making the two correspond will probably make things easier later on.
First, lets identify which columns they both share:
```{r, cache = TRUE}
sample_names <- intersect(colnames(hmp_data$data$otu_count), sample_data$sample_id)
```
We can then use `match` to order and subset the sample data based on the shared samples.
```{r, cache = TRUE}
sample_data <- sample_data[match(sample_names, sample_data$sample_id), ]
```
We can also subset the columns in the abundance matrix to just these shared samples.
We also need to preserve the first two columns, the otu and taxon IDs.
```{r, cache = TRUE}
hmp_data$data$otu_count <- hmp_data$data$otu_count[ , c("taxon_id", "otu_id", sample_names)]
```
## Filtering the taxonomy
Looking at the names of the taxa in the taxonomy, there are a lot of names that appear to be codes for uncharacterized taxa:
```{r, cache = TRUE}
head(taxon_names(hmp_data), n = 20)
```
Those might be relevant depending on the research question, but for this analysis I choose to remove them, since they do not mean much to me (you might want to keep such taxa in your ananlysis).
We can remove taxa from a `taxmap` object using the `filter_taxa` function.
I will remove them by removing any taxon whose name is not composed of only letters.
```{r, cache = TRUE}
hmp_data <- filter_taxa(hmp_data, grepl(taxon_names, pattern = "^[a-zA-Z]+$"))
print(hmp_data)
```
Note that we have not removed any OTUs even though there were OTUs assigned to those taxa.
Those OTUs were automatically reassigned to a `gloss$add("supertaxon")` that was not filtered out.
This also removed taxa with no name (`""`), since the `+` in the above regex means "one or more".
## Converting counts to proportions
We can now convert these counts to proportions and a simple alternative to rarefaction using `calc_obs_props`.
This accounts for uneven numbers of sequences for each sample.
```{r hmp_calc_props, cache = TRUE}
hmp_data$data$otu_prop <- calc_obs_props(hmp_data,
data = "otu_count",
cols = sample_data$sample_id)
```
## Calculating abundance per taxon
The input data included read abundance for each sample-OTU combination, but we need the abundances associated with each taxon for graphing.
There will usually be multiple OTUs assigned to the same taxon, especailly for corase taxonomic ranks (e.g. the root will have all OTU indexes), so the abundances at those indexes are are summed to provide the total abundance for each taxon using the `calc_taxon_abund` function.
```{r hmp_calc_tax_abund, cache = TRUE}
hmp_data$data$tax_prop <- calc_taxon_abund(hmp_data,
data = "otu_prop",
cols = sample_data$sample_id)
```
## Plot of everything
To get an idea of how the data looks overall lets make a plot showing OTU and read abundance of all the data combined.
I will exclude any taxon that has less than `r min_read_count` reads.
```{r hmp_plot_all, cache = TRUE}
set.seed(1)
plot_all <- hmp_data %>%
mutate_obs("tax_prop", abundance = rowMeans(hmp_data$data$tax_prop[sample_data$sample_id])) %>%
filter_taxa(abundance >= 0.001) %>%
filter_taxa(taxon_names != "") %>% # Some taxonomic levels are not named
heat_tree(node_size = n_obs,
node_size_range = c(0.01, 0.06),
node_size_axis_label = "Number of OTUs",
node_color = abundance,
node_color_axis_label = "Mean proportion of reads",
node_label = taxon_names,
output_file = result_path("hmp--all_data"))
print(plot_all)
```
Here we can see that there are realtively few abundant taxa and many rare ones.
However, we dont know anything about the species level diversity since the classifications go to genus only.
## Comparing taxon abundance amoung treatments
Assuming that differences in read depth correlates with differences in taxon abundance (a controversial assumption), we can compare taxon abundance amoung treatments using the `compare_groups` function.
This function assumes you have multiple samples per treatment (i.e. group).
It splits the counts up based on which group the samples came from and, for each row of the data (each taxon in this case), for each pair of groups (e.g. Nose vs Saliva samples), it applies a function to generate statistics summerizing the differences in abundance.
You can create a custom function to return your own set of statistics, but the default is to do a `r gloss$add("Wilcoxon Rank Sum test")` on the differences in median abundance for the samples.
See `?compare_groups` for more details.
```{r hmp_treat_comp, cache = TRUE}
hmp_data$data$diff_table <- compare_groups(hmp_data,
data = "tax_prop",
cols = sample_data$sample_id,
groups = sample_data$body_site)
```
We just did **a lot** (`r nrow(hmp_data$data$diff_table)`) of independent statistical tests, which means we probably got many false positives if we consider p < 0.05 to be significant.
To fix this we need to do a `r gloss$add('Multiple comparison corrections', shown = 'correction for multiple comparisions')` to adjust the p-values.
We will also set any differences that are not significant after the correction to `0`, so that they do not show up when plotting.
```{r}
hmp_data <- mutate_obs(hmp_data, "diff_table",
wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"),
log2_median_ratio = ifelse(wilcox_p_value < 0.05 | is.na(wilcox_p_value), log2_median_ratio, 0))
```
Now we can make a what we call a "differential heat tree matrix", to plot the results of these tests.
```{r}
hmp_data %>%
mutate_obs("tax_prop", abundance = rowMeans(hmp_data$data$tax_prop[sample_data$sample_id])) %>%
filter_taxa(abundance >= 0.001, reassign_obs = c(diff_table = FALSE)) %>%
heat_tree_matrix(data = "diff_table",
node_size = n_obs,
node_size_range = c(0.01, 0.05),
node_label = taxon_names,
node_color = log2_median_ratio,
node_color_range = diverging_palette(),
node_color_trans = "linear",
node_color_interval = c(-7, 7),
edge_color_interval = c(-7, 7),
node_size_axis_label = "Number of OTUs",
node_color_axis_label = "Log2 ratio median proportions",
# initial_layout = "re",
# layout = "da",
key_size = 0.7,
seed = 4,
output_file = result_path("figure_3--hmp_matrix_plot"))
```
## Plot body site differences
The HMP dataset is great for comparing treatments since there are many body sites with many replicates so statistical tests can be used to find correlations between body sites and taxon read abundance (the relationship between read abundance and organism abundance is more fuzzy and open to debate).
The code below applies the Wilcox rank-sum test to differences in median read proportion between every pair of body sties compared.
Since the data is compositional in nature (i.e. not idependent samples) we used a non-parametric test and used median instead of mean read proportion.
```{r hmp_diff_funcs}
plot_body_site_diff <- function(site_1, site_2, output_name, seed = 1) {
set.seed(seed)
hmp_data %>%
mutate_obs("tax_prop", abundance = rowMeans(hmp_data$data$tax_prop[sample_data$sample_id])) %>%
filter_taxa(abundance >= 0.001, reassign_obs = FALSE) %>%
filter_taxa(taxon_names != "", reassign_obs = FALSE) %>% # Some taxonomic levels are not named
filter_obs("diff_table", treatment_1 %in% c(site_1, site_2), treatment_2 %in% c(site_1, site_2)) %>%
heat_tree(node_size_axis_label = "Number of OTUs",
node_size = n_obs,
node_color_axis_label = "Log 2 ratio of median proportions",
node_color = log2_median_ratio,
node_color_range = diverging_palette(),
node_color_trans = "linear",
node_color_interval = c(-5, 5),
edge_color_interval = c(-5, 5),
node_label = taxon_names,
output_file = result_path(paste0(output_name, "--", site_1, "_vs_", site_2)))
}
```
```{r hmp_diff_plot}
plot_body_site_diff("Saliva", "Anterior_nares", "hmp--v35")
plot_body_site_diff("Buccal_mucosa", "Anterior_nares", "hmp--v35")
```
We call these types of graphs **differential heat trees** and they are great for comparing any type of data associated with two samples or treatments.
## Software and packages used
```{r}
sessionInfo()
```
## References