-
Notifications
You must be signed in to change notification settings - Fork 2
/
publication--02--rdp.Rmd
210 lines (165 loc) · 8.13 KB
/
publication--02--rdp.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
---
title: Digital PCR with the RDP database
bibliography: bibliography.bibtex
---
```{r init, echo = FALSE, message = FALSE}
library(knitr)
if (! is.null(current_input())) { # if knitr is being used
knitr::read_chunk("settings.R")
} else {
source("settings.R")
}
```
```{r rendering_settings, include=FALSE}
```
## Requirements
**NOTE:** This analysis requires at least 10Gb of RAM to run.
It uses large files not included in the repository and many steps can take a few minutes to run.
## Parameters
### Analysis input/output
```{r io_settings}
```
### Analysis parameters
These settings are shared between the RDP, SILVA, and Greengenes analyses, since the results of those three analyeses are combined in one plot later.
This ensures that all of the graphs use the same color and size scales, instead of optimizing them for each data set, as would be done automatically otherwise.
```{r database_comparison_settings}
```
## Parse database
The code below parses and subsets the entire RDP non-redundant reference database [@maidak2001rdp].
The input file is not included in this repository because it is 3.6GB, but you can download it from the RDP website here:
https://rdp.cme.msu.edu/misc/resources.jsp
```{r}
library(metacoder)
```
```{r rdp_load, cache = TRUE}
rdp_seq <- ape::read.FASTA(file.path(input_folder, "rdp_current_Bacteria_unaligned.fa"))
rdp <- extract_tax_data(names(rdp_seq),
regex = "\\tLineage(.*)",
key = c("class"),
class_regex = "[;=](.+?);(.+?)",
class_key = c("taxon_name", rdp_rank = "taxon_rank"))
# rdp_seq <- unlist(lapply(as.character(rdp_seq), paste0, collapse = ""))
rdp$data$sequence <- setNames(rdp_seq, rdp$data$tax_data$taxon_id)
rdp$data$class_data <- NULL
print(rdp)
```
## Subset
Next I will subset the taxa in the dataset (depending on parameter settings).
This can help make the graphs less cluttered and make it easier to compare databases.
```{r rdp_subset, cache = TRUE}
if (! is.null(min_seq_count)) {
rdp <- filter_taxa(rdp, n_obs >= min_seq_count)
}
if (just_bacteria) {
rdp <- filter_taxa(rdp, taxon_names == "Bacteria", subtaxa = TRUE)
}
if (! is.null(max_taxonomy_depth)) {
rdp <- filter_taxa(rdp, n_supertaxa <= max_taxonomy_depth)
}
print(rdp)
```
## Remove chloroplast sequences
These are not bacterial and will bias the digital PCR results.
Note that the `invert` option makes it so taxa are included that *did not* pass the filter.
This is different than simply using `name != "Chloroplast", subtaxa = TRUE` since the effects of `invert` are applied after those of `subtaxa`.
```{r rdp_rm_chloro, cache = TRUE}
rdp <- filter_taxa(rdp, taxon_names == "Cyanobacteria/Chloroplast", subtaxa = TRUE, invert = TRUE)
print(rdp)
```
## Plot whole database
Although graphing everything can be a bit overwhelming (`r nrow(rdp$taxon_data)` taxa), it gives an intuitive feel for the complexity of the database:
```{r rdp_plot_all, cache = TRUE}
rdp_plot_all <- heat_tree(rdp,
node_size = n_obs,
node_color = n_obs,
node_size_range = size_range * 2,
edge_size_range = size_range,
node_size_interval = all_size_interval,
edge_size_interval = all_size_interval,
node_color_interval = all_size_interval,
edge_color_interval = all_size_interval,
node_label = taxon_names,
node_label_size_range = label_size_range,
node_label_max = label_max,
node_color_axis_label = "Sequence count",
output_file = result_path("rdp--all"))
print(rdp_plot_all)
```
## PCR
Before doing the digital PCR, I will filter for only full length sequences, since shorter sequences might not have a primer binding site and the digital PCR would fail, not because of an inability of a real primer to bind to real DNA, but because of missing sequence information in the database.
Ideally, this kind of filtering for full length sequences would involve something like a multiple sequences alignment so we don't remove sequences that are actually full length, but just happen to be shorter than the cutoff of `r min_seq_length` naturally.
However, this method is easy and should work OK.
```{r rdp_length_filter, cache = TRUE}
if (! is.null(min_seq_length)) {
long_enough <- which(vapply(rdp$data$sequence, length, numeric(1)) >= min_seq_length)
rdp <- filter_obs(rdp, "sequence", long_enough)
rdp <- filter_obs(rdp, "tax_data", long_enough, drop_taxa = TRUE)
}
print(rdp)
```
Next I will conduct digital PCR with a new set of universal 16S primers [@walters2016improved], allowing for a maximum mismatch of `r max_mismatch`%.
```{r rdp_pcr, cache = TRUE}
rdp_pcr <- primersearch(rdp, sequence,
forward = forward_primer,
reverse = reverse_primer,
mismatch = max_mismatch)
```
Now the object `rdp_pcr` has all the information that `rdp` has plus the results of the digital PCR.
Lets plot the whole database again, but coloring based on digital PCR success.
```{r rdp_plot_pcr_all, cache = TRUE}
rdp_plot_pcr_all <- heat_tree(rdp_pcr,
node_size = seq_count,
node_size_range = size_range * 2,
edge_size_range = size_range,
node_size_interval = all_size_interval,
edge_size_interval = all_size_interval,
node_label = taxon_names,
node_color = prop_amplified,
node_color_range = pcr_success_color_scale,
node_color_trans = "linear",
node_label_size_range = label_size_range,
node_label_max = label_max,
edge_color_interval = c(0, 1),
node_color_interval = c(0, 1),
node_color_axis_label = "Proportion PCR success",
node_size_axis_label = "Sequence count",
output_file = result_path("rdp--pcr_all"))
print(rdp_plot_pcr_all)
```
Since these are universal bacterial primers, we would expect most of the database to amplify.
The plot shows this and very few sequences were not amplified.
If we wanted to look at only those sequences that did not amplify, we could filter taxa by PCR success and plot what remains.
I am including the supertaxa of those taxa that did not amplify since excluding them would split the tree into a cloud of fragments lacking taxonomic context.
```{r, rdp_plot_pcr_fail, cache = TRUE}
rdp_plot_pcr_fail <- rdp_pcr %>%
filter_taxa(prop_amplified < pcr_success_cutoff, supertaxa = TRUE) %>%
heat_tree(node_size = query_count - seq_count,
node_label = taxon_names,
node_color = prop_amplified,
node_size_range = size_range * 2,
edge_size_range = size_range,
node_size_interval = pcr_size_interval,
edge_size_interval = pcr_size_interval,
node_color_range = pcr_success_color_scale,
node_color_trans = "linear",
node_color_interval = c(0, 1),
edge_color_interval = c(0, 1),
node_label_size_range = label_size_range,
node_label_max = label_max,
node_color_axis_label = "Proportion PCR success",
node_size_axis_label = "Sequences not amplified",
output_file = result_path("rdp--pcr_fail"))
print(rdp_plot_pcr_fail)
```
## Save outputs for composite figure
Some results from this file will be combined with others from similar analyses to make a composite figure.
Below, the needed objects are saved so that they can be loaded by another Rmd file.
```{r rdp_save, cache = TRUE}
save(file = file.path(output_folder,"rdp_data.RData"),
rdp_plot_all, rdp_plot_pcr_fail)
```
## Software and packages used
```{r, cache = TRUE}
sessionInfo()
```
## References