publication--02--rdp.Rmd

---
title: Digital PCR with the RDP database
bibliography: bibliography.bibtex
---

```{r init, echo = FALSE, message = FALSE}
library(knitr)
if (! is.null(current_input())) { # if knitr is being used
  knitr::read_chunk("settings.R")
} else {
  source("settings.R")
}
```

```{r rendering_settings, include=FALSE}
```

## Requirements 

**NOTE:** This analysis requires at least 10Gb of RAM to run.
It uses large files not included in the repository and many steps can take a few minutes to run. 


## Parameters

### Analysis input/output

```{r io_settings}
```

### Analysis parameters

These settings are shared between the RDP, SILVA, and Greengenes analyses, since the results of those three analyeses are combined in one plot later. 
This ensures that all of the graphs use the same color and size scales, instead of optimizing them for each data set, as would be done automatically otherwise.

```{r database_comparison_settings}
```


## Parse database

The code below parses and subsets the entire RDP non-redundant reference database [@maidak2001rdp].
The input file is not included in this repository because it is 3.6GB, but you can download it from the RDP website here: 

https://rdp.cme.msu.edu/misc/resources.jsp

```{r}
library(metacoder)
```

```{r rdp_load, cache = TRUE}
rdp_seq <- ape::read.FASTA(file.path(input_folder, "rdp_current_Bacteria_unaligned.fa"))
rdp <- extract_tax_data(names(rdp_seq),
                        regex = "\\tLineage(.*)",
                        key = c("class"),
                        class_regex = "[;=](.+?);(.+?)",
                        class_key = c("taxon_name", rdp_rank = "taxon_rank"))
# rdp_seq <- unlist(lapply(as.character(rdp_seq), paste0, collapse = ""))
rdp$data$sequence <- setNames(rdp_seq, rdp$data$tax_data$taxon_id)
rdp$data$class_data <- NULL
print(rdp)
```

## Subset

Next I will subset the taxa in the dataset (depending on parameter settings). 
This can help make the graphs less cluttered and make it easier to compare databases. 

```{r rdp_subset, cache = TRUE}
if (! is.null(min_seq_count)) {
  rdp <- filter_taxa(rdp, n_obs >= min_seq_count)
}
if (just_bacteria) {
  rdp <- filter_taxa(rdp, taxon_names == "Bacteria", subtaxa = TRUE)
}
if (! is.null(max_taxonomy_depth)) {
  rdp <- filter_taxa(rdp, n_supertaxa <= max_taxonomy_depth)
}
print(rdp)
```


## Remove chloroplast sequences

These are not bacterial and will bias the digital PCR results.
Note that the `invert` option makes it so taxa are included that *did not* pass the filter. 
This is different than simply using `name != "Chloroplast", subtaxa = TRUE` since the effects of `invert` are applied after those of `subtaxa`.

```{r rdp_rm_chloro, cache = TRUE}
rdp <- filter_taxa(rdp, taxon_names == "Cyanobacteria/Chloroplast", subtaxa = TRUE, invert = TRUE)
print(rdp)
```


## Plot whole database

Although graphing everything can be a bit overwhelming (`r nrow(rdp$taxon_data)` taxa), it gives an intuitive feel for the complexity of the database:

```{r rdp_plot_all, cache = TRUE}
rdp_plot_all <- heat_tree(rdp,
                          node_size = n_obs,
                          node_color = n_obs,
                          node_size_range = size_range * 2,
                          edge_size_range = size_range,
                          node_size_interval = all_size_interval,
                          edge_size_interval = all_size_interval,
                          node_color_interval = all_size_interval,
                          edge_color_interval = all_size_interval,
                          node_label = taxon_names,
                          node_label_size_range = label_size_range,
                          node_label_max = label_max,
                          node_color_axis_label = "Sequence count",
                          output_file = result_path("rdp--all"))
print(rdp_plot_all)
```


## PCR

Before doing the digital PCR, I will filter for only full length sequences, since shorter sequences might not have a primer binding site and the digital PCR would fail, not because of an inability of a real primer to bind to real DNA, but because of missing sequence information in the database. 
Ideally, this kind of filtering for full length sequences would involve something like a multiple sequences alignment so we don't remove sequences that are actually full length, but just happen to be shorter than the cutoff of `r min_seq_length` naturally.
However, this method is easy and should work OK.

```{r rdp_length_filter, cache = TRUE}
if (! is.null(min_seq_length)) {
  long_enough <- which(vapply(rdp$data$sequence, length, numeric(1)) >= min_seq_length)
  rdp <- filter_obs(rdp, "sequence", long_enough)
  rdp <- filter_obs(rdp, "tax_data", long_enough, drop_taxa = TRUE)
}
print(rdp)
```

Next I will conduct digital PCR with a new set of universal 16S primers [@walters2016improved], allowing for a maximum mismatch of `r max_mismatch`%.

```{r rdp_pcr, cache = TRUE}
rdp_pcr <- primersearch(rdp, sequence,
                        forward = forward_primer,
                        reverse = reverse_primer,
                        mismatch = max_mismatch)
```

Now the object `rdp_pcr` has all the information that `rdp` has plus the results of the digital PCR.
Lets plot the whole database again, but coloring based on digital PCR success.

```{r rdp_plot_pcr_all, cache = TRUE}
rdp_plot_pcr_all <- heat_tree(rdp_pcr,
                              node_size = seq_count,
                              node_size_range = size_range * 2,
                              edge_size_range = size_range,
                              node_size_interval = all_size_interval,
                              edge_size_interval = all_size_interval,
                              node_label = taxon_names,
                              node_color = prop_amplified,
                              node_color_range =  pcr_success_color_scale,
                              node_color_trans = "linear",
                              node_label_size_range = label_size_range,
                              node_label_max = label_max,
                              edge_color_interval = c(0, 1),
                              node_color_interval = c(0, 1),
                              node_color_axis_label = "Proportion PCR success",
                              node_size_axis_label = "Sequence count",
                              output_file = result_path("rdp--pcr_all"))
print(rdp_plot_pcr_all)
```

Since these are universal bacterial primers, we would expect most of the database to amplify.
The plot shows this and very few sequences were not amplified.
If we wanted to look at only those sequences that did not amplify, we could filter taxa by PCR success and plot what remains. 
I am including the supertaxa of those taxa that did not amplify since excluding them would split the tree into a cloud of fragments lacking taxonomic context.

```{r, rdp_plot_pcr_fail, cache = TRUE}
rdp_plot_pcr_fail <- rdp_pcr %>%
  filter_taxa(prop_amplified < pcr_success_cutoff, supertaxa = TRUE) %>%
  heat_tree(node_size = query_count - seq_count,
            node_label = taxon_names,
            node_color = prop_amplified,
            node_size_range = size_range * 2,
            edge_size_range = size_range,
            node_size_interval = pcr_size_interval,
            edge_size_interval = pcr_size_interval,
            node_color_range =  pcr_success_color_scale,
            node_color_trans = "linear",
            node_color_interval = c(0, 1),
            edge_color_interval = c(0, 1),
            node_label_size_range = label_size_range,
            node_label_max = label_max,
            node_color_axis_label = "Proportion PCR success",
            node_size_axis_label = "Sequences not amplified",
            output_file = result_path("rdp--pcr_fail"))
print(rdp_plot_pcr_fail)
```

## Save outputs for composite figure

Some results from this file will be combined with others from similar analyses to make a composite figure.
Below, the needed objects are saved so that they can be loaded by another Rmd file.

```{r rdp_save, cache = TRUE}
save(file = file.path(output_folder,"rdp_data.RData"),
     rdp_plot_all, rdp_plot_pcr_fail)
```

## Software and packages used

```{r, cache = TRUE}
sessionInfo()
```


## References