---
title: "Human microbiome example"
bibliography: bibliography.bibtex
---

```{r init, echo = FALSE, message = FALSE}
library(knitr)
if (! is.null(current_input())) { # if knitr is being used
  knitr::read_chunk("settings.R")
} else {
  source("settings.R")
}
```

```{r rendering_settings, include=FALSE}
```

## Requirements 

**NOTE:** This analysis requires at least 10 GB of RAM to run.
It uses large files not included in the repository, and many steps can take a few minutes to run.

## Parameters

### Analysis input/output

```{r io_settings}
```

### Analysis parameters

```{r hmp_parameters}
matrix_plot_depth <- 6 # The maximum number of ranks to display
color_interval <- c(-4, 4) # The range of values (log 2 ratio of median proportion) to display
min_read_count <- 100 # The minimum number of reads needed for a taxon to be graphed
otu_file_url <- "http://downloads.hmpdacc.org/data/HMQCP/otu_table_psn_v35.txt.gz"
mapping_file_url <- "http://downloads.hmpdacc.org/data/HMQCP/v35_map_uniquebyPSN.txt.bz2"
```

## Downloading the data

This analysis uses OTU abundance data from the Human Microbiome Project (HMP) [@human2012structure], which used 16S metabarcoding to explore the bacterial microbiome of various human body sites.
First, we will download the files from the HMP website.

```{r hmp_download, warning=FALSE, message=FALSE, cache = TRUE}
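download_file <- function(url) {
  # Save to the session's temporary directory under the remote file name
  temp_file_path <- file.path(tempdir(), basename(url))
  # mode = "wb" keeps the compressed files intact on Windows
  utils::download.file(url = url, destfile = temp_file_path, mode = "wb", quiet = TRUE)
  return(temp_file_path)
}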
otu_file <- download_file(otu_file_url)
mapping_file <- download_file(mapping_file_url)
```

## Parsing the data

Then we will read these files into R and tidy up the data a bit.
We will use the `readr` package to read these `r gloss$add("Tab-delimited text file", shown = "TSV")` files.


```{r hmp_parse_raw_otu, cache = TRUE}
library(readr)
otu_data <- read_tsv(otu_file, skip = 1) # Skip the first line, which is a comment
colnames(otu_data)[1] <- "otu_id" # make column name more R friendly
print(otu_data)
```

The last column, "Consensus Lineage", contains the `r gloss$add("taxonomic classification")`, so we will use it to convert the rest of the data into a `taxmap` object.

```{r}
library(metacoder)
```

Since this is tabular data with an embedded taxonomic classification, we use `parse_tax_data`:

```{r hmp_parse_otu_table, cache = TRUE}
hmp_data <- parse_tax_data(otu_data,
                           class_cols = "Consensus Lineage",
                           class_regex = "([a-z]{0,1})_{0,2}(.*)$",
                           class_key = c(hmp_rank = "taxon_rank", hmp_tax = "taxon_name"),
                           class_sep = ";")
hmp_data$data$class_data <- NULL # Delete unneeded regex match table
names(hmp_data$data) <- "otu_count" # Rename abundance matrix to something more understandable
print(hmp_data)
```
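
To see what `class_sep` and `class_regex` are doing, it can help to apply the same pattern to a single lineage by hand. The sketch below uses only base R and a made-up lineage string; `parse_tax_data` may use a different regex engine internally, so treat this as an approximation of the parsing step and not as what the function actually runs.

```r
# Toy lineage in the style of the "Consensus Lineage" column (made up for illustration)
lineage <- "Root;p__Firmicutes;c__Bacilli;o__Lactobacillales"

# Split on the separator, as `class_sep = ";"` does
parts <- strsplit(lineage, ";", fixed = TRUE)[[1]]

# Apply the same pattern that was passed to `class_regex`
pattern <- "([a-z]{0,1})_{0,2}(.*)$"
matches <- regmatches(parts, regexec(pattern, parts))

# First capture group = rank code, second capture group = taxon name
parsed <- data.frame(
  hmp_rank = vapply(matches, function(m) m[2], character(1)),
  hmp_tax  = vapply(matches, function(m) m[3], character(1)),
  stringsAsFactors = FALSE
)
print(parsed)
```

The first capture group is the one-letter rank code (empty for `Root`) and the second is the taxon name; these are the two pieces that `class_key` labels as `taxon_rank` and `taxon_name`.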

I will save the information on sample characteristics (e.g. body site) in a separate table, since it is not directly associated with taxonomic information.

```{r hmp_sample_data, cache = TRUE}
sample_data <- read_tsv(mapping_file,
                        col_types = "ccccccccc") # Force all columns to be character
colnames(sample_data) <- c("sample_id", "rsid", "visit_no", "sex", "run_center",
                           "body_site", "mislabeled", "contaminated", "description")
print(sample_data)
```

In this analysis, I will be looking at just a subset of the body sites, so I will subset the sample data to just those.

```{r}
library(dplyr)
```

```{r}
sites_to_compare <- c("Saliva", "Tongue_dorsum", "Buccal_mucosa", "Anterior_nares", "Stool")
sample_data <- filter(sample_data, body_site %in% sites_to_compare)
```

Not all the samples in the sample data table appear in the abundance matrix, some samples in the abundance matrix do not appear in the sample data, and the two are in a different order.
Making the two correspond will probably make things easier later on.
First, let's identify which samples they share:

```{r, cache = TRUE}
sample_names <- intersect(colnames(hmp_data$data$otu_count), sample_data$sample_id)
```

We can then use `match` to order and subset the sample data based on the shared samples.

```{r, cache = TRUE}
sample_data <- sample_data[match(sample_names, sample_data$sample_id), ]
```

We can also subset the columns in the abundance matrix to just these shared samples.
We also need to preserve the first two columns, the taxon and OTU IDs.

```{r, cache = TRUE}
hmp_data$data$otu_count <- hmp_data$data$otu_count[ , c("taxon_id", "otu_id", sample_names)]
```
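
The `intersect` and `match` steps above can be tried on a small made-up example. All names here (`matrix_cols`, `toy_samples`, the sample IDs) are hypothetical and only illustrate the idea; this is a sketch, not part of the analysis.

```r
# Hypothetical IDs: the matrix and the metadata overlap only partly and differ in order
matrix_cols <- c("taxon_id", "otu_id", "s3", "s1", "s4")
toy_samples <- data.frame(sample_id = c("s1", "s2", "s3"),
                          body_site = c("Stool", "Saliva", "Stool"),
                          stringsAsFactors = FALSE)

# IDs present in both, in the order they appear in the matrix columns
shared <- intersect(matrix_cols, toy_samples$sample_id)

# Reorder and subset the metadata rows so they follow `shared`
toy_samples <- toy_samples[match(shared, toy_samples$sample_id), ]

# Sanity check: the metadata rows now line up with the shared matrix columns
stopifnot(identical(toy_samples$sample_id, shared))
print(toy_samples)
```

A check like the `stopifnot` line is cheap insurance, since later steps pass `sample_data$sample_id` and `sample_data$body_site` side by side and rely on them being in the same order.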

## Filtering the taxonomy

Looking at the names of the taxa in the taxonomy, there are a lot of names that appear to be codes for uncharacterized taxa: 

```{r, cache = TRUE}
head(taxon_names(hmp_data), n = 20)
```

Those might be relevant depending on the research question, but for this analysis I chose to remove them, since they do not mean much to me (you might want to keep such taxa in your analysis).
We can remove taxa from a `taxmap` object using the `filter_taxa` function.
I will remove them by removing any taxon whose name is not composed of only letters.

```{r, cache = TRUE}
hmp_data <- filter_taxa(hmp_data, grepl(taxon_names, pattern =  "^[a-zA-Z]+$"))
print(hmp_data)
```

Note that we have not removed any OTUs even though there were OTUs assigned to those taxa.
Those OTUs were automatically reassigned to a `r gloss$add("supertaxon")` that was not filtered out.
This also removed taxa with no name (`""`), since the `+` in the above regex means "one or more".
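
As a quick check of what this pattern keeps and drops, here is a small sketch on made-up names (the names in `toy_names` are invented for illustration and are not taken from the HMP data):

```r
# Made-up names: ordinary-looking taxa, code-like names, and an empty name
toy_names <- c("Firmicutes", "IncertaeSedis", "TM7", "SR1", "Clostridiales_XI", "")

# Same pattern as above: one or more letters and nothing else
keep <- grepl(toy_names, pattern = "^[a-zA-Z]+$")
print(data.frame(name = toy_names, kept = keep))
```

Note that the pattern is strict: any name containing a digit, underscore, space, or hyphen is dropped, so it is worth checking that no taxa you care about are removed by accident.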


## Converting counts to proportions

We can now convert these counts to proportions, a simple alternative to rarefaction, using `calc_obs_props`.
This accounts for uneven numbers of sequences for each sample.

```{r hmp_calc_props, cache = TRUE}
hmp_data$data$otu_prop <- calc_obs_props(hmp_data,
                                         data = "otu_count",
                                         cols = sample_data$sample_id)
```
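
The idea behind this step can be sketched in base R. I am assuming here that `calc_obs_props` divides each count by the total for its sample (column), which is my understanding of its default behavior; the toy matrix below is made up.

```r
# Toy counts: 3 OTUs (rows) by 2 samples (columns) with very different read depths
toy_counts <- matrix(c(10, 30, 60,
                       100, 300, 600),
                     ncol = 2,
                     dimnames = list(c("otu_1", "otu_2", "otu_3"), c("s1", "s2")))

# Divide every column by its total so that each sample sums to 1
toy_props <- sweep(toy_counts, 2, colSums(toy_counts), "/")
print(toy_props)
```

Although `s2` has ten times as many reads as `s1`, the two samples have identical proportions, which is what makes them comparable.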


## Calculating abundance per taxon

The input data included read abundance for each sample-OTU combination, but we need the abundances associated with each taxon for graphing.
There will usually be multiple OTUs assigned to the same taxon, especially for coarse taxonomic ranks (e.g. the root will have all OTU indexes), so the abundances at those indexes are summed to provide the total abundance for each taxon using the `calc_taxon_abund` function.

```{r hmp_calc_tax_abund, cache = TRUE}
hmp_data$data$tax_prop <- calc_taxon_abund(hmp_data, 
                                           data = "otu_prop",
                                           cols = sample_data$sample_id)
```
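
To make the summing concrete, here is a minimal base R sketch for one hypothetical sample. The lineages and proportions are made up, and `calc_taxon_abund` works on the `taxmap` object itself, so this only illustrates the principle that a taxon's abundance includes all OTUs assigned to it or to any of its subtaxa.

```r
# Each toy OTU is assigned to its most specific taxon; the lineage lists that
# taxon and all of its supertaxa
otu_lineages <- list(
  otu_1 = c("Bacteria", "Firmicutes", "Bacilli"),
  otu_2 = c("Bacteria", "Firmicutes"),
  otu_3 = c("Bacteria", "Proteobacteria")
)
otu_prop <- c(otu_1 = 0.5, otu_2 = 0.25, otu_3 = 0.25) # proportions in one sample

# For every taxon, sum the proportions of all OTUs whose lineage passes through it
taxa <- unique(unlist(otu_lineages))
taxon_prop <- vapply(taxa, function(taxon) {
  in_taxon <- vapply(otu_lineages, function(lineage) taxon %in% lineage, logical(1))
  sum(otu_prop[in_taxon])
}, numeric(1))
print(taxon_prop)
```

The root (`Bacteria`) receives every OTU, so its proportion is 1, while more specific taxa receive progressively less.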


## Plot of everything

To get an idea of how the data looks overall, let's make a plot showing OTU and read abundance of all the data combined.
I will exclude any taxon with a mean read proportion of less than 0.001.

```{r hmp_plot_all, cache = TRUE}
set.seed(1)
plot_all <- hmp_data %>%
  mutate_obs("tax_prop", abundance = rowMeans(hmp_data$data$tax_prop[sample_data$sample_id])) %>%
  filter_taxa(abundance >= 0.001) %>%
  filter_taxa(taxon_names != "") %>% # Some taxonomic levels are not named
  heat_tree(node_size = n_obs,
            node_size_range = c(0.01, 0.06),
            node_size_axis_label = "Number of OTUs",
            node_color = abundance,
            node_color_axis_label = "Mean proportion of reads",
            node_label = taxon_names,
            output_file = result_path("hmp--all_data"))
print(plot_all)
```

Here we can see that there are relatively few abundant taxa and many rare ones.
However, we do not know anything about the species-level diversity, since the classifications go to genus only.


## Comparing taxon abundance among treatments

Assuming that differences in read depth correlate with differences in taxon abundance (a controversial assumption), we can compare taxon abundance among treatments using the `compare_groups` function.
This function assumes you have multiple samples per treatment (i.e. group).
It splits the counts up based on which group the samples came from and, for each row of the data (each taxon in this case) and each pair of groups (e.g. Nose vs Saliva samples), applies a function to generate statistics summarizing the differences in abundance.
You can create a custom function to return your own set of statistics, but the default is to do a `r gloss$add("Wilcoxon Rank Sum test")` on the differences in median abundance for the samples.
See `?compare_groups` for more details.

```{r hmp_treat_comp, cache = TRUE}
hmp_data$data$diff_table <- compare_groups(hmp_data,
                                           data = "tax_prop",
                                           cols = sample_data$sample_id,
                                           groups = sample_data$body_site)
```
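
For intuition, the kind of per-taxon, per-pair comparison described above can be sketched in base R for a single taxon. The data are made up, and the helper `compare_pair` is hypothetical; the output column names mirror the ones used later in this document, but this is not the implementation inside `compare_groups`.

```r
# Made-up proportions of one taxon in six samples from two hypothetical body sites
toy_abund <- c(0.10, 0.12, 0.08, 0.02, 0.03, 0.01)
toy_groups <- c("Saliva", "Saliva", "Saliva", "Stool", "Stool", "Stool")

# Every pair of groups; with more than two sites this gives one row per pair
group_pairs <- t(combn(unique(toy_groups), 2))

# Hypothetical helper: summary statistics for one pair of groups
compare_pair <- function(pair) {
  x <- toy_abund[toy_groups == pair[1]]
  y <- toy_abund[toy_groups == pair[2]]
  data.frame(treatment_1 = pair[1],
             treatment_2 = pair[2],
             log2_median_ratio = log2(median(x) / median(y)),
             wilcox_p_value = wilcox.test(x, y)$p.value,
             stringsAsFactors = FALSE)
}

toy_diff <- do.call(rbind, lapply(seq_len(nrow(group_pairs)),
                                  function(i) compare_pair(group_pairs[i, ])))
print(toy_diff)
```

In the real analysis this is repeated for every taxon, which is why the resulting table has one row per taxon per pair of body sites.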

We just did **a lot** (`r nrow(hmp_data$data$diff_table)`) of independent statistical tests, which means we probably got many false positives if we consider p < 0.05 to be significant. 
To fix this, we need to do a `r gloss$add('Multiple comparison corrections', shown = 'correction for multiple comparisons')` to adjust the p-values.
We will also set any differences that are not significant after the correction to `0`, so that they do not show up when plotting. 

```{r}
hmp_data <- mutate_obs(hmp_data, "diff_table",
                       wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"),
                       log2_median_ratio = ifelse(wilcox_p_value < 0.05 | is.na(wilcox_p_value), log2_median_ratio, 0))
```
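
The effect of the adjustment is easiest to see on a handful of made-up p-values. This sketch uses only `p.adjust` from base R; the values in `raw_p` and `toy_ratio` are invented for illustration.

```r
# Made-up raw p-values and log2 ratios from five hypothetical tests
raw_p <- c(0.001, 0.01, 0.04, 0.045, 0.5)
toy_ratio <- c(3.2, -2.1, 1.5, -0.8, 0.3)

# Benjamini-Hochberg false discovery rate adjustment ("fdr" is an alias for "BH")
adj_p <- p.adjust(raw_p, method = "fdr")

# Zero out ratios that are no longer significant after the adjustment
toy_ratio_filtered <- ifelse(adj_p < 0.05, toy_ratio, 0)
print(data.frame(raw_p, adj_p, toy_ratio_filtered))
```

Four of the five raw p-values are below 0.05, but only two remain below it after adjustment, so only those two ratios would be colored in a plot.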

Now we can make what we call a "differential heat tree matrix" to plot the results of these tests.

```{r}
hmp_data %>%
  mutate_obs("tax_prop", abundance = rowMeans(hmp_data$data$tax_prop[sample_data$sample_id])) %>%
  filter_taxa(abundance >= 0.001, reassign_obs = c(diff_table = FALSE)) %>%
  heat_tree_matrix(data = "diff_table",
                   node_size = n_obs,
                   node_size_range = c(0.01, 0.05),
                   node_label = taxon_names,
                   node_color = log2_median_ratio,
                   node_color_range = diverging_palette(),
                   node_color_trans = "linear",
                   node_color_interval = c(-7, 7),
                   edge_color_interval = c(-7, 7),
                   node_size_axis_label = "Number of OTUs",
                   node_color_axis_label = "Log2 ratio median proportions",
                   # initial_layout = "re",
                   # layout = "da",
                   key_size = 0.7,
                   seed = 4,
                   output_file = result_path("figure_3--hmp_matrix_plot"))
```

## Plot body site differences

The HMP dataset is great for comparing treatments since there are many body sites with many replicates, so statistical tests can be used to find correlations between body sites and taxon read abundance (the relationship between read abundance and organism abundance is less clear and open to debate).
The code below plots the results of the Wilcoxon rank-sum tests computed above for differences in median read proportion between a given pair of body sites.
Since the data is compositional in nature (i.e. not independent samples), we used a non-parametric test and the median instead of the mean read proportion.

```{r hmp_diff_funcs} 
plot_body_site_diff <- function(site_1, site_2, output_name, seed = 1) {
  set.seed(seed)
  hmp_data %>%
    mutate_obs("tax_prop", abundance = rowMeans(hmp_data$data$tax_prop[sample_data$sample_id])) %>%
    filter_taxa(abundance >= 0.001, reassign_obs = FALSE) %>%
    filter_taxa(taxon_names != "", reassign_obs = FALSE) %>% # Some taxonomic levels are not named
    filter_obs("diff_table", treatment_1 %in% c(site_1, site_2), treatment_2 %in% c(site_1, site_2)) %>%
    heat_tree(node_size_axis_label = "Number of OTUs",
              node_size = n_obs,
              node_color_axis_label = "Log 2 ratio of median proportions",
              node_color = log2_median_ratio,
              node_color_range = diverging_palette(),
              node_color_trans = "linear",
              node_color_interval = c(-5, 5),
              edge_color_interval = c(-5, 5),
              node_label = taxon_names,
              output_file = result_path(paste0(output_name, "--", site_1, "_vs_", site_2)))
}
```

```{r hmp_diff_plot}
plot_body_site_diff("Saliva", "Anterior_nares", "hmp--v35")
plot_body_site_diff("Buccal_mucosa", "Anterior_nares", "hmp--v35")
```

We call these types of graphs **differential heat trees** and they are great for comparing any type of data associated with two samples or treatments.

## Software and packages used

```{r}
sessionInfo()
```

## References