workshop--07--diversity_stats.Rmd

---
output: html_document
bibliography: "bibliography.bibtex"
---

```{r setup, include=FALSE}
source("settings.R")
knitr::opts_chunk$set(fig.height = 5, fig.width = 5)
```

# Diversity statistics

## Load example data

If you are starting the workshop at this section, or had problems running code in a previous section, use the following to load the data used in this section.
You can download the "clean_data.Rdata" file <a href="clean_data.Rdata" download="clean_data.Rdata">here</a>.
If `obj` and `sample_data` are already in your environment, you can ignore this and proceed.

```{r}
load("clean_data.Rdata")
```

## Measures of diversity

Diversity in the ecological sense is intuitively understood as the complexity of a community of organisms.
There are many ways to quantify this complexity so that we can compare communities objectively.
The two main categories of methods are known as **alpha diversity** and **beta diversity** [@whittaker1960vegetation].
Alpha diversity measures the diversity within a single sample and is generally based on the number and relative abundance of taxa at some rank (e.g. species or OTUs).
Beta diversity also uses the number of relative abundance of taxa at some rank, but measures variation between samples.
In other words, an alpha diversity statistic describes a single sample and a beta diversity statistic describes how two samples compare.

The `vegan` package is the main tool set used for calculating biological diversity statistics in R.

## Alpha (within sample) diversity 

Common alpha diversity statistics include: 

* **Shannon**: How difficult it is to predict the identity of a randomly chosen individual.
* **Simpson**: The probability that two randomly chosen individuals are the same species.
* **Inverse Simpson**: This is a bit confusing to think about. Assuming a theoretical community where all species were equally abundant, this would be the number of species needed to have the same Simpson index value for the community being analyzed.

There are also some diversity indexes that take into account the taxonomic similarity of the species called "taxonomic diversity" and "taxonomic distinctness", but we will not go into those.

The `diversity` function from the `vegan` package can be used to calculate the alpha diversity of a set of samples.
Like other vegan functions, it assumes that samples are in rows, but they are in columns in our data, so we need to use the `MARGIN = 2` option.
We also need to exclude the taxon ID column by subsetting the columns to only samples (i.e. all column besides the first one).
Since alpha diversity is a per-sample attribute, we can just add this as a column to the sample data table:

```{r message=FALSE}
library(vegan)
sample_data$alpha <- diversity(obj$data$otu_rarefied[, sample_data$SampleID],
                                 MARGIN = 2,
                                 index = "invsimpson")
hist(sample_data$alpha)
```

Adding this as a column to the sample data table makes it easy to graph using `ggplot2`.

```{r message=FALSE}
library(ggplot2)
ggplot(sample_data, aes(x = Site, y = alpha)) + 
  geom_boxplot()
```

We can use `r gloss$add('analysis of variance (ANOVA)')` to tell if at least one of the diversity means is different from the rest.

```{r}
anova_result <- aov(alpha ~ Site, sample_data)
summary(anova_result)
```

That tells us that there is a difference, but does not tell us which means are different.
A `r gloss$add("Tukey's Honest Significant Difference (HSD)")` test can do pairwise comparisons of the means to find this out.
We will use the `HSD.test` function from the `agricolae` package since it provides grouping codes that are useful for graphing.

```{r}
library(agricolae)
tukey_result <- HSD.test(anova_result, "Site", group = TRUE)
print(tukey_result)
```

Looking at the `tukey_result$groups` table it appears that the alpha diversity of sites "Sil" and "Mah" might not be different, but there is evidence that the diversity in site "Jam" is lower.
We can add this information to the graph using the `tukey_result$groups$groups` codes:

```{r }
group_data <- tukey_result$groups[order(rownames(tukey_result$groups)),]
ggplot(sample_data, aes(x = Site, y = alpha)) +
  geom_text(data = data.frame(),
            aes(x = rownames(group_data), y = max(sample_data$alpha) + 1, label = group_data$groups),
            col = 'black',
            size = 10) +
  geom_boxplot() +
  ggtitle("Alpha diversity") +
  xlab("Site") +
  ylab("Alpha diversity index")
```

So that takes care of comparing the alpha diversity of sites, but there are other interesting groupings we can compare, such as the genotype and the type of the sample (roots vs leaves).
We could do the above all over with minor modifications, but one of the benefits of using a programming language is that you can create your own functions to automate repeated tasks.
We can generalize what we did above and put it in a function like so:

```{r}
compare_alpha <- function(sample_data, grouping_var) {
  # Calcuate alpha diversity
  sample_data$alpha <- diversity(obj$data$otu_rarefied[, sample_data$SampleID],
                                 MARGIN = 2,
                                 index = "invsimpson")
  
  # Do ANOVA
  sample_data$grouping <- sample_data[[grouping_var]] # needed for how `aov` works
  anova_result <- aov(alpha ~ grouping, sample_data)
  
  # Do Tukey's HSD test
  tukey_result <- HSD.test(anova_result, "grouping", group = TRUE)

  # Plot result
  group_data <- tukey_result$groups[order(rownames(tukey_result$groups)),]
  my_plot <- ggplot(sample_data, aes(x = grouping, y = alpha)) +
    geom_text(data = data.frame(),
              aes(x = rownames(group_data),
                  y = max(sample_data$alpha) + 1,
                  label = group_data$groups),
              col = 'black',
              size = 10) +
    geom_boxplot() +
    ggtitle("Alpha diversity") +
    xlab(grouping_var) +
    ylab("Alpha diversity index")
  
  # Return plot
  return(my_plot)
}
```

Using this function, we can compare plot the alpha diversities by type of sample and genotype:

```{r }
compare_alpha(sample_data, "Type")
compare_alpha(sample_data, "Genotype")
```

Looks like there is no difference in the alpha diversity between genotypes, but a large difference between the diversity of roots and leaves.

The `phyloseq` package (@mcmurdie2013phyloseq) can be used to quickly plot a variety of alpha diversity indexes per sample using the `plot_richness` function.
First we need to convert the `taxmap` object to a `phyloseq` object, since all of the `phyloseq` functions expect `phyloseq` objects.

```{r message=FALSE, fig.width=10, fig.height=5}
library(phyloseq)
library(metacoder)
ps_obj <- as_phyloseq(obj,
                      otu_table = "otu_rarefied",
                      otu_id_col = "OTU_ID",
                      sample_data = sample_data,
                      sample_id_col = "SampleID")
plot_richness(ps_obj, color = "Type", x = "Site")
```

Each dot is a sample and the error bars in some of the indexes are the standard error.


## Plotting taxon abundance with heat trees 

Alpha diversity statistics capture the diversity of whole samples in a single number, but to see the abundance of each taxon in a group of samples (e.g., root samples), we need to use other techniques.
Stacked barcharts are typically used for this purpose, but we will be using heat trees.
First, we need to calculate the abundance of each taxon for a set of samples from our OTU abundance information.
We can use the `calc_taxon_abund` function to do this, and then find the mean abundance using a sample characteristic:

```{r}
obj$data$tax_abund <- calc_taxon_abund(obj, "otu_props")

obj$data$type_abund <- calc_group_mean(obj, "tax_abund",
                                       cols = sample_data$SampleID,
                                       groups = sample_data$Type)

print(obj$data$type_abund)
```

Now we can use these per-taxon abundances to make heat trees of the primary taxa present in leafs and roots:

```{r fig.width = 10, fig.height = 10}
set.seed(2)
obj %>%
  metacoder::filter_taxa(leaf > 0.001) %>% # metacoder:: needed because of phyloseq::filter_taxa
  heat_tree(node_label = taxon_names,
            node_size = leaf, 
            node_color = leaf, 
            layout = "da", initial_layout = "re", 
            title = "Taxa in leafs")
```

```{r fig.width = 10, fig.height = 10}
set.seed(3)
obj %>%
  metacoder::filter_taxa(root > 0.001) %>% # metacoder:: needed because of phyloseq::filter_taxa
  heat_tree(node_label = taxon_names,
            node_size = root, 
            node_color = root, 
            layout = "da", initial_layout = "re", 
            title = "Taxa in roots")
```

Note that we needed to qualify `filter_taxa` with `metacoder::`.
This is because `phyloseq` is loaded and it also has a function called `filter_taxa`.
When two functions from different packages have the same name, the function from the package that was loaded last is called, unless `package_name::` is added. 


## Beta (between sample) diversity

Beta diversity is a way to quantify the difference between two communities. 
There are many metrics that are used for this, but we will only mention a few of the more popular ones.
A few also incorporate phylogenetic relatedness and require a phylogenetic tree of the organisms in either community to be calculated.

#### Examples of indexes used with presence/absence data:

* **Sørensen**: two times the number of species common to both communities divided by the sum of the number of species in each community. 
* **Jaccard**: the number of species common to both communities divided by the number of species in either community.
* **Unifrac**: The fraction of the phylogenetic tree branch lengths shared by the two communities.

#### Examples of indexes used with count data:

* **Bray–Curtis**: The sum of lesser counts for species present in both communities divided by the sum of all counts in both communities. This can be thought of as a quantitative version of the Sørensen index.
* **Weighted Unifrac**: The fraction of the phylogenetic tree branch lengths shared by the two communities, weighted by the counts of organisms, so more abundant organisms have a greater influence.

The `vegan` function `vegdist` is used to calculate the pairwise beta diversity indexes for a set of samples.
Since this is a pairwise comparison, the output is a triangular matrix.
In R, a `matrix` is like a `data.frame`, but all of the same type (e.g. all numeric), and has some different behavior.

```{r fig.width=7, fig.height=5}
beta_dist <- vegdist(t(obj$data$otu_rarefied[, sample_data$SampleID]),
                     index = "bray")
```

Since `vegdist` does not have a `MARGIN` option like `diversity`, we need to `r gloss$add("transpose")` the matrix with the `t` function.


## Ordination

The typical way beta diversity is plotted is using ordination.
Ordination is a way to display "high dimensional" data in a visible number of dimensions (2 to 3).
Our data is "high dimensional" because we have many samples with many OTUs and the abundance of each OTU can be considered a "dimension".
If we had only two species, we could make a scatter plot of their abundance in each sample and get an idea of how the samples differ.
With thousands of species, this is not possible.
Instead, ordination is used to try to capture the information in many dimensions by in a smaller number of new "artificial" dimensions.

```{r fig.width=7, fig.height=5}
mds <- metaMDS(beta_dist)
```

That transformed our beta diversity matrix into a set of coordinates in two dimensions, which are intended to capture the differences in the data.
However, it is in a format specific to `vegan`, so we will have to convert the data to a form that we can use for plotting.

```{r}
mds_data <- as.data.frame(mds$points)
```

To use the sample data in the plotting, we can combine the coordinate data with the sample data table:

```{r}
mds_data$SampleID <- rownames(mds_data)
mds_data <- dplyr::left_join(mds_data, sample_data)
```

Now that we have the data in a format ggplot2 likes, we can plot it.
Lets plot our two new dimensions and color them by sample type (i.e. leaves vs roots).

```{r fig.width=7, fig.height=5}
library(ggplot2)
ggplot(mds_data, aes(x = MDS1, y = MDS2, color = Type)) +
  geom_point()
```

This shows that leaf and root samples are quite distinct, as we would expect.
We can also color them by Site:

```{r fig.width=7, fig.height=5}
ggplot(mds_data, aes(x = MDS1, y = MDS2, color = Site)) +
  geom_point()
```

It appears that within the leaf and root clusters, we "sub-clusters" corresponding to site.
Finally, lets look at genotype:

```{r fig.width=7, fig.height=5}
ggplot(mds_data, aes(x = MDS1, y = MDS2, color = Genotype)) +
  geom_point()
```

There is no discernible pattern there, suggesting plant genotype does not correspond to community structure.

We can also do the above quickly in `phyloseq` using the `ordinate` and `plot_ordination` functions.
Lets look at the differences between leaf and root samples again, but using a difference index this time.

```{r fig.width=7, fig.height=5}
ps_ord <- ordinate(ps_obj, method = "NMDS", distance = "jsd")
plot_ordination(ps_obj, ps_ord, type = "samples", color = "Type")
```

## Differential heat trees

Beta diversity summarizes the difference between two samples in a single number, but does not tell you why the samples are different.
We developed a plotting technique we call "differential heat trees" to display differences in abundance of each taxon in two samples or groups of samples.
These are like the heat trees shown in the plotting section, but colored with a diverging color scale the indicates which sample each taxon is more abundant in and by how much.
In this way, it is similar to heat maps used for the same purpose in gene expression studies.

### Comparing taxon abundance in two groups

Before we can make a differential heat tree, we need to calculate the difference in abundance for each taxon between groups of samples, such as root vs leaf samples.
There are many ways this can be done, ranging from simple differences in mean read counts, to outputs from specialized programs designed for microbiome data.
We will be using a function in `metacoder` called `compare_groups` to do the comparisons.
We can use the per-taxon read propotions we got with `calc_taxon_abund` eariler for this.
For each taxon, at every rank, the `compare_groups` function compares two groups of counts.
We have to define which sample belongs to which groups using `groups` option:

```{r}
obj$data$diff_table <- compare_groups(obj, data = "tax_abund",
                                      cols = sample_data$SampleID,
                                      groups = sample_data$Type)
print(obj$data$diff_table)
```

By default, `r gloss$add("Wilcoxon Rank Sum test")` is used to test for statistical significance of differences and various summary statistics are reported to measure the size of the difference.
Since we have done many independent tests (one for each taxon), we need to `r gloss$add('Multiple comparison corrections', shown = 'correct for multiple comparisions')`.
We will do that with a false discovery rate (FDR) correction, but other types can be used as well.

```{r}
obj <- mutate_obs(obj, "diff_table",
                  wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))
```

The most useful statistic for plotting is the log of ratio of median abundances in the two groups, since it is centered on 0 and is symmetric (e.g., a value of -4 is the same magnitude as 4).
Lets set any differences that are not significant to 0 so all differences shown on the plot are significant.

```{r}
obj$data$diff_table$log2_median_ratio[obj$data$diff_table$wilcox_p_value > 0.05] <- 0
```

Now we have all the info needed to make a differential heat tree.
The standard `heat_tree` function can be used, but a few things need to be done to make an effective differential heat tree:

* A diverging color scale should be used, preferably with a neutral color in the middle, like gray.
* The interval of values displayed must be symmetric around zero so the neutral middle color is centered on zero.
* The node color should be set to the log of ratio of median abundances in the two groups

```{r fig.width = 10, fig.height = 10}
set.seed(1)
heat_tree(obj, 
          node_label = taxon_names,
          node_size = n_obs, # number of OTUs
          node_color = log2_median_ratio, # difference between groups
          node_color_interval = c(-10, 10), # symmetric interval
          node_color_range = c("cyan", "gray", "magenta"), # diverging colors
          node_size_axis_label = "OTU count",
          node_color_axis_label = "Log 2 ratio of median counts")
```

That's not too bad looking, but we can tweak it a bit to make it better by changing the layout, adding a title, and filtering out some of the taxa with odd names (depending on what the plot is for, you might not want to remove these).
Here `mutate_obs` is used to add a temporary variable to the taxmap object that will contain the taxon names with special characters like `[` removed.

```{r fig.width = 10, fig.height = 10}
set.seed(1)
obj %>%
  mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
  metacoder::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$")) %>%
  heat_tree(node_label = cleaned_names,
            node_size = n_obs, # number of OTUs
            node_color = log2_median_ratio, # difference between groups
            node_color_interval = c(-10, 10), # symmetric interval
            node_color_range = c("cyan", "gray", "magenta"), # diverging colors
            node_size_axis_label = "OTU count",
            node_color_axis_label = "Log 2 ratio of median counts",
            layout = "da", initial_layout = "re", # good layout for large trees
            title = "leaf vs root samples")
```

What color corresponds to each group depends on the order they were given in the `compare_groups` function.
Since "leaf" is "treatment_1" in the "diff_table", and "log2_median_ratio" is defined as "log2(treatment_1 / treatment_2)", when a taxon has more counts in leaf samples, the ratio is positive, therefore taxa more abundant in leafs are colored magenta in this case.

### Comparing taxon abundance with more than 2 groups

For pair-wise comparisons with more than two groups we have developed a graphing technique we call a heat tree matrix, in which trees are made for all pairs of sample groupings and arranged in a matrix with a larger key tree that has the taxon names and a legend.
The code to do this is similar to the code for making a single differential heat tree like we did above, but uses the `heat_tree_matrix` function.
First we need to use `compare_groups` to generate data for all pair-wise comparisons for a grouping with more than two treatments.
The code below compares all the sites used in this study to eachother:

```{r}
obj$data$diff_table <- compare_groups(obj, data = "tax_abund",
                                      cols = sample_data$SampleID,
                                      groups = sample_data$Site)
print(obj$data$diff_table)
```

We then need to correct for multiple comparisons and set non-significant differences to zero like we did before:

```{r}
obj <- mutate_obs(obj, "diff_table",
                  wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr"))

obj$data$diff_table$log2_median_ratio[obj$data$diff_table$wilcox_p_value > 0.05] <- 0
```

Finally we call the `heat_tree_matrix` command with the same options that would be used for a single tree.

```{r fig.width = 10, fig.height = 10}
obj %>%
  metacoder::filter_taxa(taxon_ranks == "o", supertaxa = TRUE, reassign_obs = FALSE) %>%
  mutate_obs("cleaned_names", gsub(taxon_names, pattern = "\\[|\\]", replacement = "")) %>%
  metacoder::filter_taxa(grepl(cleaned_names, pattern = "^[a-zA-Z]+$"), reassign_obs = FALSE) %>%
  heat_tree_matrix(data = "diff_table",
                   node_label = cleaned_names,
                   node_size = n_obs, # number of OTUs
                   node_color = log2_median_ratio, # difference between groups
                   node_color_trans = "linear",
                   node_color_interval = c(-3, 3), # symmetric interval
                   edge_color_interval = c(-3, 3), # symmetric interval
                   node_color_range = diverging_palette(), # diverging colors
                   node_size_axis_label = "OTU count",
                   node_color_axis_label = "Log 2 ratio of median counts",
                   layout = "da", initial_layout = "re",
                   key_size = 0.67,
                   seed = 2)
```

We have only compared three groups here, due to the nature of this dataset, so this technique is not much better than 3 separate, full sized graphs in this case, but with more groups it can be a uniquely effective way to show lots of comparisons.
See the end of the [example analysis](02--quick_example.html) for a better example of this technique.


```{r include=FALSE}
save(ps_obj, obj, sample_data, file = "diversity_data.Rdata")
```

## Exercises

In these exercises, we will be using the `ps_obj` and `obj` from the analysis above.
If you did not run the code above or had problems, run the following code to get the objects used.
You can download the "diversity_data.Rdata" file <a href="diversity_data.Rdata" download="diversity_data.Rdata">here</a>.

```{r}
load("diversity_data.Rdata")
```


**1a)** Look at the documentation for the `plot_richness` function from `phyloseq`. Try to make a plot using only the Simpson and inverse Simpson indexes, colored by site and split up by sample type (leaf vs root).

```{r hide_button = TRUE}
plot_richness(ps_obj, color = "Site", x = "Genotype", measures = c("Simpson", "InvSimpson"))
```

**1b)** The Simpson and inverse Simpson indexes display the same information in different ways (if you know one, you can calculate the other). How do they differ?

```{asis hide_button = "Show Answer"}
The Simpson index is bound between 0 and 1, wheras the inverse Simpson index is greater than zero. Also, the inverse Simpson index tends to spread out higher diversity values more. 
```

**2)** Rarefaction and converting counts to proportions are two ways of accounting for unequal sample depth. Although proportions are more intuitive and easier to understand, why might rarefaction be better when calculating diveristy indexes?

```{asis hide_button = "Show Answer"}
Some diversity indexes take into account the number of species. This especially important for those that use presence/absence instead of abundance, such as the Sørensen and Jaccard indexes. When read counts are rarefied, some low counts go to zero, effectively reducing the number of species in the sample, whereas converting to proportions will never make a count go to zero, so will not remove the effect of unequal sampling depth on these indexes.
```

**3a)** Using the techniques presented in the section on plotting the abundance of taxa, make a plot of taxon abundance for the site encoded "Jam".

```{r hide_button = TRUE}
obj$data$tax_abund <- calc_taxon_abund(obj, "otu_props")

obj$data$site_abund <- calc_group_mean(obj, "tax_abund",
                                       cols = sample_data$SampleID,
                                       groups = sample_data$Site)

obj %>%
  metacoder::filter_taxa(Jam > 0) %>% # metacoder:: needed because of phyloseq::filter_taxa
  heat_tree(node_label = taxon_names,
            node_size = Jam, 
            node_color = Jam, 
            layout = "da", initial_layout = "re", 
            title = "Taxa in leafs")
```

**3b)** What are some of the most abundant families in the site "Jam"?

```{asis hide_button = "Show Answer"}
Sphingobacteriaceae, Chitinophagaceae, Sphingomonadaceae, Comamonadaceae, Microbacteriaceae, etc..
```

**3c)** What is the most abundant phylum in the site "Jam"?

```{asis hide_button = "Show Answer"}
Proteobacteria
```


**4a)** The ordination analysis colored by genotype showed no community differences corresponding to genotype. Try to use `compare_groups` to see if any taxa are significantly differentially abundant between genotypes to verify this result.

```{r hide_button = TRUE}
library(dplyr)
obj %>%
  compare_groups(data = "tax_abund",
                 cols = sample_data$SampleID,
                 groups = sample_data$Genotype) %>% 
  mutate(wilcox_p_value = p.adjust(wilcox_p_value, method = "fdr")) %>%
  filter(wilcox_p_value < 0.05)
```

**4b)** How many taxa have significant differences between genotypes?

```{asis hide_button = "Show Answer"}
None
```


## References