bone_marrow/analyze_bone_marrow.Rmd

---
title: "HCA - Bone Marrow Analysis"
output: github_document
---

# Packages and Shared Code

```{r}
suppressPackageStartupMessages(library(rhdf5))
suppressPackageStartupMessages(library(DarkKinaseTools))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(BerginskiRMisc))
suppressPackageStartupMessages(library(here))
```

```{r shared_functions}
source(here('shared_functions.R'))
```

# Data Download

```{r downloading}
if (! file.exists(here('bone_marrow','ica_bone_marrow_h5.h5'))) {
  download.file('https://s3.amazonaws.com/preview-ica-expression-data/ica_bone_marrow_h5.h5',
                here('bone_marrow','ica_bone_marrow_h5.h5'))
}
```

# Data Loading

Loading this data takes some time and lots of memory (> 16 Gigs).
```{r loading}
bone_marrow = read_h5_file_to_tidy(here('bone_marrow','ica_bone_marrow_h5.h5')) %>%
  mutate(patient = gsub(".*(BM[[:digit:]]).*","\\1",barcode))
```


# Exploritory Data Analysis

Let's get some quick summaries of the full data set, namely the total number of mapped reads per cell and the total number of unique genes per cell.
```{r summary_read_count}
per_cell_summary  = bone_marrow %>%
  group_by(barcode) %>%
  summarise(read_count = sum(counts),
            unique_genes = n())

ggplot(per_cell_summary,aes(x=read_count)) + 
  geom_histogram(binwidth=20) + 
  theme_berginski() +
  labs(x='Read Count Sum Per Cell',y='')
```

So there is a fair bit of skew in this data past the apparent double peak at the lower end of the disitribution. Let's take a closer look at the lower end of the distribution:

```{r read_count_zoom}
ggplot(per_cell_summary,aes(x=read_count)) + 
  geom_histogram(binwidth=50) + 
  theme_berginski() +
  labs(x='Read Count Sum Per Cell',y='') +
  xlim(c(0,10000)) +
  geom_vline(xintercept = 800,color='red')
```

There is a clear trough at 800 total mapped reads that I've marked out in red, that will be the lower threshold to toss out cells that didn't property adhere (maybe?) to the beads. The upper boundary doesn't seem as clear to me, but there is probably some barcode sets that represent multiple cells in a single droplet. How about a max read count of 7500?

What about the unique number of genes in each barcode:

```{r gene_count}
ggplot(per_cell_summary,aes(x=unique_genes)) + 
  geom_histogram(binwidth=50) + 
  theme_berginski() +
  labs(x='Unique Genes Per Cell',y='') +
  geom_vline(xintercept = 275, color='red')
```

I don't have a particularly compelling reason to filter on the number of unique genes found per cell, so I don't see any particular reason to filter on this value.

For good measure, let's also take a look at the binned scatterplot comparing the read counts and unique genes.

```{r}
ggplot(per_cell_summary,aes(x=unique_genes,y=read_count)) + 
  geom_hex() + theme_berginski() +
  labs(x='Unique Gene Count',y="Read Count Sum")
```

As expected, low unique gene counts somewhat line up with the read count sum.

# Data Filtering and Results

Let's run the filtering to remove the problematic barcodes and then get out the kinase reads.

```{r filtering}
bone_marrow_filt = bone_marrow %>% filter_single_cell(read_total_range = c(800,7500))

bone_marrow_kinases = bone_marrow_filt %>% filter(ensembl_gene_id %in% all_kinases$ensembl_gene_id) %>%
  left_join(all_kinases)

bone_marrow_dark_kinases = bone_marrow_kinases %>% filter(class == "Dark")
bone_marrow_light_kinases = bone_marrow_kinases %>% filter(class == "Light")

per_kinase_stats = bone_marrow_kinases %>% group_by(gene_names,class) %>%
  summarise(total_reads = sum(counts), 
            total_cells = length(unique(barcode)), 
            read_variance = var(counts))

per_cell_stats = bone_marrow_kinases %>% group_by(barcode,class) %>%
  summarise(num_kinases = length(unique(gene_names)),
            total_kinase_reads = sum(counts))
```

## Summary Results

Now let's run through some basic results:

* `r length(unique(bone_marrow_kinases$ensembl_gene_id))` unique kinases are represented in the data
* `r round(100*sum(bone_marrow_kinases$counts)/sum(bone_marrow_filt$counts),2)`% of the mapped reads are to a kinase
* `r round(100*length(unique(bone_marrow_kinases$barcode))/length(unique(bone_marrow_filt$barcode)))`% of the cells express at least one kinase
* `r round(100*length(unique(bone_marrow_dark_kinases$barcode))/length(unique(bone_marrow_filt$barcode)))`% of the cells express a dark kinase


```{r sanity_checking, eval=FALSE, echo=FALSE}
ggplot(bone_marrow_filt_per_cell_summary,aes(x=unique_genes,y=read_count)) + 
  geom_hex() + theme_berginski() +
  labs(x='Unique Gene Count',y="Read Count Sum")
```

## Summary Distributions

### Per Kinase Plots

I've summed up the total number of reads for each of the kinases across all of the cells in the filtered data set. Let's take a look at the distribution of the summed read counts splitting on the Dark/Light kinase sets.

```{r}
ggplot(per_kinase_stats,aes(x=total_reads,y=stat(density),color=class, fill=class)) + 
  geom_histogram(position="dodge") + theme_berginski() +
  labs(color="Kinase\nType",fill="Kinase\nType",x="Total Reads",y="Density")
```

I've also figured out how many cells express each given kinase. For reference, there are `r length(unique(bone_marrow_filt$barcode))` cells in the data set.

```{r}
ggplot(per_kinase_stats,aes(x=total_cells,y=stat(density),color=class, fill=class)) + 
  geom_histogram(position="dodge") + theme_berginski() +
  labs(color="Kinase\nType",fill="Kinase\nType",x="Total Cell Count",y="Density")
```

### Per Cell Plots

Another question involves looking at the expressed kinases per cell, namely how many of the kinases are present in each cell. For reference, the median cell expresses `r median(per_cell_stats$num_kinases)` kinases.

```{r}
ggplot(per_cell_stats,aes(x=num_kinases,y=stat(density))) + 
  geom_histogram(position="dodge",binwidth=1) + theme_berginski() +
  labs(color="Kinase\nType",fill="Kinase\nType",x="Number of Kinases Per Cell",y="Density")
```

OK, what about the Dark/Light split? The following plot is the same as above, except with the dark/light kinases split into seperate columns. For reference, the median number of dark kinases per cell is `r median(filter(per_cell_stats,class=="Dark")$num_kinases)`, while the median number of light kinases is `r median(filter(per_cell_stats,class=="Light")$num_kinases)`.

```{r}
ggplot(per_cell_stats,aes(x=num_kinases,y=stat(density),color=class,fill=class)) + 
  geom_histogram(position="dodge",binwidth=1) + theme_berginski() +
  labs(color="Kinase\nType",fill="Kinase\nType",x="Number of Kinases Per Cell",y="Density")
```

### Most Expressed Dark Kinases

```{r, echo=F,results='asis'}
top_dark_kinases = per_kinase_stats %>% 
  filter(class == "Dark") %>% 
  arrange(desc(total_reads))

knitr::kable(top_dark_kinases[1:10,])
```

### Most Expressed Light Kinases
```{r, echo=F,results='asis'}
top_light_kinases = per_kinase_stats %>% 
  filter(class == "Light") %>% 
  arrange(desc(total_reads))

knitr::kable(top_light_kinases[1:10,])
```

## Kinase Correlations

I wrote a bit of code to do cross-correlation analysis on the expression levels of each of the kinases compared with one another, but I don't really know what to do with it. It's also time consuming, so I've commented it out.

```{r}
# kinase_correlations = correlate_single_cell_read_counts(bone_marrow_kinases)
# 
# saveRDS(kinase_correlations,here('bone_marrow','bone_marrow_kinase_correlations.rds'))
```