-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathanalyze_bone_marrow.Rmd
192 lines (141 loc) · 7.32 KB
/
analyze_bone_marrow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
---
title: "HCA - Bone Marrow Analysis"
output: github_document
---
# Packages and Shared Code
```{r}
suppressPackageStartupMessages(library(rhdf5))
suppressPackageStartupMessages(library(DarkKinaseTools))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(BerginskiRMisc))
suppressPackageStartupMessages(library(here))
```
```{r shared_functions}
source(here('shared_functions.R'))
```
# Data Download
```{r downloading}
if (! file.exists(here('bone_marrow','ica_bone_marrow_h5.h5'))) {
download.file('https://s3.amazonaws.com/preview-ica-expression-data/ica_bone_marrow_h5.h5',
here('bone_marrow','ica_bone_marrow_h5.h5'))
}
```
# Data Loading
Loading this data takes some time and lots of memory (> 16 Gigs).
```{r loading}
bone_marrow = read_h5_file_to_tidy(here('bone_marrow','ica_bone_marrow_h5.h5')) %>%
mutate(patient = gsub(".*(BM[[:digit:]]).*","\\1",barcode))
```
# Exploritory Data Analysis
Let's get some quick summaries of the full data set, namely the total number of mapped reads per cell and the total number of unique genes per cell.
```{r summary_read_count}
per_cell_summary = bone_marrow %>%
group_by(barcode) %>%
summarise(read_count = sum(counts),
unique_genes = n())
ggplot(per_cell_summary,aes(x=read_count)) +
geom_histogram(binwidth=20) +
theme_berginski() +
labs(x='Read Count Sum Per Cell',y='')
```
So there is a fair bit of skew in this data past the apparent double peak at the lower end of the disitribution. Let's take a closer look at the lower end of the distribution:
```{r read_count_zoom}
ggplot(per_cell_summary,aes(x=read_count)) +
geom_histogram(binwidth=50) +
theme_berginski() +
labs(x='Read Count Sum Per Cell',y='') +
xlim(c(0,10000)) +
geom_vline(xintercept = 800,color='red')
```
There is a clear trough at 800 total mapped reads that I've marked out in red, that will be the lower threshold to toss out cells that didn't property adhere (maybe?) to the beads. The upper boundary doesn't seem as clear to me, but there is probably some barcode sets that represent multiple cells in a single droplet. How about a max read count of 7500?
What about the unique number of genes in each barcode:
```{r gene_count}
ggplot(per_cell_summary,aes(x=unique_genes)) +
geom_histogram(binwidth=50) +
theme_berginski() +
labs(x='Unique Genes Per Cell',y='') +
geom_vline(xintercept = 275, color='red')
```
I don't have a particularly compelling reason to filter on the number of unique genes found per cell, so I don't see any particular reason to filter on this value.
For good measure, let's also take a look at the binned scatterplot comparing the read counts and unique genes.
```{r}
ggplot(per_cell_summary,aes(x=unique_genes,y=read_count)) +
geom_hex() + theme_berginski() +
labs(x='Unique Gene Count',y="Read Count Sum")
```
As expected, low unique gene counts somewhat line up with the read count sum.
# Data Filtering and Results
Let's run the filtering to remove the problematic barcodes and then get out the kinase reads.
```{r filtering}
bone_marrow_filt = bone_marrow %>% filter_single_cell(read_total_range = c(800,7500))
bone_marrow_kinases = bone_marrow_filt %>% filter(ensembl_gene_id %in% all_kinases$ensembl_gene_id) %>%
left_join(all_kinases)
bone_marrow_dark_kinases = bone_marrow_kinases %>% filter(class == "Dark")
bone_marrow_light_kinases = bone_marrow_kinases %>% filter(class == "Light")
per_kinase_stats = bone_marrow_kinases %>% group_by(gene_names,class) %>%
summarise(total_reads = sum(counts),
total_cells = length(unique(barcode)),
read_variance = var(counts))
per_cell_stats = bone_marrow_kinases %>% group_by(barcode,class) %>%
summarise(num_kinases = length(unique(gene_names)),
total_kinase_reads = sum(counts))
```
## Summary Results
Now let's run through some basic results:
* `r length(unique(bone_marrow_kinases$ensembl_gene_id))` unique kinases are represented in the data
* `r round(100*sum(bone_marrow_kinases$counts)/sum(bone_marrow_filt$counts),2)`% of the mapped reads are to a kinase
* `r round(100*length(unique(bone_marrow_kinases$barcode))/length(unique(bone_marrow_filt$barcode)))`% of the cells express at least one kinase
* `r round(100*length(unique(bone_marrow_dark_kinases$barcode))/length(unique(bone_marrow_filt$barcode)))`% of the cells express a dark kinase
```{r sanity_checking, eval=FALSE, echo=FALSE}
ggplot(bone_marrow_filt_per_cell_summary,aes(x=unique_genes,y=read_count)) +
geom_hex() + theme_berginski() +
labs(x='Unique Gene Count',y="Read Count Sum")
```
## Summary Distributions
### Per Kinase Plots
I've summed up the total number of reads for each of the kinases across all of the cells in the filtered data set. Let's take a look at the distribution of the summed read counts splitting on the Dark/Light kinase sets.
```{r}
ggplot(per_kinase_stats,aes(x=total_reads,y=stat(density),color=class, fill=class)) +
geom_histogram(position="dodge") + theme_berginski() +
labs(color="Kinase\nType",fill="Kinase\nType",x="Total Reads",y="Density")
```
I've also figured out how many cells express each given kinase. For reference, there are `r length(unique(bone_marrow_filt$barcode))` cells in the data set.
```{r}
ggplot(per_kinase_stats,aes(x=total_cells,y=stat(density),color=class, fill=class)) +
geom_histogram(position="dodge") + theme_berginski() +
labs(color="Kinase\nType",fill="Kinase\nType",x="Total Cell Count",y="Density")
```
### Per Cell Plots
Another question involves looking at the expressed kinases per cell, namely how many of the kinases are present in each cell. For reference, the median cell expresses `r median(per_cell_stats$num_kinases)` kinases.
```{r}
ggplot(per_cell_stats,aes(x=num_kinases,y=stat(density))) +
geom_histogram(position="dodge",binwidth=1) + theme_berginski() +
labs(color="Kinase\nType",fill="Kinase\nType",x="Number of Kinases Per Cell",y="Density")
```
OK, what about the Dark/Light split? The following plot is the same as above, except with the dark/light kinases split into seperate columns. For reference, the median number of dark kinases per cell is `r median(filter(per_cell_stats,class=="Dark")$num_kinases)`, while the median number of light kinases is `r median(filter(per_cell_stats,class=="Light")$num_kinases)`.
```{r}
ggplot(per_cell_stats,aes(x=num_kinases,y=stat(density),color=class,fill=class)) +
geom_histogram(position="dodge",binwidth=1) + theme_berginski() +
labs(color="Kinase\nType",fill="Kinase\nType",x="Number of Kinases Per Cell",y="Density")
```
### Most Expressed Dark Kinases
```{r, echo=F,results='asis'}
top_dark_kinases = per_kinase_stats %>%
filter(class == "Dark") %>%
arrange(desc(total_reads))
knitr::kable(top_dark_kinases[1:10,])
```
### Most Expressed Light Kinases
```{r, echo=F,results='asis'}
top_light_kinases = per_kinase_stats %>%
filter(class == "Light") %>%
arrange(desc(total_reads))
knitr::kable(top_light_kinases[1:10,])
```
## Kinase Correlations
I wrote a bit of code to do cross-correlation analysis on the expression levels of each of the kinases compared with one another, but I don't really know what to do with it. It's also time consuming, so I've commented it out.
```{r}
# kinase_correlations = correlate_single_cell_read_counts(bone_marrow_kinases)
#
# saveRDS(kinase_correlations,here('bone_marrow','bone_marrow_kinase_correlations.rds'))
```