-
Notifications
You must be signed in to change notification settings - Fork 0
/
interstitial_report2_cluster_attribution.Rmd
286 lines (189 loc) · 8.65 KB
/
interstitial_report2_cluster_attribution.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
---
title: "scRNAseq Interstitial cells : cluster attribution"
author: "Marion Hardy"
date: "`r Sys.Date()`"
output:
html_document:
toc: yes
theme: spacelab
highlight: monochrome
editor_options:
chunk_output_type: console
markdown:
wrap: 72
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, cache = T, echo = FALSE, warning = F, cache.lazy = F)
knitr::opts_chunk$set(fig.width=6, fig.height=4)
library(tidyverse)
library(openxlsx)
library(Seurat)
library(scCustomize)
```
# Introduction
Data from our collaborator Panagiotis rds file for a
SingleCellExperiment object containing the single cell data for **the
interstitial cells** of *Hydra Vulgaris* during multiples stages of
regeneration after bisection:
<https://www.dropbox.com/scl/fi/dg76sigjnj5u6qr06xk79/sce_interstitial_Juliano.rds?rlkey=f0y3lqt0wdcwq652zisnrpf9t&dl=0>
BUT they mapped it to *Hydra Magnipapillata (105)* "Drop-seq reads from
15 libraries generated for Hydra vulgaris strain AEP were mapped to the
2.0 genome assembly of closely related Hydra vulgaris strain 105
(available at <https://research.nhgri.nih.gov/hydra/>) and processed
using the Hydra 2.0 gene models. Strain Hydra vulgaris 105 was formerly
referred to as Hydra magnipapillata 105."
The coldata of the object contain cell annotation including
- quality metrics: nFeature nCount (not MT percentage interestingly)
- batch info: either 2869 (3162 barcodes), 3113 (10352 barcodes), 3271
(13279 barcodes), 3357 (3875 barcodes)
- originating experiments (head or foot regeneration)
- experimental time points
- pseudo-axis assignment (vals.axis ranging from 0-1, increasing in
the foot-tentacle direction)
- mitotic and apoptotic signatures indices from 0 to 1
The rowdata contains gene annotation, using Entrez-gene identifiers. I
have also noticed that in the sce objects there's
- PCA, tSNE and UMAP coordinates for reduced dimensions + corrected
for batch values
- assay metafeatures hold gene_id, product, gene, is.rib.prot.gene
(T/F), HypoMarkers (T/F), ccyle (T/F), apopt (T/F) etc
I converted the sce objects 6.6gb into a seurat object 2.2 gb and
checked that all cited parameters could be found in it.
# Data loading
Which dataset set should I use? Batch regressed looks weird but might be interesting depending on how timepoints end up clustering...
Let's make a UMAP with timepoints as labels.
I am using the scaled and batch regressed data.
```{r}
# seurat objects
seurat_f = readRDS("data_output/interstitial/seurat_filtered.rds")
seurat_r = readRDS("data_output/interstitial/seurat_filtered_regressed.rds")
# 105 genome annotations
annot = read.xlsx("./data/mcbi_dataset_MH_annotated.xlsx")
annot$Symbol_updated = as.character(annot$Symbol_updated)
# set up to split the object per head or foot
footcond = c("REG_FOOT_t0","REG_FOOT_t6","REG_FOOT_t12","REG_FOOT_t24",
"REG_FOOT_t48","REG_FOOT_t96")
seurat_r$axis <- ifelse(test = seurat_r$EXP_TIME %in% footcond, yes = "Foot", no = "Head")
# changing labeled time from t6 to t06 so thqt it orders properly in the plots
Idents(seurat_r) = "EXP_TIME"
seurat_r = RenameIdents(seurat_r, "REG_FOOT_t6" = "REG_FOOT_t06")
seurat_r = RenameIdents(seurat_r, "REG_HEAD_t6" = "REG_HEAD_t06")
Idents(seurat_f) = "EXP_TIME"
seurat_f = RenameIdents(seurat_f, "REG_FOOT_t6" = "REG_FOOT_t06")
seurat_f = RenameIdents(seurat_f, "REG_HEAD_t6" = "REG_HEAD_t06")
```
# Subsetting by head/foot and timepoint for both regressed and unregressed
## Not batch regressed
```{r}
Idents(seurat_f) = "EXP_TIME"
DimPlot(seurat_f, label = FALSE)
# Creating the head and foot data subset
hfootf = subset(seurat_f, subset = EXP == "REG_FOOT")
hheadf = subset(seurat_f, subset = EXP == "REG_HEAD")
```
```{r, fig.width=30, fig.height=6}
# Plotting conditions separately
Idents(hfootf) = [email protected]$SCT_snn_res.0.025
Idents(hheadf) = [email protected]$SCT_snn_res.0.025
DimPlot(hfootf, reduction = "umap", split.by = "EXP_TIME", pt.size = 2)
DimPlot(hheadf, reduction = "umap", split.by = "EXP_TIME", pt.size = 2)
```
We don't really see time-dependent clustering here.
## Batch regressed
```{r}
Idents(seurat_r) = "EXP_TIME"
DimPlot(seurat_r, label = FALSE)
# Creating the head and foot data subset
hfoot = subset(seurat_r, subset = axis == "Foot")
hhead = subset(seurat_r, subset = axis == "Head")
```
```{r, fig.width=30, fig.height=6}
# Plotting conditions separately
Idents(hfoot) = [email protected]$SCT_snn_res.0.025
Idents(hhead) = [email protected]$SCT_snn_res.0.025
DimPlot(hfoot, reduction = "umap", split.by = "EXP_TIME", pt.size = 2)
DimPlot(hhead, reduction = "umap", split.by = "EXP_TIME", pt.size = 2)
```
There's much more time dependent-clustering happening here.
I will move forward with the non regressed data unless told otherwise because the clustering at higher resolutions do not indicate that later timepoints have their own clusters so regressing the batch might have induced artefact?
# Cluster attribution
## UMAP at 0.025 resolution
```{r, fig.width=8, fig.height=6}
Idents(seurat_f) = "SCT_snn_res.0.025"
DimPlot(seurat_f, label = TRUE)+
labs(title = "Resolution = 0.025")
```
## Finding markers per clusters
```{r, eval=FALSE}
# Find markers for each clusters------------------------------------------------
# With the ` FindAllMarkers()` function we are comparing
# each cluster against all other clusters to identify potential marker genes.
# The cells in each cluster are treated as replicates, and essentially a
# differential expression analysis is performed with some statistical test.
# By default, it's a wilcoxon rank sum test
# Seurat lab suggested that DE analysis on obj@assay$RNA@data, which is normalized data (not scaled). So you are right, you should NormalizeData and then FindMarkers. However, Seurat 4 (not sure which small version exactly) starts to promotes DE analysis on obj@[email protected], which is person residual. My experience is to try both and take the one that fits your goal because Seurat said both are not incorrect.
# DO NOT RUN THIS COMMAND IF YOU'VE ALREADY DONE IT ONCE, THIS CHUNK IS EVAL = FALSE
markers = FindAllMarkers(object = seurat_f,
logfc.threshold = 0.5,
assay = "SCT",
slot = "scale.data")
```
```{r}
markers = read.csv("./data_output/interstitial/markers_diff_per_cluster.csv")%>%
filter(p_val_adj < 0.05)
# write.csv(markers, "./data_output/interstitial/markers_diff_per_cluster.csv")
markers$gene = as.character(markers$gene)
```
### Markers that seem specific to cluster identity.
```{r}
head(markers)
```
```{r, , fig.width=15, fig.height=10}
target = markers %>%
group_by(cluster) %>%
top_n(n = 5, wt = avg_diff)
target =
target %>%
filter(!duplicated(gene))
target = target[!duplicated(target$gene),]
# Sooo turns out i'm gonna have to manually give these important genes symbols
# so they become readable on a graph
dp = DotPlot_scCustom(seurat_f, features = target$gene,
colors_use = viridis_plasma_dark_high,
x_lab_rotate = T)+
labs(title = "Dotplot for top 5 differentially expressed genes per cluster")
dp
```
```{r, warning=TRUE, fig.width=5, fig.height=4}
theo = read.xlsx("./data/mcbi_dataset_MH_annotated.xlsx", sheet = "celltype_markers")
# Sooo turns out i'm gonna have to manually give these important genes symbols
# so they become readable on a graph
dp = DotPlot_scCustom(seurat_f, features = theo$Symbol,
colors_use = viridis_plasma_dark_high,
x_lab_rotate = T)+
labs(title = "Dotplot for theoretical markers")
dp
```
```{r, fig.height=10, fig.width=8}
p1 = Clustered_DotPlot(seurat_f, features = target$gene, k = 12)
print(p1[[2]])
```
```{r, fig.height=25, fig.width=15}
target <- c("NSP4","midasin","mini-COL8","SP-D-like","FH20-X3","CAII",
"CELA3B","zinc-carboxypeptidase","ANO39","TYN1","H2A.2.2","nas2-X2",
"Lwamide-X1","DMRT1","TUBA4A-X1","DMRT1","HTRA3","polRF-X1",
"nop58-X1","OTP","MEP1A","rhammosyl-O-methyltransferase","hywnt3",
"PPOD1","ks1","hyAlx","ELAV2","POU4","MUC2")
FeaturePlot(object = seurat_f,
features = target,
reduction = "umap",
order = TRUE,
min.cutoff = 'q10',
label = TRUE,
repel = TRUE,
pt.size = 0.75)
```
# Session info
```{r}
sessionInfo()
```