-
Notifications
You must be signed in to change notification settings - Fork 0
/
interstitial_report3.Rmd
214 lines (142 loc) · 6.74 KB
/
interstitial_report3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
---
title: "scRNAseq Interstitial cells"
author: "Marion Hardy"
date: "`r Sys.Date()`"
output:
html_document:
toc: yes
theme: spacelab
highlight: monochrome
editor_options:
chunk_output_type: console
markdown:
wrap: 72
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, cache = T, echo = FALSE, warning = F, cache.lazy = F)
knitr::opts_chunk$set(fig.width=10, fig.height=8)
library(tidyverse)
library(Seurat)
library(openxlsx)
```
# Introduction
**This explores the nneighbors parameter for the UMAP.**
Data from our collaborator Panagiotis rds file for a
SingleCellExperiment object containing the single cell data for **the
interstitial cells** of *Hydra Vulgaris* during multiples stages of
regeneration after bisection:
<https://www.dropbox.com/scl/fi/dg76sigjnj5u6qr06xk79/sce_interstitial_Juliano.rds?rlkey=f0y3lqt0wdcwq652zisnrpf9t&dl=0>
BUT they mapped it to *Hydra Magnipapillata (105)* "Drop-seq reads from
15 libraries generated for Hydra vulgaris strain AEP were mapped to the
2.0 genome assembly of closely related Hydra vulgaris strain 105
(available at <https://research.nhgri.nih.gov/hydra/>) and processed
using the Hydra 2.0 gene models. Strain Hydra vulgaris 105 was formerly
referred to as Hydra magnipapillata 105."
The coldata of the object contain cell annotation including
- quality metrics: nFeature nCount (not MT percentage interestingly)
- batch info: either 2869 (3162 barcodes), 3113 (10352 barcodes), 3271
(13279 barcodes), 3357 (3875 barcodes)
- originating experiments (head or foot regeneration)
- experimental time points
- pseudo-axis assignment (vals.axis ranging from 0-1, increasing in
the foot-tentacle direction)
- mitotic and apoptotic signatures indices from 0 to 1
The rowdata contains gene annotation, using Entrez-gene identifiers. I
have also noticed that in the sce objects there's
- PCA, tSNE and UMAP coordinates for reduced dimensions + corrected
for batch values
- assay metafeatures hold gene_id, product, gene, is.rib.prot.gene
(T/F), HypoMarkers (T/F), ccyle (T/F), apopt (T/F) etc
I converted the sce objects 6.6gb into a seurat object 2.2 gb and
checked that all cited parameters could be found in it.
# Data loading
```{r}
# int = readRDS("./data/sce_interstitial_Juliano.rds")
# seurat = as.Seurat(int, counts = "counts", data=NULL)
seurat = readRDS("data/seurat.Rds")
# Update the gene names in the count matrix
# I updated the gene names when they were determined to be gene of interest
# in downstream analyses + cell type markers
annot = read.xlsx("./data/mcbi_dataset_MH_annotated.xlsx")
annot$Symbol[] <- lapply(annot$Symbol, function(x) sub("LOC", "", x, fixed = TRUE)) # removing the LOC prefix
annot$Symbol_updated[] <- lapply(annot$Symbol_updated, function(x) sub("LOC", "", x, fixed = TRUE)) # removing the LOC prefix
annot$Symbol = unlist(annot$Symbol) # unlisting it to look like a mormal column
annot$Symbol_updated = unlist(annot$Symbol_updated)
temp = data.frame(orig_gene = rownames(seurat))
temp2 = data.frame(updated = annot$Symbol_updated,
orig_gene = annot$Symbol) %>%
filter(!duplicated(updated),!duplicated(orig_gene))
# if temp2$updated is NA, fill with temp
temp3 = left_join(temp, temp2) # now temp3 is ordered like the genes in seurat
temp3 =
temp3 %>%
mutate(updated = coalesce(updated,orig_gene)) #works
seurat@assays$originalexp@counts@Dimnames[[1]] <- temp3$updated
seurat@assays$originalexp@data@Dimnames[[1]] <- temp3$updated
seurat@[email protected] = annot
```
```{r, echo=TRUE}
head([email protected])
table([email protected]$batch, [email protected]$EXP_TIME)
```
# QC
## No filtering applied
The gene names looking like LOC1248.... are actually all symbols of
unnamed regions If you go through the list of genes seen, none of them
are mitochondrial but they do have descriptions
- This data probably should be aligned on the matching genome
- Why are there no mitochondrial genes in the Hydra_05_v3 construct?
Annotations can be found here:
<https://www.ncbi.nlm.nih.gov/datasets/gene/GCF_022113875.1/>
This means that I cannot regress out the variability that comes from the
presence of mitochondrial genes since it seems like there aren't any. I
can still regress out a possible batch effect though.
```{r, fig.width=8, fig.height=5}
# Here, we filter away cells that have unique feature counts(genes) over 6000
# or less than 200. Too many features could indicate a doublet too little might indicate an empty droplet
# We should also filter away cells that have > 10% mitochondrial counts
seurat_f = subset(seurat, subset = nFeature_originalexp > 200 & nFeature_originalexp < 6000)
```
# Checking optimal number of neighbors
We want to check if these very thin clusters are true population sor just the effect of unoptimized number of neighbors during UMAP projection.
## PCA
```{r, fig.width=4, fig.height=4, echo=FALSE}
seurat_f = SCTransform(seurat_f, assay = "originalexp", verbose = F)
# These are now standard steps in the Seurat workflow for visualization and clustering
Idents(seurat_f) = [email protected]$batch
seurat_f <- RunPCA(seurat_f, features = VariableFeatures(object = seurat_f))
DimPlot(seurat_f, reduction = "pca")+
labs(title = "PCA with batch legend")
```
## UMAP
```{r, fig.width=12, fig.height=5}
Idents(seurat_f) = [email protected]$EXP_TIME
seurat_f <- RunUMAP(seurat_f, dims = 1:28, seed.use = 054057, n.neighbors = 5)
p1 = DimPlot(seurat_f, reduction = "umap")+
labs(title = "n neighbors = 5")
seurat_f <- RunUMAP(seurat_f, dims = 1:28, seed.use = 054057, n.neighbors = 10)
p2 = DimPlot(seurat_f, reduction = "umap")+
labs(title = "n neighbors = 10")
seurat_f <- RunUMAP(seurat_f, dims = 1:28, seed.use = 054057, n.neighbors = 50)
p3 = DimPlot(seurat_f, reduction = "umap")+
labs(title = "n neighbors = 50")
seurat_f <- RunUMAP(seurat_f, dims = 1:28, seed.use = 054057, n.neighbors = 100)
p4 = DimPlot(seurat_f, reduction = "umap")+
labs(title = "n neighbors = 100")
p1+p2
p3+p4
```
```{r, fig.width=7, fig.height=5}
seurat_f <- FindNeighbors(seurat_f, dims = 1:28, n.neighbors = 50)
seurat_f <- FindClusters(seurat_f, resolution = 0.025)
DimPlot(seurat_f, label = TRUE)+
labs(title = "Resolution = 0.025")
```
```{r, eval = FALSE}
write_rds(seurat_f, "./data_output/interstitial/seurat_filtered.rds")
```
Here's what, based on the PCA we **are** expecting a central cluster tapering off into four different populations + another separate cluster. And that's what we see whatever the number of neighbors.
# Session info
```{r}
sessionInfo()
```