-
Notifications
You must be signed in to change notification settings - Fork 0
/
prediction.Rmd
435 lines (331 loc) · 20.3 KB
/
prediction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
---
title: "Analysis of a systemic bleomycin model of scleroderma: Predicting lung function from skin expression"
author: "[Ron Ammar](mailto:[email protected])"
date: '`r format(Sys.time(), "%A, %B %e, %Y")`'
header-includes:
- \usepackage{fancyhdr}
- \usepackage{lastpage}
- \usepackage{pdflscape}
- \usepackage{titling}
- \pagestyle{fancy}
- \fancyhead[LE,RO]{}
- \rhead{\includegraphics[width=3.5cm]{bms_logo.jpg}}
- \fancyfoot[C]{\textit{BMS Confidential}}
- \fancyfoot[LE,RO]{\thepage\ of \pageref{LastPage}}
- \newcommand{\blandscape}{\begin{landscape}}
- \newcommand{\elandscape}{\end{landscape}}
- \pretitle{\begin{center}\LARGE\includegraphics[width=12cm]{bms_logo.jpg}\\[\bigskipamount]}
- \posttitle{\end{center}}
bibliography:
- ../../resources/references/google_scholar_library.bib
- ../../resources/references/online_resources.bib
- ../../resources/references/pubmed_export_with_cite_key_reformatted.bib
- ../../resources/references/r_citations.bib
- ../../resources/references/biorxiv_library.bib
- ../../resources/references/manually_edited_library.bib
csl: ../../resources/references/elsevier-harvard2.csl
output:
pdf_document:
toc: true
toc_depth: 4 # default is depth of 3
number_sections: true
---
```{r, echo=FALSE}
# Clear the current session, to avoid errors from persisting data structures
# NOTE: when using parameterized RMarkdown, we cannot remove the params declared
# in the YAML header.
rm(list=ls())
# Free up memory by forcing garbage collection
invisible(gc())
# Pretty printing in knitr
library(printr)
# Manually set the seed to an arbitrary number for consistency in reports
set.seed(1234, kind="Mersenne-Twister", normal.kind="Inversion")
# Do not convert character vectors to factors unless explicitly indicated
options(stringsAsFactors=FALSE)
startTime <- Sys.time()
```
```{r, global_options, include=FALSE}
# use include=FALSE to have the chunk evaluated, but neither the code nor its output displayed.
knitr::opts_chunk$set(echo=FALSE, message=FALSE, fig.align="center",
fig.width=12, fig.height=8,
fig.path='figs/',
dpi=300, # knitr default is 72
dev=c('pdf', 'png')) # output figures as both pdf and png
```
```{r, load_libraries_setwd}
library(caret)
library(doParallel)
library(ggplot2)
library(glmnet)
library(parallel)
library(tidyr)
library(dplyr) # attach dplyr environment last so it's first in search path
if ("rstudioapi" %in% installed.packages()[, "Package"] & rstudioapi::isAvailable() & interactive()) {
# When in RStudio, dynamically sets working directory to path of this script
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# Set the default ggplot theme to "theme_bw" with default font size
theme_set(theme_bw())
} else {
# Set the default ggplot theme to "theme_bw" with specified base font size
theme_set(theme_bw(20))
}
# Check if stash is mounted
if (length(list.files("/stash", all.files=TRUE, include.dirs=TRUE, no..=TRUE)) == 0) {
warning("/stash is not mounted")
}
##### Constants
NUM_CORES <- detectCores(logical=FALSE) # Use all *physical* cores
CV_FOLDS <- 5
CV_REPEATS <- 100 # See Harrell's comment: https://stats.stackexchange.com/q/61090/84115
##### Parallelization
cl <- makeCluster(NUM_CORES)
registerDoParallel(cl)
```
\newpage
# Background
Systemic sclerosis (SSc, the systemic form of scleroderma) is a chronic progressive disease characterized by three main features: vascular injury, immunological abnormalities, and fibrosis of the skin and various internal organs. Fibrosis is the hallmark feature of SSc and is responsible for a majority of the morbidity and mortality associated with this disease. SSc patients with significant internal organ involvement have a 10-year survivial rate of only 38% (Korman B et al. Curr Rheumatol Rep (2015) 17:21). The mechanism underlying the development of fibrosis remains unclear, and currently there are no effective therapies for this disease.
A well-characterized mouse model of SSc involves daily subcutaneous injections of the antitumor antibiotic bleomycin (BLM), which leads to localized dermal fibrosis as well as pulmonary fibrosis. In-house, we have termed this the “Systemic Bleomycin Model” to distinguish it from the already-established Intratracheal (IT) Model. Our lab plans to utilize this model, which mimics several key features of human SSc, to examine the pathological mechanisms underlying the development and progression of fibrosis in SSc. We have just established this model, and we have conducted preliminary analyses to characterize the model. As part of this characterization – and in an effort to achieve the objectives listed below – the current project aims to conduct a thorough transcriptomics analysis of skin and lung samples from these mice.
It is well established that the TGF-$\beta$ signaling pathway is required for bleomycin-induced fibrosis in this model. Thus, in the present study we used the ALK5 (TGF-$\beta$ receptor I) inhibitor SB525334 to determine whether we could cause regression of dermal and pulmonary fibrosis by inhibiting TGF-$\beta$ signaling.
# Purpose
The objectives of this study are to:
1. Characterize the development of dermal and pulmonary fibrosis in the systemic bleomycin mouse model.
1. **Compare & contrast lung and skin signatures to determine if a skin biopsy could function as a surrogate for a lung-based diagnostics (eg. CT scan).**
1. Characterize the (potential) resolution of dermal and pulmonary fibrosis in the systemic bleomycin mouse model.
1. Identify potential therapeutic targets in SSc.
1. Achieve better understanding the effects of the ALK5 inhibitor SB525334 in the lungs and skin.
1. In terms of pulmonary fibrosis, compare the results from this analysis to those obtained from Sumanta Mukherjee’s IT bleomycin study.
# Experimental Design
![](../../resources/experimental_design.png)
![](../../resources/group_and_animal_numbers.png)
Female C57BL/6 mice (~12 weeks old) were used for this study. An electric shaver was used to shave the interscapular region on the back of each mouse, and a non-toxic permanent marker was used to draw two small circles on the skin within the shaved area on each mouse. Fibrosis was induced by daily subcutaneous injections of either bleomycin (10mg/kg/day; 100µl per injection site of 1mg/ml solution for these ~20gm mice) or PBS (vehicle control) within the marked interscapular regions. Injections began on Day 0 and were done five times per week for two weeks, using a 27-gauge needle. (Bhattacharyya S. et al., Sci Transl Med 2014 Apr 16;6(232):232ra50; Huang J. et al., Ann Rheum Dis 2015 Apr 9. pii: annrheumdis-2014-207109).
Beginning on Day 12, the ALK5 inhibitor SB525334 (30mpk) or vehicle was delivered (PO, BID dosing). Note that Groups 1-2 and Groups 3-4 were sacrificed on Day 7 and Day 14, respectively, and thus did not receive oral dosing. Groups 5-7 were sacrificed on Day 21. The last oral doses were administered on Day 27. Groups 8-10 were sacrificed on Day 28. Groups 11-12 (which specifically serve to assess resolution in this model, and do not examine the effect of SB525334) were sacrificed on Day 42.
Please see original design PowerPoint deck for additional information on groups (12 groups, n = 8-10 per group at onset of study; a total of seven animals died during the study).
\newpage
# Analysis
## A survey of available biological data for the SSc model
```{r, load_data, results="hide", warning=FALSE}
source("../../resources/load_data.R", chdir=TRUE)
```
The amount of collagen in the mouse lung and skin is measured via hydroxyproline (*hypro*). In the lung, we observe slightly elevated hypro on days 14 and 21 in the bleomycin-treated mice compared to PBS or the ALK5 inhibitor SB525334. In skin, the trend is not as apparent as SB525334 results in lower hypro at day 21, but by day 28 there is almost no distinction between it and vehicle.
```{r, hyroxyproline}
forPlotting <- biodata %>%
mutate(Treatment..PO.=ifelse(is.na(Treatment..PO.), "none", Treatment..PO.)) %>%
group_by(Treatment..SC., Treatment..PO., Timepoint) %>%
summarize(mean_skin=mean(Skin.Hypro..ug.HP...ug.skin., na.rm=TRUE),
sd_skin=sd(Skin.Hypro..ug.HP...ug.skin., na.rm=TRUE),
mean_lung=mean(Lung.Hypro..ug.HP....ml...il...pcl.., na.rm=TRUE),
sd_lung=sd(Lung.Hypro..ug.HP....ml...il...pcl.., na.rm=TRUE)) %>%
gather(measurement, value, mean_skin:sd_lung) %>%
separate(measurement, c("measurement", "tissue"), sep="_") %>%
spread(measurement, value)
ggplot(forPlotting, aes(x=Timepoint, y=mean, color=Treatment..PO.)) +
facet_grid(tissue ~ Treatment..SC., scales="free_y") +
geom_point(pch=1, stroke=1) +
geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width=1.5) +
labs(y="hydroxyproline")
```
Of the measurements available for mouse lung function, *inspiratory capacity* (IC) is considered to be the most analogous to *forced vital capacity* (FVC). We observe that IC is higher for PBS-treated mice than for bleomycin-treated mice. Interestingly, bleomycin-treated mice also treated with the ALK5 inhibitor SB525334 appear to show slightly higher IC. Where possibly, we have fit a line to show that there is a slight decreased in IC from day 21 to day 28. As might be expected, by day 42 the bleomycin-treated mice appear to have higher IC, suggesting fibrotic resolution is underway.
```{r, inspiratory_capacity, warning=FALSE}
ggplot(biodata, aes(x=Timepoint, y=Inspiratory.Capacity, color=Treatment..SC.)) +
facet_wrap(~ Treatment..PO.) +
geom_point() +
geom_smooth(method="lm")
```
## Modeling inspiratory capacity
To model IC, we fit a linear regression model to the expression data, computed with an elastic net regularization path. We model each tissue separately.
The objective of this exercise is to determine if a skin biopsy could function as a surrogate for a lung-based diagnostics (eg. CT scan).
### Modeling IC with lung RNA-Seq
For our first pass, we construct a model of IC using lung RNA-Seq data. Our hypothesis is that lung RNA-Seq should be indicative of lung phenotypic measures, including IC. Ideally, skin RNA-Seq would model IC as well as lung, at best.
Before proceeding, we need to optimize the 2 parameters to `glmnet`'s elastic nets: $\alpha$, the elastic net mixing parameter, and $\lambda$, the penalty. We achieve this using a ***grid search*** over both parameters. The first step to perform the grid search is to use the same balanced cross-validation to preserve class distribution in the various folds for training. Aside from choosing the $\lambda$ penalty parameter, the grid search tests whether the $\alpha$ parameter has a strong effect on the performance of the model.
```{r, model_data_lung_ic}
modelDat <- doe %>%
filter(Tissue == "lung") %>%
left_join(select(biodata, animal_number, Inspiratory.Capacity), by="animal_number") %>%
filter(!is.na(Inspiratory.Capacity))
x <- t(logCPM[, modelDat$ObservationID])
y <- modelDat$Inspiratory.Capacity
```
```{r, create_folds}
# In glmnet, one can specify which fold specific observations are in. This vector
# is effectively an index with each position reflecting an observation and each
# value at the position representing the specific fold ID. In the cv.glmnet()
# code definition, if foldid is not specified, it is automatically generated
# based on the size of the data and the number of folds. This is shown below:
#
# N <- nrow(iris) # using the iris dataset as an example
# nfolds <- 5
# foldid <- sample(rep(seq(nfolds), length = N)) # code from cv.glmnet()
#
# To read more, see my gist data_splitting.Rmd script:
# http://biogit.pri.bms.com/gist/ammarr/ad682402ad7c014b3c3b
# k-fold balanced cross validation
createGlmnetFolds <- function(response, numFolds=CV_FOLDS) {
testFoldList <- createFolds(response, k=numFolds)
testFolds <- rep(-1, length(response))
# Format the fold vector as input for foldid parameter in cv.glmnet()
for (i in seq_len(numFolds)) {
testFolds[testFoldList[[i]]] <- i
}
return(testFolds)
}
```
```{r, grid_search_lung_ic}
gridSearch <- function(x, y, errorType="mse", n=50) {
testFolds <- createGlmnetFolds(y)
alphaSearchSpace <- seq(0, 1, length=n)
lambdaSearchSpace <- 2 ^ seq(0, -13, length=n) # trying to use the bounds of glmnet
errorDF <- foreach (a = alphaSearchSpace, .inorder=FALSE, .combine=bind_rows,
.packages=c("glmnet", "dplyr")) %dopar% {
# NOTE: cv.glmnet() is also parallelized. I've observed this *significantly*
# improves performance.
cvfit <- cv.glmnet(x, y, foldid=testFolds, family="gaussian",
type.measure=errorType, lambda=lambdaSearchSpace, alpha=a,
parallel=TRUE)
return(data.frame(alpha=a, lambda=cvfit$lambda, error=cvfit$cvm, nzero=cvfit$nzero))
}
ggplot(errorDF, aes(x=alpha, y=log(lambda, base=2), z=error, label=nzero)) +
geom_raster(aes(fill=error)) +
scale_fill_distiller(palette="Spectral") +
# Use a formula to make the fewer non-zero coefs larger while making 0 tiny
geom_text(aes(size=(1/(ifelse(nzero == 0, 100, nzero))))) +
#geom_contour(aes(col=..level..)) +
labs(x=expression(alpha), y=expression(log[2] * lambda), fill="Error",
title="glmnet grid search performance annotated with # of non-zero coefficients") +
guides(size=FALSE)
}
gridSearch(x, y)
```
Based on this grid search, we observe that maximal performance can be obtained at any $\alpha > 0$. It demonstrates that model performance is not associated with $\alpha$, and we are free to set it to a high value to obtain a *lasso*-like model with reduced features (fewer genes). This is useful when deriving a minimal set of biomarkers for the IC.
In order to construct our final model, we use the `caret` package and a fixed $\alpha$ value of 0.95, to optimize $\lambda$ across `r CV_REPEATS` repeats of `r CV_FOLDS`-fold cross-validation.
```{r, repeated_cv_lung_ic, warning=FALSE}
buildModel <- function(x, y, alpha=0.95, lambdaUpper=0, lambdaLower=-13, n=50,
modelName="repeated_cv") {
# Define training control and model parameters
train_control <- trainControl(method="repeatedcv", number=CV_FOLDS, repeats=CV_REPEATS)
lambdaSearchSpace <- 2 ^ seq(lambdaUpper, lambdaLower, length=n)
modelParameters <- data.frame(alpha=alpha, lambda=lambdaSearchSpace)
model <- train(x, y, method="glmnet", trControl=train_control, tuneGrid=modelParameters)
#print(model)
# Save model for subsequent testing during development
if (!is.null(modelName)) {
saveRDS(model, paste0(modelName, "_" , round(unclass(Sys.time())),".rds"))
}
g1 <- ggplot(model$results, aes(x=log(lambda, base=2), y=RMSE)) +
geom_point(col="tomato") +
geom_errorbar(aes(ymin=RMSE-RMSESD, ymax=RMSE+RMSESD)) +
geom_vline(xintercept=log(model$bestTune$lambda, base=2), lty=2) +
labs(x=expression(log[2] * lambda))
g2 <- ggplot(model$results, aes(x=log(lambda, base=2), y=Rsquared)) +
geom_point(col="tomato") +
geom_errorbar(aes(ymin=Rsquared-RsquaredSD, ymax=Rsquared+RsquaredSD)) +
geom_vline(xintercept=log(model$bestTune$lambda, base=2), lty=2) +
labs(x=expression(log[2] * lambda), y=expression(R^2))
return(list(model=model, rmse_plot=g1, rsquared_plot=g2))
}
out <- buildModel(x, y, modelName="lung_ic_repeated_cv")
plot(out$rmse_plot)
plot(out$rsquared_plot)
```
```{r, repeated_cv_lung_ic_performance}
# We extract the best model with the best tuning parameters from the caret
# training model object to retrieve the coefficients.
getCoefficients <- function(model) {
coefficients <- coef(model$finalModel, s=model$bestTune$lambda)
# Get the non-zero coefficients
coefficients <- coefficients[which(abs(coefficients) > 0), , drop=FALSE]
coefficients <- data.frame(GeneID=rownames(coefficients), coef=coefficients[, 1]) %>%
left_join(select(annotation, GeneID, GeneName, GeneDescription), by="GeneID")
return(arrange(coefficients, (desc(coef))))
}
getCoefficients(out$model)
yHat <- predict(out$model, newdata = x, verbose=T)
```
In-sample testing yields an $R^2 = `r (cor(y, yHat)) ^ 2`$
### Modeling IC with skin RNA-Seq
Next, we construct a model of IC using *skin* RNA-Seq.
```{r, skin_ic_model, warning=FALSE}
modelDat <- doe %>%
filter(Tissue == "skin") %>%
left_join(select(biodata, animal_number, Inspiratory.Capacity), by="animal_number") %>%
filter(!is.na(Inspiratory.Capacity))
x <- t(logCPM[, modelDat$ObservationID])
y <- modelDat$Inspiratory.Capacity
gridSearch(x, y)
out <- buildModel(x, y, modelName="skin_ic_repeated_cv")
plot(out$rmse_plot)
plot(out$rsquared_plot)
getCoefficients(out$model)
yHat <- predict(out$model, newdata = x, verbose=T)
```
In-sample testing yields an $R^2 = `r (cor(y, yHat)) ^ 2`$
### Modeling IC data: Partitioning the data for more accurate testing
To prevent overfitting and obtain a more realistic estimate of model performance, we train the model on a subset of the `r length(y)` observations, and test on the remainder.
```{r, ic_models_split, warning=FALSE}
assessPerformanceWithSplit <- function(splitTissue, trainingPercentage=0.9, n=100) {
# Note: trainingPercentage of 0.9 means 90% for training, 10% for testing
perfStartTime <- Sys.time()
modelDat <- doe %>%
filter(Tissue == splitTissue) %>%
left_join(select(biodata, animal_number, Inspiratory.Capacity), by="animal_number") %>%
filter(!is.na(Inspiratory.Capacity))
x <- t(logCPM[, modelDat$ObservationID])
y <- modelDat$Inspiratory.Capacity
performance <- foreach (i = seq_len(n), .inorder=FALSE, .packages=c("caret"),
.combine=c, # combine results into vector not list
# .export deals with vars from global env that somehow
# aren't showing up in the foreach. I can't even...
.export=c("buildModel", "CV_REPEATS", "CV_FOLDS")) %dopar% {
trainIndexes <- createDataPartition(y, p=trainingPercentage)$Resample1
xTrain <- x[trainIndexes, ]
yTrain <- y[trainIndexes]
xTest <- x[-trainIndexes, ]
yTest <- y[-trainIndexes]
# Don't need grid search for each model- have already checked that a=0.95 is fine
#gridSearch(xTrain, yTrain)
out <- buildModel(xTrain, yTrain, lambdaUpper=-5, lambdaLower=-10, n=10, modelName=NULL)
yHat <- predict(out$model, newdata=xTest, verbose=T)
return((cor(yTest, yHat)) ^ 2)
}
message(paste("Performance estimate runtime =", format(Sys.time() - perfStartTime)))
return(performance)
}
performanceLung <- assessPerformanceWithSplit("lung")
performanceSkin <- assessPerformanceWithSplit("skin")
forPlotting <- data.frame(lung=performanceLung, skin=performanceSkin) %>%
gather(tissue, RSquared)
ggplot(forPlotting, aes(x=tissue, y=RSquared, fill=tissue)) +
geom_boxplot(width=0.2) +
guides(fill=FALSE) +
labs(y=expression(R^2))
rm(forPlotting)
```
We observe that lung RNA-Seq data is generally more informative for predicting IC ($median(R^2_{lung}) \approx `r round(median(performanceLung), 1)`$). However, skin RNA-Seq is also informative ($median(R^2_{skin}) \approx `r round(median(performanceSkin), 1)`$) and may be used as a proxy for IC.
With only `r length(y)` samples, we cannot construct a robust model of IC, but these results are encouraging and suggest that with more data, we may be able to model IC using expression data alone.
# Future Directions
We can also consider modeling lung hydroxyproline. We would benefit of this is there are many more measurements of hypro (n=`r length(na.omit(biodata$Lung.Hypro..ug.HP....ml...il...pcl..))`) vs IC (n=`r length(y)`).
```{r, end_parallelization}
stopCluster(cl)
```
# References
<!-- From https://stackoverflow.com/questions/41532707/include-rmd-appendix-after-references
div tells Pandoc to include the refs here, rather than at the end of the document. -->
<div id="refs"></div>
------
\newpage
# System Information
***Time required to process this report:*** *`r format(Sys.time() - startTime)`*
***R session information:***
```{r, echo_session_info}
sessionInfo()
```
```{r, domino, results="asis"}
if (Sys.getenv("DOMINO_WORKING_DIR") != "") {
cat(paste("\n\n## Domino environment details",
paste("**Domino User:**", Sys.getenv("DOMINO_PROJECT_OWNER")),
paste("**Domino Project:**", Sys.getenv("DOMINO_PROJECT_NAME")),
paste("**Domino Run ID:**", Sys.getenv("DOMINO_RUN_ID")),
paste("**Domino Run #:**", Sys.getenv("DOMINO_RUN_NUMBER")),
sep="\n\n"))
}
```