-
Notifications
You must be signed in to change notification settings - Fork 1
/
WEHIweek4.Rmd
393 lines (279 loc) · 12 KB
/
WEHIweek4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
---
title: "WEHI Introduction to R - Week 4"
author: Lucy Liu
output:
pdf_document: default
html_document:
keep_md: true
---
```{r}
library(knitr)
opts_chunk$set(message=FALSE)
```
This lesson and all the code was written by [Brendan Ansell](https://github.com/bansell) - I have just added my own teaching notes.
# Tidy data
Go over tidy data. Each column is a variable (you can think of as thing that was measured) and each row is an observations.
We learnt about 2 functions
* one takes our data from wide to long - what was the name? gather
* the other long to wide - what was the name? spread
Questions?
# Join
Cover the join() function in the slides.
Ask what they expect to get when using join on the Marvel dataframe.
## Practice
Now we are going to practice using join.
Brendan sent around the link to download the data. Did anyone have trouble - let the helpers know. This is what your data should look like in your computer - navigate to folder and set working directory.
We want to set our working directory - this means that R will look here by default to import data and save any plots or tables we produce goes here by default.
Let's read in our data. Does anyone remember the function?
```{r}
# first load library
library(tidyverse)
```
```{r, eval=FALSE}
# Note how R helps us.
GO_list <- read_csv("Week4_biolData/GO_join/GO_list.csv")
GO_entrez <- read_csv("Week4_biolData/GO_join/GO_entrez.csv")
Hs_annot <- read_csv("Week4_biolData/GO_join/Hs_annot.csv")
GO_desc <- read_csv("Week4_biolData/GO_join/GO_desc.csv")
```
Let's have a look at our dataframes.
* GO_list is just 1 column with 5 GO terms
* GO_entrez has all the GO terms, and gives the matching entrez gene IDs
* Hs_annot has all the entrez gene IDs and gives the matching gene name and description
* GO_desc gives us the description of all the GO terms
What we want to do now, is add all of this information to our dataframe of just 5 GO terms.
Let's add the matching entrez gene ID's first.
Which join should I use?
Which dataframes do I want to join?
```{r, eval=FALSE}
left_join(GO_list, GO_entrez)
```
Now how to I save this as a variable so it comes up here (envir) and I can use it later?
```{r, eval=FALSE}
GO_entrezJoin <- left_join(GO_list, GO_entrez)
```
Now our code worked because there is a column called go_id in both GO_list and GO_entrez and they are spelt exactly the same.
If we want to join 2 dataframes but the columns we want to join them with do not match, we have to specifically tell R this.
What is the column that we want to join by here?
```{r, eval=FALSE}
left_join(GO_entrezJoin, Hs_annot, by = c("entrez_id" = "ENTREZID"))
```
Now how do I save this as a variable?
```{r, eval=FALSE}
GO_entrez_genename <- left_join(GO_entrezJoin, Hs_annot, by = c("entrez_id" = "ENTREZID"))
```
***
Challenge: Join the GO_entrez_genename dataframe to GO_desc
***
# Save image
(skip if necessary)
If you want to save everything here (envir) - all the objects that you have created, you can use the save.image() function. All you have to tell it, is what you want to name the file. It ends with Rdata.
So if I open a new RStudio without all these objects or I delete them all (this sweeping broom here does this), I can load all my objects again using load().
```{r, eval=FALSE}
save.image(file="myEnv_objects.Rdata")
load("myEnv_objects.Rdata")
```
# For loops
For loops are an important programming concept and it's present in pretty much all programming languages. The idea is that you can tell the computer to repeat the same task many times without typing the command many times.
--show structure of for loop on whiteboard--
You start with the word for, then brackets. Here is the name for the thing that will change at every loop - we can call it anything but we just have to name it so we can refer to it later. Here is the vector (remember a vector is just a ordered list of things - like a column in excel) that tells the for loop how many times to run and what this here changes to every time.
Now we need curly brackets and here, is the body of the for loop - we tell it what to do at every loop.
Let's have a look at an example. First do you remember what this gives us?
```{r, eval=FALSE}
1:5
```
What do you think will happen if I run this code?
```{r}
for (i in 1:5){
print(i)
}
```
So how many times has the for loop run? Why?
What if I do this? So I create a vector like this:
```{r}
helpers <- c("Anna", "Quentin","luyi","luke")
```
What happens if I do this? Which part of the for loop tells it what i is every time?
```{r, eval=FALSE}
for (i in helpers){
print(i)
}
```
What happens if I do this?
```{r, eval=FALSE}
for (i in helpers){
print(name)
}
```
Let's make it a little interesting with the paste function. Now how can I get help on a function?
You just type in ? and the function name and the help page will come up. Let's see what help says. Generally the help pages are hard to understand but you will get used to it.
What do you think this will do?
```{r}
paste("hello","world")
```
What about this? Now do you remember what this gives us?
```{r, eval=FALSE}
1:5
```
```{r}
paste("Number", 1:5)
```
Paste as a sep argument. What do you think this will do?
```{r}
paste("Number", 1:5, sep = "-")
```
-- See slides
Let's use it in a for loop. We have some scientists here working out what the saying is. And we just want to make a for loop so they don't have to write the whole thing out every time.
Now what is the thing that is changing every time?
Which part of the for loop tells what i changes to each time? Remember how I made a vector of names here - now I need to make a vector of 'liquids' and put that in the for loop.
***
Challenge: Get R to print this out in a for loop
***
tea break?? - write on board the dplyr functions, filter, select, mutate, summarise
# Compound plates
We're going to come back to our for loop at the end but now we're going to loop at some compound screening data (go to slides).
I know very little about this so excuse my lack of science - the experiment involves a 1584 well plate. A different compound is put in each well and we get an intensity value for each well, which tells us how many cells there are - I think its basically a measure of cell death. Big number means cells are alive and small number means the cells have died.
Let's read the data in. What function does this?
```{r}
plate <- read_csv("Week4_biolData/screening_plates/plate1.csv")
```
This is great but it's not 'tidy' data. We want every row to be ONE observation. What function can we use to turn this data into tidy data?
We want our data to look like R1, C1 number....
Which columns do we want to gather? Now what do we want to call these column names?
```{r, eval=FALSE}
plate %>%
gather(2:49, key = "Col", value = "Signal")
```
This is great - but we have to do this annoying thing to plot it. We have to remove the R and the C from the row and col columns for plotting. Any idea why? The reason is that it wouldn't be in the correct order. When you plot it it would order R1, R10, R11... because it want to put it in alphabetical order.
```{r}
plate %>%
gather(2:49, key = "Col", value = "Signal") %>%
mutate(ROW = str_remove(ROW, "R")) %>%
mutate(ROW = as.integer(ROW))
```
***
Challenge: Now I want you to do the same with the col column - remove the c and turn it into an integer data type. Then save it as a variable called "plate_clean"
***
This looks good - it's doing what we want. Now how to I save this as a variable so I can refer to it again?
```{r}
plate_clean <- plate %>%
gather(2:49, key = "Col", value = "Signal") %>%
mutate(ROW = str_remove(ROW, "R")) %>%
mutate(ROW = as.integer(ROW)) %>%
mutate(Col = str_remove(Col, "C")) %>%
mutate(Col = as.integer(Col))
```
Now let's plot all these numbers on a graph.
```{r}
plate_clean %>%
ggplot(aes(x=Col, y=ROW, fill=Signal)) +
geom_tile(col="grey")
```
This is great! But we want row 1 to be up here, to match our plate. To do this we add:
```{r, eval=FALSE}
plate_clean %>%
ggplot(aes(x=Col, y=ROW, fill=Signal)) +
geom_tile(col="grey") +
scale_y_reverse()
```
You may have noticed the colours aren't that different. If we wanted different intensities to pop, we set the colour gradient - instead of being light blue to dark blue, we could go blue to white to red - where signals around the median are white.
```{r}
plate_clean %>%
ggplot(aes(x=Col, y=ROW, fill=Signal)) +
geom_tile(col="grey") +
scale_y_reverse() +
scale_fill_gradient2(low="blue",mid="white",high="red", midpoint = 2e5)
```
How would I be able to find the median value of signal? There are a few ways to do it.
```{r, eval=FALSE}
plate_clean %>% summarize(median=median(Signal))
```
Now let's save our graph! We will make a folder called "plate_analysis" and then save things to here!
```{r, eval=FALSE}
ggsave("plate_analysis/signal_plate1.pdf", width=8, height = 5)
```
Nearly at the end now. The last thing we will do is to work out a z score for each well and identify "hits" - wells that really different from the negative control.
First we have to mark wells as -ve control, +ve control or test.
Col 1,23,25 and 47 are positive controls (100% kill, or no cells (empty wells)) - show on graph which ones. Negative controls (no drug) are cols 2,24,26 and 48 -the columns next to the postive controls.
```{r, eval=FALSE}
#lets label the columns as test, positive or negative control:
plate_clean %>%
mutate(wellTag = case_when(Col %in% c(1,23,25,47) ~ 'posCTRL',
Col %in% c(2,24,26,48) ~ 'negCTRL',
TRUE ~ 'test'))
```
We can also double check that our code worked by using the count function
```{r, eval=FALSE}
plate_clean %>%
mutate(wellTag = case_when(Col %in% c(1,23,25,47) ~ 'posCTRL',
Col %in% c(2,24,26,48) ~ 'negCTRL',
TRUE ~ 'test')) %>%
count(wellTag)
```
Now how do we save this as a variable?
```{r}
plateTagged <- plate_clean %>%
mutate(wellTag = case_when(Col %in% c(1,23,25,47) ~ 'posCTRL',
Col %in% c(2,24,26,48) ~ 'negCTRL',
TRUE ~ 'test'))
```
Now we know which well is which, we want to get the mean and sd of the negative controls
How can we get just the negCTRL rows? (remember double == when comparing things)
And then how can we get the mean and standard deviation? Which function
***
Challenge: Find the mean and sd of the negative control wells
***
Remember to save as a variable as we want to use these numbers again.
```{r}
ctrl_metrics <- plateTagged %>%
filter(wellTag=="negCTRL" ) %>%
summarize(mean_negCTRL = mean(Signal),
sd_negCTRL = sd(Signal))
```
How do we get just the rows with 'test'?
Then we want a new column with the z score.
mutate(columnname = formula)
The formula will be signal - meannegcontrol/sd
Now how do we get to the negcontrol mean and sd?? This is what this dataframe looks like.
```{r, eval=FALSE}
ctrl_metrics
```
There are a few ways to do it.
```{r, eval=FALSE}
ctrl_metrics$mean_negCTRL
```
***
Challenge: Calculate z score for test wells - with new column
***
Remember to save as variable.
```{r}
plate_zScores <- plateTagged %>%
filter(wellTag == "test") %>%
mutate(z_score = (Signal - ctrl_metrics$mean_negCTRL) / ctrl_metrics$sd_negCTRL)
```
Now we are going to say hits are ones where the z score is greater than 4 or less than -4 (multiple testing).
What function should I use?
```{r}
hits <- plate_zScores %>%
filter(z_score < -4 | z_score > 4) %>%
arrange(z_score)
```
Write it out to a csv file. Function is write_csv - what do you think we need to tell it??
```{r, eval=FALSE}
write_csv(hits, path = "plate_analysis/hits_plate1.csv")
```
## For loops
Now for our grand finale - we are going to go back to for loops. We have 5 plates! We want to do this for every plate!
So what have we done?
1. read in data
2. make plot
3. save plot
4. get z scores
5. save csv
Which of these do we need to change???
Plot density
(skip if not time)
```{r, eval=FALSE}
plate_zScores %>% ggplot(aes(x = z_score)) + geom_density()
ggsave("plate_analysis/signal_density_plate1.pdf",width=6,height=4)
```