---
title: "Let's Get Plotting!"
subtitle: "Data Visualization, Day 1"
author: "Sarah Parker, JP Flores, Austin Daigle"
format:
  html:
    toc: true
---
For this lesson, we will be using a .bed file of genetic variants from <https://marianattestad.com/blog>. We will need to read in this data using the `read.table()` function, then rename the columns with the `names()` function:
```{r}
# Specify the URL of the dataset
url <- "https://raw.githubusercontent.com/How-to-Learn-to-Code/Rclass-DataScience/main/data/DataVizDay1-files/variants_from_assembly.bed"
variants <- read.table(url, sep="\t", quote='', stringsAsFactors=TRUE,header=FALSE)
names(variants) <- c("chrom","start","stop","name","size","strand","type","ref.dist","query.dist")
```
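To see what `read.table()` and `names()` do without downloading anything, here is a minimal sketch on a made-up two-row table. The `toy_text` string and its values are hypothetical and are not taken from the real file; `text =` is a standard `read.table()` argument that reads from a string instead of a file or URL.

```{r}
# Hypothetical tab-separated text standing in for a tiny .bed file
toy_text <- "chr1\t100\t200\nchr2\t300\t450"

# Same options as above, but reading from a string instead of a URL
toy_bed <- read.table(text = toy_text, sep = "\t", quote = "",
                      stringsAsFactors = TRUE, header = FALSE)

# Without a header, R assigns the default names V1, V2, V3
names(toy_bed)

# Replace the default names with meaningful ones
names(toy_bed) <- c("chrom", "start", "stop")
toy_bed
```

Because `stringsAsFactors = TRUE`, the text column `chrom` is stored as a factor, while `start` and `stop` stay numeric.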
Let's take a look at this dataset and what kind of research questions we could explore using this data.
```{r}
head(variants)
```
There are 9 columns in this dataset: the genomic position (chrom, start, stop), name, size, strand, type, distance to reference (ref.dist), and distance to query (query.dist). What are some questions we could ask about this data?
Some examples:
- What is the distribution of distances to the reference?
- Are the sizes of variants of different types different?
We can quickly explore questions like these by creating quick visualizations of the data.
## Let's Get Plotting!
**Choosing a plot type**
Data visualizations can tell us about the relationships of different variables in a data set. There are 3 main categories of these relationships, each answering a different type of question about the data.
1. The variation *within* a single variable
   - How do expression levels of a gene vary among patient samples?
2. The co-variation *between* a continuous and categorical variable
   - How does beak size compare between penguins living on different islands?
3. The co-variation *between* two continuous variables
   - How does trunk thickness relate to the age of a tree?
::: {#tip-example .callout-tip}
## Differentiating between continuous variables and categorical variables that are represented by a number
- If you can replace the number with a descriptor and it still makes sense, it is a categorical variable.
  - You can tell R that this is a categorical variable by using `factor()` or `as.character()`.
  - e.g. chromosome 1 and chromosome 2 could be re-labeled as chromosome A and chromosome B or "first chromosome" and "second chromosome" without fundamentally changing the information.
- If you can add/subtract two values and it still makes sense, it is a continuous variable.
  - e.g. subtracting chromosome 6 - 2 = 4 gives a 4 that doesn't mean anything, but subtracting size 317 - 185 = 132 means one variant is 132 bp larger than the other.
:::
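As a small illustration of the tip above, the sketch below converts numbers into categories. The `chrom_num` values are made up for the example.

```{r}
# Hypothetical chromosome numbers stored as plain numbers
chrom_num <- c(1, 2, 2, 6)

# As numbers, R will happily do arithmetic that has no biological meaning
chrom_num[4] - chrom_num[2]

# Converting to a factor tells R to treat each value as a category label
chrom_cat <- factor(chrom_num)
levels(chrom_cat)

# as.character() also works; the values become text labels
chrom_chr <- as.character(chrom_num)
class(chrom_chr)
```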
There are 2 main ways we create plots in R:
1. Using base R functions (e.g. `plot()`)
2. Using tidyverse functions (e.g. `ggplot()`)
   - This requires loading the `ggplot2` package with `library(ggplot2)`
### Numeric vs Numeric
Let's try this with a plot type you are likely very familiar with: a scatterplot.
A scatterplot looks at co-variation between 2 numeric variables, so what are 2 numeric variables we have in this dataset?
Let's try plotting `ref.dist` vs `query.dist`:
#### Base R
For base R, the `plot()` function takes in vectors of x and y values to plot.
Q: How do we extract the entire column of `ref.dist` and `query.dist` from our dataset, `variants`?
A: `variants$ref.dist` and `variants$query.dist`
```{r}
## create a scatterplot
plot(x = variants$ref.dist, y = variants$query.dist)
```
That's great! But it's not easy to understand what these x and y axes are, so let's relabel them by changing the parameters `xlab` and `ylab` inside the `plot()` function call:
```{r}
## create a scatterplot with better axes labels
plot(x = variants$ref.dist, y = variants$query.dist,
     xlab = "Reference distance", ylab = "Query Distance")
```
Great! We can also add a title and subtitle with `main` and `sub`. Try this on your own!
Other plot types:
| Function | Plot Type |
|---------------------------------|-------------|
| `plot()` | scatterplot |
| `lines()` or `plot(type = "l")` | line plot |
To plot a line on top of points, you can run `lines()` with the same data immediately following `plot()`. Notice that the line connects the points in the order they appear in the data rather than from left to right, leading to a bit of a jumbled mess. How do you think we can fix this?
```{r}
## line plot
plot(x = variants$ref.dist, y = variants$query.dist,
     xlab = "Reference distance", ylab = "Query Distance")
lines(x = variants$ref.dist, y = variants$query.dist)
```
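One possible answer to the question above is to sort the rows by the x variable before drawing the line. This is a minimal sketch on made-up data; the `toy` data frame and its values are hypothetical, and only the column names are borrowed from `variants`.

```{r}
# Hypothetical toy data, deliberately out of order along x
toy <- data.frame(ref.dist = c(30, 10, 40, 20),
                  query.dist = c(3, 1, 4, 2))

# order() returns the row indices that sort ref.dist from smallest to largest
toy_sorted <- toy[order(toy$ref.dist), ]

# Now the line moves left to right instead of jumping back and forth
plot(x = toy_sorted$ref.dist, y = toy_sorted$query.dist,
     xlab = "Reference distance", ylab = "Query distance")
lines(x = toy_sorted$ref.dist, y = toy_sorted$query.dist)
```

The same idea should carry over to the real data by replacing `toy` with `variants`.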
#### ggplot
For more complex layered plots, we can also use `ggplot()` from the `ggplot2` package. "gg" stands for "grammar of graphics" and plots are built a bit like sentences with different parts building on each other.
To start, let's load in the `ggplot2` package. You will only need to do this once per R session.
```{r}
## load the ggplot2 library
library(ggplot2)
```
Now, each time we want to make a plot, we will start by using `ggplot()`.
```{r}
ggplot()
```
We haven't given this function any data, so right now we just have an empty grey box.
The first layer we can add is the data. `ggplot()` requires the name of the dataset once, then you can just use column names throughout the rest of the code instead of using `dataset$var1`, `dataset$var2`, `dataset$var3` , etc.
```{r}
ggplot(variants)
```
We still have a grey box! Although we have told the function we want to use data from the `variants` dataset, we didn't tell it *which* data we want to use. Any time you are referencing a column name to set the position, color, size, etc. of a point, you need to wrap it inside `aes()`, which is short for "aesthetics". This looks something like this:
```{r}
ggplot(variants, aes(x = ref.dist, y = query.dist))
```
More than a grey box! Now that we know which variables we are plotting and the dataset they come from, we have the framework to add the next layer: geometry. The geometry, as you might guess, refers to what type of shapes to put on the plot. Is it a line? A point? A bar? For scatterplots, we want the data represented as points, so we will use `geom_point()`.
Whenever we add a `ggplot` layer, we will connect it to the current plot using a `+` sign:
```{r}
ggplot(variants, aes(x = ref.dist, y = query.dist)) +
  geom_point() # plot as points
```
Voila! We have our scatterplot! Now we can continue adding layers to change things like the labels. Use the `labs()` function to change the x, y, and optionally the title and subtitle:
```{r}
ggplot(variants, aes(x = ref.dist, y = query.dist)) +
  geom_point() + # plot as points
  labs(x = "Reference Distance",
       y = "Query Distance",
       title = "Plot Title",
       subtitle = "Plot Subtitle")
```
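The earlier rule about `aes()` applies to more than position. The sketch below contrasts mapping a column to color (inside `aes()`) with setting one fixed color (outside `aes()`). The `toy` data frame and its type labels are made up for illustration.

```{r}
library(ggplot2)

# Hypothetical toy data with the same column names as `variants`
toy <- data.frame(ref.dist = c(10, 20, 30, 40),
                  query.dist = c(1, 2, 3, 4),
                  type = c("Deletion", "Insertion", "Deletion", "Insertion"))

# Inside aes(): color is mapped to a column, so each type gets its own color
p_mapped <- ggplot(toy, aes(x = ref.dist, y = query.dist, color = type)) +
  geom_point()
p_mapped

# Outside aes(): color is a fixed value, so every point is blue
p_fixed <- ggplot(toy, aes(x = ref.dist, y = query.dist)) +
  geom_point(color = "blue")
p_fixed
```

A useful rule of thumb: if the value is a column name, it goes inside `aes()`; if it is a constant such as `"blue"`, it goes outside.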
ggplots will all follow this general formula, changing the `geom` for different plot types.
Other plot types:
| Function | Plot Type |
|-----------------|--------------------------------|
| `geom_point()` | scatterplot |
| `geom_line()` | line plot |
| `geom_smooth()` | line plot of smoothed averages |
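To plot a line on top of points, you can add a second geom, using `geom_point()` + `geom_line()`. Unlike base R's `lines()`, `geom_line()` connects the points in order of their x values. The second plot below uses `geom_smooth()` on its own to draw a smoothed trend line.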
```{r}
ggplot(variants, aes(x = ref.dist, y = query.dist)) +
  geom_point() + # plot as points
  geom_line() + # plot as line
  labs(x = "Reference Distance",
       y = "Query Distance",
       title = "Plot Title",
       subtitle = "Plot Subtitle")

ggplot(variants, aes(x = ref.dist, y = query.dist)) +
  geom_smooth() + # plot as smooth line
  labs(x = "Reference Distance",
       y = "Query Distance",
       title = "Plot Title",
       subtitle = "Plot Subtitle")
```
### Numeric Distribution
Sometimes we want to look at the distribution or spread of values for one continuous variable. For example, in our data, what is the distribution of sizes for our variants?
There are several plot types we can use to explore this question
| Plot Type | Base R Function | ggplot Function |
|--------------|--------------------|-------------------------------|
| histogram | `hist()` | `geom_histogram()` |
| density plot | `plot(density())` | `geom_density()` |
| boxplot | `boxplot()` | `geom_boxplot()` |
| violin plot | not available | `geom_violin()` |
| bar plot | `barplot(table())` | `geom_bar()` |
::: callout-tip
Try `?function_name` (for example, `?hist`) to learn more about the different parameters used to customize each function.
:::
#### Base R
In base R, we can pass each plotting function the vector of values that we want to plot. In this case, we want to plot all of the values in the `size` column of `variants`, so we will pass in `variants$size` to our plotting functions.
```{r}
## histogram
hist(variants$size)
## density plot
plot(density(variants$size))
## boxplot
boxplot(variants$size)
```
Try adding labels and titles as described before.
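To make it concrete what a histogram computes, here is a minimal sketch on made-up sizes. The `toy_size` values and the choice of bin edges are assumptions for illustration; `breaks`, `plot`, `main`, and `xlab` are standard `hist()` arguments.

```{r}
# Hypothetical variant sizes
toy_size <- c(5, 15, 25, 25, 35)

# plot = FALSE returns the binning without drawing anything
h <- hist(toy_size, breaks = seq(0, 40, by = 10), plot = FALSE)
h$breaks  # bin edges
h$counts  # number of values falling in each bin

# Draw the same histogram with a title and axis label
hist(toy_size, breaks = seq(0, 40, by = 10),
     main = "Toy variant sizes", xlab = "Size (bp)")
```

Changing `breaks` changes how the values are grouped, which can noticeably change the shape of the histogram.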
#### ggplot
In ggplot, we start by calling the base `ggplot()` function with the entire dataset, `variants`, then we can set the aesthetics of the x value to our column of interest with `aes(x = size)`. Then, we can add unique geoms for each plot type.
```{r}
## histogram
ggplot(variants, aes(x = size)) +
  geom_histogram()

## density
ggplot(variants, aes(x = size)) +
  geom_density()

## boxplot
ggplot(variants, aes(x = size)) +
  geom_boxplot()

## violin plot
# violin plots are meant to compare categories, but we can ask for a single violin by setting the `y` aesthetic to the constant 1
ggplot(variants, aes(x = size)) +
  geom_violin(aes(y = 1))
```
Try adding labels and titles as described before. You can also easily change the look of ggplots with different themes: try typing `theme_` and looking at the different autofill options. See more about built-in themes [here](https://ggplot2.tidyverse.org/reference/ggtheme.html) and more ways to customize themes [here](https://ggplot2.tidyverse.org/reference/theme.html).
```{r}
ggplot(variants, aes(x = size)) +
  geom_density() +
  theme_classic()
```
### Continuous vs Categorical
Sometimes we want to see how the counts or the distribution of a continuous variable change between different categorical groups. Often, we use a barplot for this, but there are several other plot types we can use
| Plot Type | Base R Function | ggplot Function |
|--------------|---------------------------------------------|------------------|
| bar plot | `barplot(height = cont_var, names.arg = cat_var)` | `geom_bar()` |
| boxplots | `boxplot(cont_var ~ cat_var)` | `geom_boxplot()` |
| violin plots | not available | `geom_violin()` |
where `cont_var` is the continuous variable and `cat_var` is the categorical variable.
#### Base R
In base R, if we want to plot a numeric variable by a categorical variable, we use the `~` symbol to represent "by".
For example, if we wanted to plot size by type, we would do:
```{r}
boxplot(variants$size ~ variants$type)
```
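The formula can also be written with bare column names by passing the data frame through the `data` argument of `boxplot()`. This is a minimal sketch on made-up data; the `toy` data frame and its values are hypothetical.

```{r}
# Hypothetical toy data with the same column names as `variants`
toy <- data.frame(size = c(100, 300, 50, 50, 50),
                  type = c("Deletion", "Deletion",
                           "Insertion", "Insertion", "Insertion"))

# With `data =`, the formula can use bare column names: size "by" type
b <- boxplot(size ~ type, data = toy)

# boxplot() invisibly returns the statistics it drew
b$names  # one box per category
b$n      # number of observations in each box
```

On the real data, the equivalent call would be `boxplot(size ~ type, data = variants)`.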
To make a barplot, we have to first count how many instances there are of each category using the `table()` function. First, we start with subsetting our data to only the columns we are interested in, then we pass this smaller dataset into the `table()` function.
```{r}
# subset all rows, only column "type"
smallData <- variants[,"type"]
variantsCount <- table(smallData)
```
Then, we use the `barplot()` function, setting the `height` of the bars to the counts in the new table and the `names.arg` of the bars to the names of the table. The height of each bar is the number of variants of that type.
```{r}
barplot(height = variantsCount, names.arg = names(variantsCount))
```
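To see what `table()` returns and how a count differs from a sum, here is a minimal sketch on made-up data. The `toy` data frame is hypothetical, and `tapply()` is used only as a point of comparison.

```{r}
# Hypothetical toy data with the same column names as `variants`
toy <- data.frame(size = c(100, 300, 50, 50, 50),
                  type = c("Deletion", "Deletion",
                           "Insertion", "Insertion", "Insertion"))

# table() counts how many rows fall into each category
toy_count <- table(toy$type)
toy_count

# For comparison, tapply() sums the sizes within each category
toy_sum <- tapply(toy$size, toy$type, sum)
toy_sum

# Bar heights here are counts, not summed sizes
barplot(height = toy_count, names.arg = names(toy_count))
```

Passing `toy_sum` as `height` instead would draw the summed size for each type.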
#### ggplot
When using ggplot, the summary statistics for each category (median, quartiles, and so on) are calculated for us. So, we can create a base plot, setting x to the categorical variable and y to the continuous variable, then add `+ geom_boxplot()` to make a boxplot.
```{r}
## boxplot
ggplot(variants, aes(x = type, y = size)) +
  geom_boxplot()
```
To make a barplot, we need to give it a bit more information. The height of the barplot is based on certain statistics, such as sum or mean. Since these are summary statistics, we will add `stat = "summary"` and `fun = "mean"` if we want the bar height to relate to the means of each category. How does this compare to the sums?
```{r}
## bar plot of the mean size for each type
ggplot(variants, aes(x = type, y = size)) +
  geom_bar(stat = "summary", fun = "mean")
```
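To explore the question about sums, one option is to swap `fun = "mean"` for `fun = "sum"`. The sketch below uses a made-up `toy` data frame, chosen so that the two statistics rank the categories differently, and uses `tapply()` to print the numbers behind the bars.

```{r}
library(ggplot2)

# Hypothetical toy data with the same column names as `variants`
toy <- data.frame(size = c(100, 300, 150, 150, 150),
                  type = c("Deletion", "Deletion",
                           "Insertion", "Insertion", "Insertion"))

# Mean size per type: Deletion is larger on average
toy_means <- tapply(toy$size, toy$type, mean)
toy_means

# Summed size per type: Insertion is larger in total because it has more rows
toy_sums <- tapply(toy$size, toy$type, sum)
toy_sums

# Bar heights are the summed size within each type
p_sum <- ggplot(toy, aes(x = type, y = size)) +
  geom_bar(stat = "summary", fun = "sum")
p_sum
```

A category with many small variants can have a large sum but a small mean, so it is worth stating which statistic a bar plot shows.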
## Additional Resources
Try changing the colors using one of these tutorials:
<http://www.sthda.com/english/wiki/wiki.php?title=ggplot2-colors-how-to-change-colors-automatically-and-manually>
<https://www.datanovia.com/en/blog/ggplot-colors-best-tricks-you-will-love/>