-
Notifications
You must be signed in to change notification settings - Fork 25
/
README.Rmd
430 lines (333 loc) · 13.6 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "70%",
dpi = 150
)
set.seed(1)
```
# ggupset
Plot a combination matrix instead of the standard x-axis and create UpSet plots with ggplot2.
<img src="man/figures/README-violinexample-1.png" width="70%" />
## Installation
You can install the released version of ggupset from [CRAN](https://cran.r-project.org/package=ggupset) with:
``` r
# Download package from CRAN
install.packages("ggupset")
# Or get the latest version directly from GitHub
devtools::install_github("const-ae/ggupset")
```
## Example
This is a basic example which shows you how to solve a common problem:
```{r example}
# Load helper packages
library(ggplot2)
library(tidyverse, warn.conflicts = FALSE)
# Load my package
library(ggupset)
```
In the following I will work with a tidy version of the movies dataset from
ggplot. It contains a list of all movies in IMDB, their release data and other
general information on the movie. It also includes a `list` column that contains
annotation to which genre a movie belongs (Action, Drama, Romance etc.)
```{r}
tidy_movies
```
`ggupset` makes it easy to get an immediate impression how many movies are in each
genre and their combination. For example there are slightly more than 1200 Dramas
in the set, more than 1000 which don't belong to any genre and ~170 that are Comedy
and Drama.
```{r}
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_upset(n_intersections = 20)
```
## Adding Numbers on top
The best feature about `ggupset` is that it plays well with existing tricks from `ggplot2`. For example, you can easily add the size of the counts on top of the bars with this trick from [stackoverflow](https://stackoverflow.com/a/26556180/604854)
```{r}
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
geom_text(stat='count', aes(label=after_stat(count)), vjust=-1) +
scale_x_upset(n_intersections = 20) +
scale_y_continuous(breaks = NULL, lim = c(0, 1350), name = "")
```
## Reshaping quadratic data
Often enough the raw data you are starting with is not in such a neat tidy
shape. But that is a prerequisite to make such `ggupset` plots, so how can you get
from wide dataset to a useful one? And how to actually create a `list`-column, anyway?
Imagine we measured for a set of genes if they are a member of certain pathway.
A gene can be a member of multiple pathways and we want to see which pathways
have a large overlap. Unfortunately, we didn't record the data in a tidy format
but as a simple matrix.
A ficitional dataset of this type is provided as `gene_pathway_membership` variable
```{r}
data("gene_pathway_membership")
gene_pathway_membership[, 1:7]
```
We will now turn first turn this matrix into a tidy tibble and then plot it
```{r}
tidy_pathway_member <- gene_pathway_membership %>%
as_tibble(rownames = "Pathway") %>%
gather(Gene, Member, -Pathway) %>%
filter(Member) %>%
select(- Member)
tidy_pathway_member
```
`tidy_pathway_member` is already a very good starting point for plotting with
`ggplot`. But we care about the genes that are members of multiple pathways so
we will aggregate the data by `Gene` and create a `list`-column with the `Pathway`
information.
```{r}
tidy_pathway_member %>%
group_by(Gene) %>%
summarize(Pathways = list(Pathway))
```
```{r}
tidy_pathway_member %>%
group_by(Gene) %>%
summarize(Pathways = list(Pathway)) %>%
ggplot(aes(x = Pathways)) +
geom_bar() +
scale_x_upset()
```
## What if I need more flexibility?
The first important idea is to realize that a list column is just as good as a character
vector with the list elements collapsed
```{r}
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
mutate(Genres_collapsed = sapply(Genres, function(x) paste0(sort(x), collapse = "-"))) %>%
select(title, Genres, Genres_collapsed)
```
We can easily make a plot using the strings as categorical axis labels
```{r}
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
mutate(Genres_collapsed = sapply(Genres, function(x) paste0(sort(x), collapse = "-"))) %>%
ggplot(aes(x=Genres_collapsed)) +
geom_bar() +
theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5))
```
Because the process of collapsing list columns into delimited strings is fairly generic,
I provide a new scale that does this automatically (`scale_x_mergelist()`).
```{R}
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_mergelist(sep = "-") +
theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0.5))
```
But the problem is that it can be difficult to read those labels.
Instead I provide a third function that replaces the axis labels
with a combination matrix (`axis_combmatrix()`).
```{R}
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_mergelist(sep = "-") +
axis_combmatrix(sep = "-")
```
One thing that is only possible with the `scale_x_upset()` function is to automatically order
the categories and genres by `freq` or by `degree`.
```{R}
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_upset(order_by = "degree")
```
## Styling
To make publication ready plots, you often want to have complete control
how each part of a plot looks. This is why I provide an easy way to style
the combination matrix. Simply add a `theme_combmatrix()` to the plot.
```{r}
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_upset(order_by = "degree") +
theme_combmatrix(combmatrix.panel.point.color.fill = "green",
combmatrix.panel.line.size = 0,
combmatrix.label.make_space = FALSE)
```
## Maximum Flexibility
Sometimes the limited styling options using `combmatrix.panel.point.color.fill` are not enough. To fully customize the combination matrix plot, `axis_combmatrix` has an `override_plotting_function` parameter, that allows us to plot anything in place of the combination matrix.
Let us first reproduce the standard combination plot, but use the `override_plotting_function` parameter to see how it works:
```{r}
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_mergelist(sep = "-") +
axis_combmatrix(sep = "-", override_plotting_function = function(df){
ggplot(df, aes(x= at, y= single_label)) +
geom_rect(aes(fill= index %% 2 == 0), ymin=df$index-0.5, ymax=df$index+0.5, xmin=0, xmax=1) +
geom_point(aes(color= observed), size = 3) +
geom_line(data= function(dat) dat[dat$observed, ,drop=FALSE], aes(group = labels), linewidth= 1.2) +
ylab("") + xlab("") +
scale_x_continuous(limits = c(0, 1), expand = c(0, 0)) +
scale_fill_manual(values= c(`TRUE` = "white", `FALSE` = "#F7F7F7")) +
scale_color_manual(values= c(`TRUE` = "black", `FALSE` = "#E0E0E0")) +
guides(color="none", fill="none") +
theme(
panel.background = element_blank(),
axis.text.x = element_blank(),
axis.ticks.y = element_blank(),
axis.ticks.length = unit(0, "pt"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.line = element_blank(),
panel.border = element_blank()
)
})
```
We can use the above template, to specifically highlight for example all sets that include the _Action_ category.
```{r}
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_mergelist(sep = "-") +
axis_combmatrix(sep = "-", override_plotting_function = function(df){
print(class(df))
print(df)
df %>%
mutate(action_movie = case_when(
! observed ~ "not observed",
map_lgl(labels_split, ~ "Action" %in% .x) ~ "Action",
observed ~ "Non-Action"
)) %>%
ggplot(aes(x = at, y = single_label)) +
geom_rect(aes(fill = index %% 2 == 0), ymin=df$index-0.5, ymax=df$index+0.5, xmin=0, xmax=1) +
geom_point(aes(color = action_movie), size = 3) +
geom_line(data= function(dat) dat[dat$observed, ,drop=FALSE], aes(group = labels, color = action_movie), linewidth= 1.2) +
ylab("") + xlab("") +
scale_x_continuous(limits = c(0, 1), expand = c(0, 0)) +
scale_fill_manual(values= c(`TRUE` = "white", `FALSE` = "#F7F7F7")) +
scale_color_manual(values= c("Action" = "red", "Non-Action" = "black", "not observed" = "lightgrey")) +
guides(fill="none") +
theme(
legend.position = "bottom",
panel.background = element_blank(),
axis.text.x = element_blank(),
axis.ticks.y = element_blank(),
axis.ticks.length = unit(0, "pt"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.line = element_blank(),
panel.border = element_blank()
)
}) +
theme(combmatrix.label.total_extra_spacing = unit(30, "pt"))
```
The `override_plotting_function` is incredibly powerful, but also an advanced feature that comes with pitfalls. Use at your own risk.
## Alternative Packages
There is already a package called `UpSetR` ([GitHub](https://github.com/hms-dbmi/UpSetR),
[CRAN](https://cran.r-project.org/package=UpSetR)) that provides very similar functionality
and that heavily inspired me to write this package.
It produces a similar plot with an additional view that shows the overall size
of each genre.
```{r}
# UpSetR
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
unnest(cols = Genres) %>%
mutate(GenreMember=1) %>%
pivot_wider(names_from = Genres, values_from = GenreMember, values_fill = list(GenreMember = 0)) %>%
as.data.frame() %>%
UpSetR::upset(sets = c("Action", "Romance", "Short", "Comedy", "Drama"), keep.order = TRUE)
# ggupset
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_upset(order_by = "degree", n_sets = 5)
```
The `UpSetR` package provides a lot convenient helpers around this kind of plot; the main
advantage of my package is that it can be combined with any kind of ggplot
that uses a categorical x-axis. This additional flexibility can be useful if
you want to create non-standard plots. The following plot for example shows
when movies of a certain genre were published.
```{r violinexample}
tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres, y=year)) +
geom_violin() +
scale_x_upset(order_by = "freq", n_intersections = 12)
```
# Advanced examples
#### 1. Complex experimental design
The combination matrix axis can be used to show complex experimental designs,
where each sample got a combination of different treatments.
```{r}
df_complex_conditions
df_complex_conditions %>%
mutate(Label = pmap(list(KO, DrugA, Timepoint), function(KO, DrugA, Timepoint){
c(if(KO) "KO" else "WT", if(DrugA == "Yes") "Drug", paste0(Timepoint, "h"))
})) %>%
ggplot(aes(x=Label, y=response)) +
geom_boxplot() +
geom_jitter(aes(color=KO), width=0.1) +
geom_smooth(method = "lm", aes(group = paste0(KO, "-", DrugA))) +
scale_x_upset(order_by = "degree",
sets = c("KO", "WT", "Drug", "8h", "24h", "48h"),
position="top", name = "") +
theme_combmatrix(combmatrix.label.text = element_text(size=12),
combmatrix.label.extra_spacing = 5)
```
#### 2. Aggregation of information
`dplyr` currently does not support list columns
as grouping variables. In that case it makes
sense to collapse it manually and use the
`axis_combmatrix()` function to get a good looking
plot.
```{r}
# Percentage of votes for n stars for top 12 genres
avg_rating <- tidy_movies %>%
mutate(Genres_collapsed = sapply(Genres, function(x) paste0(sort(x), collapse="-"))) %>%
mutate(Genres_collapsed = fct_lump(fct_infreq(as.factor(Genres_collapsed)), n=12)) %>%
group_by(stars, Genres_collapsed) %>%
summarize(percent_rating = sum(votes * percent_rating)) %>%
group_by(Genres_collapsed) %>%
mutate(percent_rating = percent_rating / sum(percent_rating)) %>%
arrange(Genres_collapsed)
avg_rating
# Plot using the combination matrix axis
# the red lines indicate the average rating per genre
ggplot(avg_rating, aes(x=Genres_collapsed, y=stars)) +
geom_tile(aes(fill=percent_rating)) +
stat_summary_bin(aes(y=percent_rating * stars), fun = sum, geom="point",
shape="—", color="red", size=6) +
axis_combmatrix(sep = "-", levels = c("Drama", "Comedy", "Short",
"Documentary", "Action", "Romance", "Animation", "Other")) +
scale_fill_viridis_c()
```
## Saving Plots
There is an important pitfall when trying to save a plot with a combination matrix.
When you use `ggsave()`, ggplot2 automatically saves the last plot that was created.
However, here `last_plot()` refers to only the combination matrix. To store the full
plot, you need to explicitly assign it to a variable and save that.
```{r warning=FALSE}
pl <- tidy_movies %>%
distinct(title, year, length, .keep_all=TRUE) %>%
ggplot(aes(x=Genres)) +
geom_bar() +
scale_x_upset(n_intersections = 20)
ggsave("/tmp/movie_genre_barchart.png", plot = pl)
```
## Session Info
```{r}
sessionInfo()
```