# Scatterplots, boxplots, and violin-boxplots {#sec-07-more-viz}
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE,
message = FALSE,
echo = TRUE)
```
Back in Chapter 3, we introduced you to data visualisation in R/RStudio using the package <pkg>ggplot2</pkg>. You developed foundational skills in using the layering system and customising your plots, but we only covered visualising one variable at a time in a histogram or barplot. This gets you a long way, but you typically want to visualise the relationship or difference between variables.
In this chapter, we develop your data visualisation skills to cover scatterplots, boxplots, and violin-boxplots. This will give you the skills to visualise your data in the next few chapters on inferential statistics. This is the final data visualisation specific chapter in the book, but by learning these core concepts, you will be able to find out how to create additional types of data visualisation independently.
**Chapter Intended Learning Outcomes (ILOs)**
By the end of this chapter, you will be able to:
- Create and edit a scatterplot to visualise the relationship between two continuous variables.
- Create and edit a boxplot to visualise summary statistics for a continuous outcome.
- Create and edit a violin-boxplot to visualise the density of data points in a continuous outcome.
- Customise your plots to include colourblind-friendly colour palettes and facet your data to visualise multiple independent variables.
## Chapter preparation
### Introduction to the data set
For this chapter, we are using open data from @zhang_present_2014. The abstract of their article is:
> Although documenting everyday activities may seem trivial, four studies reveal that creating records of the present generates unexpected benefits by allowing future rediscoveries. In Study 1, we used a time-capsule paradigm to show that individuals underestimate the extent to which rediscovering experiences from the past will be curiosity provoking and interesting in the future. In Studies 2 and 3, we found that people are particularly likely to underestimate the pleasure of rediscovering ordinary, mundane experiences, as opposed to extraordinary experiences. Finally, Study 4 demonstrates that underestimating the pleasure of rediscovery leads to time-inconsistent choices: Individuals forgo opportunities to document the present but then prefer rediscovering those moments in the future to engaging in an alternative fun activity. Underestimating the value of rediscovery is linked to people’s erroneous faith in their memory of everyday events. By documenting the present, people provide themselves with the opportunity to rediscover mundane moments that may otherwise have been forgotten.
In summary, they were interested in whether people could predict how interested they would be in rediscovering past experiences. They call it a "time capsule" effect, where people store photos or messages to remind themselves of past events in the future.
At the start of the study (time 1), participants in a romantic relationship wrote about two kinds of experiences: an "extraordinary" experience with their partner on Valentine's Day and an "ordinary" experience one week before. They were then asked how enjoyable, interesting, and meaningful they predicted they would find these recollections in three months' time (time 2). Three months later, Zhang et al. randomised participants into one of two groups. In the "extraordinary" group, they reread the extraordinary recollection. In the "ordinary" group, they reread the ordinary recollection. All the participants completed measures on how enjoyable, interesting, and meaningful they found the experience, but this time they rated what they actually felt, rather than what they predicted they would feel.
They predicted participants in the ordinary group would underestimate their future feelings (i.e., there would be a bigger difference between time 1 and time 2 measures) compared to participants in the extraordinary group. In this chapter, we focus on a composite measure which took the mean of items on interest, meaningfulness, and enjoyment.
### Organising your files and project for the chapter
Before we can get started, you need to organise your files and project for the chapter, so your working directory is in order.
1. In your folder for research methods and the book `ResearchMethods1_2/Quant_Fundamentals`, create a new folder called `Chapter_07_dataviz`. Within `Chapter_07_dataviz`, create two new folders called `data` and `figures`.
2. Create an R Project for `Chapter_07_dataviz` as an existing directory for your chapter folder. This should now be your working directory.
3. Create a new R Markdown document and give it a sensible title describing the chapter, such as `07 Scatterplots Boxplots Violins`. Delete everything below line 10 so you have a blank file to work with and save the file in your `Chapter_07_dataviz` folder.
4. We are working with a new data set, so please save the following data file: [Zhang_2014.csv](data/Zhang_2014.csv). Right click the link and select "save link as", or click the link to save the file to your Downloads folder. Make sure that you save the file as ".csv". Save or copy the file to your `data/` folder within `Chapter_07_dataviz`.
You are now ready to start working on the chapter!
### Activity 1 - Read and wrangle the data
As the first activity, test yourself by completing the following task list to practise your data wrangling skills. Create an object called `zhang_data` to be consistent with the tasks below. If you want to focus on data visualisation, then you can just type the code in the solution.
::: {.callout-tip}
#### Try this
To wrangle the data, complete the following tasks:
1. Load the <pkg>tidyverse</pkg> package.
2. Read the data file `data/Zhang_2014.csv`.
3. Select the following columns:
    - `Gender`
    - `Age`
    - `Condition`
    - `T1_Predicted_Interest_Composite` renamed to `time1_interest`
    - `T2_Actual_Interest_Composite` renamed to `time2_interest`.
4. There is currently no identifier, so create a new variable called `participant_ID`. Hint: try `participant_ID = row_number()`.
5. Recode two variables to be easier to understand and visualise:
    - Gender: 1 = "Male", 2 = "Female".
    - Condition: 1 = "Ordinary", 2 = "Extraordinary".
Your data should now be in wide format and ready to create a scatterplot.
:::
::: {.callout-caution collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r}
# Load the tidyverse package below
library(tidyverse)
# Load the data file
# This should be the Zhang_2014.csv file
zhang_data <- read_csv("data/Zhang_2014.csv")
# Wrangle the data for plotting.
# select and rename key variables
# mutate to add participant ID and recode
zhang_data <- zhang_data %>%
select(Gender,
Age,
Condition,
time1_interest = T1_Predicted_Interest_Composite,
time2_interest = T2_Actual_Interest_Composite) %>%
mutate(participant_ID = row_number(),
Condition = case_match(Condition,
1 ~ "Ordinary",
2 ~ "Extraordinary"),
Gender = case_match(Gender,
1 ~ "Male",
2 ~ "Female"))
```
:::
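If `case_match()` is new to you, a toy vector shows how the recoding behaves. This is a minimal sketch using made-up codes rather than the real data (the vector `gender_codes` is hypothetical), and it assumes you have dplyr 1.1.0 or later, where `case_match()` is available. Values that match no rule, including `NA`, stay missing.

```{r}
library(dplyr)

# Hypothetical codes standing in for the Gender column
gender_codes <- c(1, 2, NA, 2)

# Recode the numbers into labels; anything unmatched becomes NA
gender_labels <- case_match(gender_codes,
                            1 ~ "Male",
                            2 ~ "Female")

gender_labels
```

This means any missing codes in your data remain `NA` after recoding, which is worth remembering when you plot the variable later.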
### Activity 2 - Explore the data
::: {.callout-tip}
#### Try this
After the wrangling steps, try and explore `zhang_data` to see what variables you are working with. For example, opening the data object as a tab to scroll around, explore with `glimpse()`, or try plotting some of the individual variables to see what they look like using visualisation skills from Chapter 3.
:::
In `zhang_data`, we have the following variables:
| Variable | Type | Description |
|:--------------:|:---------------------------------|:-------------------------------|
| Gender | `r typeof(zhang_data$Gender)`| Participant gender: Male (1) or Female (2) |
| Age | `r typeof(zhang_data$Age)`| Participant age in years. |
| Condition | `r typeof(zhang_data$Condition)`| Condition participant was randomly allocated into: Ordinary (1) or Extraordinary (2). |
| time1_interest | `r typeof(zhang_data$time1_interest)`| How interested they predict they will find the recollection on a 1 (not at all) to 7 (extremely) scale. This measure is the mean of enjoyment, interest, and meaningfulness. |
| time2_interest | `r typeof(zhang_data$time2_interest)`| How interested they actually found the recollection on a 1 (not at all) to 7 (extremely) scale. This measure is the mean of enjoyment, interest, and meaningfulness. |
| participant_ID | `r typeof(zhang_data$participant_ID)`| Our new participant ID as an integer from 1 to 130. |
We will use this data set to demonstrate different ways of visualising continuous variables, either combining multiple continuous variables in a scatterplot or splitting continuous variables into categories in a boxplot or violin-boxplot.
## Scatterplots {#viz-a3}
The first visualisation is a `r glossary("scatterplot", def = "Plotting two variables on the x- and y-axis to show the correlation/relationship between the variables.")` to show the relationship between two continuous variables. One variable goes on the x-axis and the other variable goes on the y-axis. Each dot then represents the intersection of those two variables per observation/participant. You will use these plots often when reporting a correlation or regression.
### Activity 3 - Creating a basic scatterplot
Let us start by making a scatterplot of `Age` and `time1_interest` to see if there is any relationship between the two. We need to specify both the x- and y-axis variables, but the only difference to what we created in Chapter 3 is using a new layer, `geom_point()`.
```{r scat1}
zhang_data %>%
ggplot(aes(x = time1_interest, y = Age)) +
geom_point()
```
### Activity 4 - Editing axis labels
This plot is great for some exploratory data analysis, but it looks a little untidy to put into a report. We can use the `scale_x_continuous` and `scale_y_continuous` layers to control the tick marks, as well as the axis name.
```{r scat2}
zhang_data %>%
ggplot(aes(x = time1_interest,y = Age)) +
geom_point() +
scale_x_continuous(name = "Time 1 interest score (1-7)",
breaks = c(1:7)) + # tick marks from 1 to 7
scale_y_continuous(name = "Age",
limits = c(15, 45), # change limits to 15 to 45
breaks = seq(from = 15, # sequence from 15
to = 45, # to 45
by = 5)) # in steps of 5
```
To break down these new arguments/functions in the layers:
- `breaks` sets the tick marks on the plot. We demonstrate two ways of setting this. On the x-axis, we just manually set values for 1 to 7. On the y-axis, we use a second function to set the breaks.
- `seq()` creates a sequence of numbers and can save a lot of time when you need to add lots of values. We set three arguments, `from` for the starting point, `to` for the end point, and `by` for the steps the sequence goes up in.
- `limits` controls the start and end point of the graph scale. In the original graph, we can see there are points below 20 and above 40, so we might want to increase the `limits` of the graph to include a wider range.
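To see what these break values look like, you can run the two approaches on their own in the Console. This is just a small sketch; the object names are chosen for illustration.

```{r}
# Two equivalent ways of writing the x-axis breaks
x_breaks <- 1:7
x_breaks_seq <- seq(from = 1, to = 7, by = 1)

# The y-axis breaks used in the plot above
y_breaks <- seq(from = 15, to = 45, by = 5)

x_breaks
y_breaks
```

For a short run of whole numbers, `1:7` is quicker to type, while `seq()` is more convenient when you need larger steps.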
::: {.callout-important}
#### Error mode
When controlling the limits of the graph, sometimes you want to decrease the `limits` range to zoom in on an element of the data. You must be very careful here: if the narrower range excludes some data points, `scale_y_continuous()` removes them from the plot entirely, and you receive a warning about it:
```{r warning=TRUE}
zhang_data %>% 
  ggplot(aes(x = time1_interest, y = Age)) +
  geom_point() +
  scale_y_continuous(name = "Age",
                     limits = c(30, 40)) # narrow the limits to 30-40, removing points outside
```
You must be very careful when truncating axes, but if you *do* need to do it, `coord_cartesian()` is the layer to use. It zooms in on the plot without removing the underlying data:
```{r}
zhang_data %>%
ggplot(aes(x = time1_interest,y = Age)) +
geom_point() +
coord_cartesian(ylim = c(30, 40))
```
:::
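Related to the warning above, it can help to count how many observations fall outside a range before you narrow the `limits`. The sketch below uses a small made-up vector of ages (`ages`, `lower`, and `upper` are hypothetical and not taken from the Zhang et al. data) to show the idea.

```{r}
# Hypothetical ages for illustration only
ages <- c(18, 22, 25, 31, 34, 38, 41, 44)

lower <- 30
upper <- 40

# Points outside the range would be removed by scale_y_continuous(limits = ...)
n_dropped <- sum(ages < lower | ages > upper)
n_dropped
```

If the count is larger than you expected, that is a good prompt to reconsider whether truncating the axis is the right choice.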
### Activity 5 - Adding a regression line
It is often useful to add a regression line or line of best fit to a scatterplot. You can add one with the `geom_smooth()` layer, which by default also draws a 95% confidence interval ribbon. You can specify what type of line you want to draw; most often you will need `method = "lm"` for a linear model, which draws a straight line.
```{r scat3}
zhang_data %>%
ggplot(aes(x = time1_interest,y = Age)) +
geom_point() +
scale_x_continuous(name = "Time 1 interest score (1-7)",
breaks = c(1:7)) + # tick marks from 1 to 7
scale_y_continuous(name = "Age",
limits = c(15, 45), # change limits to 15 to 45
breaks = seq(from = 15, # sequence from 15
to = 45, # to 45
by = 5)) + # in steps of 5
geom_smooth(method = "lm")
```
With the regression line, we can see there is very little relationship between age and interest score at time 1.
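The straight line drawn by `geom_smooth(method = "lm")` comes from a linear model, which you can also fit yourself with `lm()`. As a rough sketch on simulated data (the data frame `toy_data` and its columns are invented for illustration), you can inspect the intercept and slope directly:

```{r}
# Simulated data: y increases by roughly 2 for every 1-unit increase in x
set.seed(123)
toy_data <- data.frame(x = 1:50)
toy_data$y <- 3 + 2 * toy_data$x + rnorm(50, sd = 1)

# Fit a linear model of y predicted by x
toy_model <- lm(y ~ x, data = toy_data)

coef(toy_model) # intercept and slope of the straight line
```

A slope close to zero corresponds to the nearly flat line you see when there is very little relationship between two variables.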
::: {.callout-important}
Remember, you can save your plots using the function `ggsave()`. You can use the function straight after creating a plot, which saves the last plot you made, or save your plot as an object and use the `plot` argument. You have a `figures/` directory for the chapter, so try and save the plots you make to remind yourself later.
:::
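Here is a minimal sketch of saving a plot object with `ggsave()`. The object name `age_plot`, the file name, and the dimensions are placeholders rather than requirements, and the data are made up so the example runs on its own. It writes to a temporary folder; in your project, you would use a path inside your figures folder instead.

```{r}
library(ggplot2)

# Made-up data so the example is self-contained
toy_data <- data.frame(interest = c(2, 3, 4, 5, 6),
                       age = c(19, 24, 31, 28, 40))

# Save the plot as an object...
age_plot <- ggplot(toy_data, aes(x = interest, y = age)) +
  geom_point()

# ...then pass it to the plot argument of ggsave()
plot_file <- file.path(tempdir(), "age_scatterplot.png")

ggsave(filename = plot_file,
       plot = age_plot,
       width = 6,
       height = 4,
       units = "in")
```

Setting the width and height explicitly means the saved figure has the same dimensions every time, regardless of the size of your plot window.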
::: {.callout-tip}
#### Try this
So far, we made a scatterplot of age against interest at time 1. Now, create a scatterplot on your own using the two interest rating variables: `time1_interest` and `time2_interest`.
After you made the scatterplot, it looks like there is a `r mcq(c(answer = "positive", "negative"))` relationship between interest ratings at time 1 and time 2.
:::
::: {.callout-caution collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r}
zhang_data %>%
ggplot(aes(x = time1_interest, y = time2_interest)) +
geom_point() +
scale_x_continuous(name = "Time 1 interest score (1-7)",
breaks = c(1:7)) + # tick marks from 1 to 7
scale_y_continuous(name = "Time 2 interest score (1-7)",
breaks = c(1:7)) + # tick marks from 1 to 7
geom_smooth(method = "lm")
```
:::
### Activity 6 - Creating a grouped scatterplot
Before we move on, we can add a third variable to show how the relationship might differ for different groups within our data. We can do this by adding the `colour` argument to `aes()` and setting it as whatever variable we would like to distinguish between. In this case, we will see how the relationship between age and interest at time 1 differs for the male and female participants. There are a few participants with missing gender, so we will first filter them out.
```{r, scat4, fig.cap = "Grouped scatterplot", warning=FALSE}
zhang_data %>%
drop_na(Gender) %>%
ggplot(aes(x = time1_interest, y = Age, colour = Gender)) +
geom_point() +
scale_x_continuous(name = "Mean interest score (1-7)",
breaks = c(1:7)) +
scale_y_continuous(name = "Age") +
geom_smooth(method = "lm")
```
::: {.callout-tip}
#### Try this
For your independent scatterplot of the two interest rating variables: `time1_interest` and `time2_interest`, add a `colour` argument using the `Condition` variable. This will show the relationship between time 1 and time 2 interest separately for participants in the ordinary and extraordinary groups.
:::
::: {.callout-caution collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r}
zhang_data %>%
ggplot(aes(x = time1_interest, y = time2_interest, colour = Condition)) +
geom_point() +
scale_x_continuous(name = "Time 1 interest score (1-7)",
breaks = c(1:7)) + # tick marks from 1 to 7
scale_y_continuous(name = "Time 2 interest score (1-7)",
breaks = c(1:7)) + # tick marks from 1 to 7
geom_smooth(method = "lm")
```
:::
## Boxplots {#viz-a4}
The next visualisation is the `r glossary("boxplot", def = "Visualising a continuous variable by five summary statistics: the median centre line, the first and third quartile, and whiskers extending to the most extreme values within 1.5 times the interquartile range of the first and third quartiles.")` which presents a range of summary statistics for your outcome, which you can split between different groups on the x-axis, or add further variables to divide by. For the boxplot element, you get five summary statistics: the median centre line, the first and third quartile as the box (essentially, the interquartile range), and whiskers extending from the box to the most extreme values within 1.5 times the interquartile range below the first quartile and above the third quartile. If there are any values beyond the whiskers, you see the individual data points and this is one definition of an outlier (more on that in Chapter 11).
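To make these statistics concrete, the sketch below calculates them by hand for a small made-up vector (`scores` is hypothetical). It uses `quantile()` with its default settings, so treat it as an illustration of the logic rather than a guarantee of matching a plot to the last decimal place.

```{r}
# Hypothetical scores with one extreme value
scores <- c(1, 2, 3, 4, 5, 6, 7, 20)

q1 <- quantile(scores, 0.25, names = FALSE) # first quartile (bottom of the box)
med <- median(scores)                       # centre line
q3 <- quantile(scores, 0.75, names = FALSE) # third quartile (top of the box)
iqr <- q3 - q1                              # interquartile range

# Whiskers reach the most extreme values within 1.5 * IQR of the box
lower_whisker <- min(scores[scores >= q1 - 1.5 * iqr])
upper_whisker <- max(scores[scores <= q3 + 1.5 * iqr])

# Anything beyond the whiskers would be shown as an individual point
outliers <- scores[scores < lower_whisker | scores > upper_whisker]

c(lower_whisker, q1, med, q3, upper_whisker)
outliers
```

Notice that the whiskers stop at real data values, so they are often shorter than the full 1.5 times the interquartile range.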
### Activity 7 - Creating a basic boxplot
Before we create the boxplot, we need a final data wrangling step. At the moment, we have `time1_interest` and `time2_interest` in wide format, but to plot together, we need to express it as a single variable. For that, we must restructure the data. This is why we spent so much time on data wrangling, as you might need to quickly restructure your data to plot certain elements.
::: {.callout-tip}
#### Try this
To wrangle the data, gather the variables `time1_interest` and `time2_interest`. Create a new object called `zhang_data_long` and use the names `Time` and `Interest` for your column names to be consistent with the demonstrations below.
:::
::: {.callout-caution collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r}
# gather the data to convert to long format
zhang_data_long <- zhang_data %>%
pivot_longer(cols = time1_interest:time2_interest,
names_to = "Time",
values_to = "Interest")
```
:::
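If the restructuring is hard to picture, this small sketch applies the same `pivot_longer()` call to two made-up participants (the tibble `toy_wide` and its values are hypothetical). Each participant goes from one row with two interest columns to two rows with one `Interest` column.

```{r}
library(dplyr)
library(tidyr)

# Two hypothetical participants in wide format
toy_wide <- tibble(participant_ID = c(1, 2),
                   time1_interest = c(4.0, 5.5),
                   time2_interest = c(4.5, 6.0))

# Convert to long format: one row per participant per time point
toy_long <- toy_wide %>% 
  pivot_longer(cols = time1_interest:time2_interest,
               names_to = "Time",
               values_to = "Interest")

toy_long
```

The long version has twice as many rows, with `Time` recording which original column each `Interest` value came from.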
If you only want to visualise one continuous variable, you need one variable on the y-axis and a new function layer `geom_boxplot()`.
```{r bp1}
zhang_data_long %>%
ggplot(aes(y = Interest)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7))
```
Typically, you want to compare the outcome between one or more categories, so we can add a categorical variable like gender to the x-axis, removing the missing values first.
```{r}
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7))
```
### Activity 8 - Adding colour to variables
It is not as important when you only have one variable on the x-axis, but one useful feature is adding colour to distinguish between categories. You can control this by adding a variable to the `fill` argument within `aes()`.
By default, we get a legend which is redundant when we only have different colours on the x-axis, so we can turn it off by adding `guides(fill = "none")` as a layer.
```{r bp4}
zhang_data_long %>% 
  drop_na(Gender) %>% 
  ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
  geom_boxplot() +
  scale_y_continuous(name = "Interest score (1-7)",
                     breaks = c(1:7)) +
  guides(fill = "none") # remove the legend
```
::: {.callout-important}
#### Error mode
You might have noticed we have now used two different arguments to control the colour. In scatterplots, we used `colour`. In boxplots, we used `fill`. It takes time to recognise which one you need, as it depends on the type of geom you are using. Roughly, `colour` controls the outline or symbol, like the points, whereas `fill` controls the colour inside a geom. You can see the difference here by controlling `fill` first:
```{r}
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7))
```
Then `colour`:
```{r}
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender, colour = Gender)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7))
```
:::
### Activity 9 - Controlling colours
<pkg>ggplot2</pkg> has a default colour scheme which is fine for quick plots, but it is useful to control the colour scheme. You can do this manually by editing `scale_fill_discrete()` and choosing colours through the `type` argument (you can do this through character names or choosing a HEX code: [https://r-charts.com/colors/](https://r-charts.com/colors/){target="_blank"}).
```{r}
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
scale_fill_discrete(type = c("blue", "pink"))
```
Alternatively (and what we recommend), you can use `scale_fill_viridis_d()`. This function does the same job but uses a colourblind-friendly palette (which also prints well in black and white). There are several palettes to choose from, and you can see them by changing `option` to a letter such as A, B, C, D, or E. We like option E with `alpha = 0.6` (to control transparency and soften the tone) but play around with the options to see what you prefer.
```{r, fig.cap= "Boxplots with friendly colours"}
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
scale_fill_viridis_d(option = "E",
alpha = 0.6) +
guides(fill = "none")
```
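If you are curious which colours a palette will use, you can ask for the underlying HEX codes. This sketch assumes the <pkg>scales</pkg> package is available (it is installed alongside <pkg>ggplot2</pkg>) and uses its `viridis_pal()` function; the object name `option_e_colours` is just for illustration.

```{r}
library(scales)

# Request two colours from option "E", one for each category
option_e_colours <- viridis_pal(option = "E")(2)

option_e_colours
```

Changing the number in the final brackets shows how the palette spreads its colours when you have more categories.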
::: {.callout-tip}
#### Try this
For your independent boxplot, use `zhang_data_long` to visualise `Interest` as your continuous variable and `Condition` for different categories. This will show the difference in interest rating between those in the ordinary and extraordinary groups.
Comparing the ordinary and extraordinary groups, it looks like `r mcq(c("ordinary score higher on average", answer = "very little difference on average", "extraordinary score higher on average"))`.
:::
::: {.callout-caution collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r}
zhang_data_long %>%
ggplot(aes(y = Interest, x = Condition, fill = Condition)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
scale_fill_viridis_d(option = "E",
alpha = 0.6) +
guides(fill = "none")
```
:::
### Activity 10 - Ordering categories
When we plot variables like `Gender` on the x-axis, R has an internal order it sets unless you create a factor. The default is alphabetical or numerical. In previous plots, it displayed Female then Male, as F comes before M.
Controlling the order of categories is an important design choice to communicate your message, and the most direct way is controlling the factor order before plotting. Here, we add `mutate()` in a pipe and manually set the factor levels. Just be careful, as the levels are case sensitive and must match the values in your data exactly.
```{r}
zhang_data_long %>%
drop_na(Gender) %>%
mutate(Gender = factor(Gender,
levels = c("Male", "Female"))) %>%
ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
scale_fill_viridis_d(option = "E",
alpha = 0.6) +
guides(fill = "none")
```
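You can check the order R will use without drawing a plot. The sketch below uses a short made-up vector (`gender` is hypothetical) to show the default alphabetical order, the manual order, and what happens when the case of a level does not match the data.

```{r}
# Hypothetical values for illustration
gender <- c("Male", "Female", "Female", "Male")

# Default: levels are sorted alphabetically
levels(factor(gender))

# Manual: levels follow the order you give
gender_ordered <- factor(gender, levels = c("Male", "Female"))
levels(gender_ordered)

# Case mismatch: "male" does not appear in the data, so "Male" values become NA
gender_mismatch <- factor(gender, levels = c("male", "Female"))
sum(is.na(gender_mismatch))
```

If a category unexpectedly disappears from your plot after setting factor levels, a mismatch like this is a likely cause.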
### Activity 11 - Boxplots for multiple factors
When you only have one independent variable, using the `fill` argument to change the colour can be a little redundant as the colours do not add any additional information. It makes more sense to use colour to represent a second variable.
For this example, we will use `Condition` and `Time` as variables. The `fill` argument now specifies a second independent variable, rather than repeating the variable on the x-axis as in the previous plot, so we do not want to deactivate the legend.
```{r bp5}
zhang_data_long %>%
ggplot(aes(y = Interest, x = Condition, fill = Time)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
scale_fill_viridis_d(option = "E",
alpha = 0.6)
```
As a final point here, the `fill` values on the legend are not the most professional looking. Like reordering factors, the easiest way of addressing this is editing the underlying data before piping to <pkg>ggplot2</pkg>.
```{r}
zhang_data_long %>%
mutate(Time = case_match(Time,
"time1_interest" ~ "Time 1",
"time2_interest" ~ "Time 2")) %>%
ggplot(aes(y = Interest, x = Condition, fill = Time)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
scale_fill_viridis_d(option = "E",
alpha = 0.6)
```
## Violin-boxplots {#viz-a7}
Boxplots are great for your own exploratory data analysis but you do not often see them reported in isolation. They visualise summary statistics, but you do not get much sense of the underlying distribution of values. When you want to communicate continuous outcomes, researchers in psychology are using `r glossary("violin-boxplots", def = "A combination of a violin plot to show the density of data points and a boxplot to show summary statistics of distribution.")` more often. This combines both elements: a violin plot to show the distribution of the data, and a boxplot to add summary statistics. This is where <pkg>ggplot2</pkg> comes into its own as we can add and customise several layers.
### Activity 12 - Creating a basic violin plot
Violin plots get their name as they look something like a violin when the data are roughly normally distributed. They show density, so the fatter the violin element, the more data points there are for that value. Compared to the boxplot, the only difference is changing the layer to `geom_violin()`.
```{r vp1}
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender)) +
geom_violin() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7))
```
The distribution of values is great, but sometimes it might be useful to also add the underlying data points. This is an important design choice, as showing the points can be useful when you have smaller amounts of data but overwhelming when you have thousands of data points. So, keep in mind what you want to communicate. Here, we use the layer `geom_jitter()` to jitter the points slightly, so they are not all in a vertical line and we get a better sense of the density.
```{r vp2}
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender)) +
geom_violin() +
geom_jitter(height = 0, # do not jitter height
width = .1) + # jitter width of points
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7))
```
::: {.callout-important}
It is important to remember that R is very literal. <pkg>ggplot2</pkg> works on a system of layers. It will add new geoms on top of existing ones and it will not stop to think whether this is a good idea. Try running the code above but put `geom_jitter()` first and then add `geom_violin()`. The order of your layers matters.
:::
### Activity 13 - Creating a violin-boxplot
Instead of adding the data points in a layer, we can add a boxplot to create the violin-boxplot. This way, we get distribution information from the violin layer and summary statistics from the boxplot layer.
```{r}
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender)) +
geom_violin() +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7))
```
On its own, this does not look great. We can edit the settings to reduce the width of the boxplots, add a colour scheme, and add transparency to the violin layer to make it easier to see the boxplot.
```{r}
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
geom_violin(alpha = 0.5) +
geom_boxplot(width = 0.2) +
scale_fill_viridis_d(option = "E") +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
guides(fill = "none")
```
The boxplot uses the median for the centre line, but in your report you might be presenting means per category which will be slightly different. One further variation is removing the centre median line, and replacing it with the mean and 95% confidence interval (more on that in the lectures and Chapter 8). This way, you get three layers: the violin plot for the density, the boxplot for distribution summary statistics, and the mean and 95% confidence interval.
This code uses two calls to `stat_summary()` which is a layer to add summary statistics. The first layer draws a `point` to represent the mean, and the second draws an `errorbar` that represents the 95% confidence interval around the mean.
```{r vbp1}
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender, fill = Gender)) +
geom_violin(alpha = 0.5) +
geom_boxplot(width = 0.2,
fatten = NULL) +
stat_summary(fun = "mean",
geom = "point") +
stat_summary(fun.data = "mean_cl_boot", # confidence interval
geom = "errorbar",
width = 0.1) +
scale_fill_viridis_d(option = "E") +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
guides(fill = "none")
```
::: {.callout-warning}
When you run the line `stat_summary(fun.data = "mean_cl_boot", geom = "errorbar", width = .1)` for the first time, you might be prompted to install the R package <pkg>Hmisc</pkg>. If you are on your own computer, follow the instructions in the Console to install the package. If you are on a university computer, this should already be installed.
:::
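If you are curious about what a bootstrapped confidence interval involves, the following is a minimal sketch of a percentile bootstrap for the mean. The `toy_scores` values are made up for illustration and are not from the Zhang et al. data, and the exact method behind `mean_cl_boot` may differ in its details, so treat this as the general idea rather than the function itself.

```{r eval=FALSE}
# Hypothetical interest scores, invented for this illustration
toy_scores <- c(3.5, 4.0, 4.5, 5.0, 3.0, 6.0, 4.5, 5.5, 4.0, 5.0)

# Fix the random seed so the resampling is reproducible
set.seed(1234)

# Resample the scores with replacement 1000 times, saving the mean each time
boot_means <- replicate(1000, mean(sample(toy_scores, replace = TRUE)))

# The middle 95% of the resampled means gives an approximate 95% confidence interval
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

# Compare the observed mean to the interval around it
mean(toy_scores)
boot_ci
```

The error bar in the plot shows the same kind of interval: a lower and upper limit either side of the mean.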
::: {.callout-tip}
#### Try this
For your independent violin-boxplot, use `zhang_data_long` to visualise `Interest` as your continuous variable and `Condition` for different categories on the x-axis. Try to recreate the plot below, so you might need to play around with different themes:
```{r echo=FALSE}
zhang_data_long %>%
ggplot(aes(y = Interest, x = Condition, fill = Condition)) +
geom_violin(alpha = 0.5) +
geom_boxplot(width = 0.2,
fatten = NULL) +
stat_summary(fun = "mean",
geom = "point") +
stat_summary(fun.data = "mean_cl_boot",
geom = "errorbar",
width = .1) +
scale_fill_viridis_d(option = "E") +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
theme_minimal() +
guides(fill = "none")
```
:::
::: {.callout-caution collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r, eval=FALSE}
zhang_data_long %>%
ggplot(aes(y = Interest, x = Condition, fill = Condition)) +
geom_violin(alpha = 0.5) +
geom_boxplot(width = 0.2,
fatten = NULL) +
stat_summary(fun = "mean",
geom = "point") +
stat_summary(fun.data = "mean_cl_boot",
geom = "errorbar",
width = .1) +
scale_fill_viridis_d(option = "E") +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
theme_minimal() +
guides(fill = "none")
```
:::
### Activity 14 - Adding additional variables
Like boxplots, we can add a second grouping variable to `fill` instead of just using it for colour.
```{r}
zhang_data_long %>%
ggplot(aes(y = Interest, x = Condition, fill = Time)) +
geom_violin(alpha = 0.5) +
geom_boxplot(width = 0.2,
fatten = NULL) +
stat_summary(fun = "mean",
geom = "point") +
stat_summary(fun.data = "mean_cl_boot",
geom = "errorbar",
width = .1) +
scale_fill_viridis_d(option = "E") +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
theme_minimal()
```
However, unless you are trying to recreate a Kandinsky painting in <pkg>ggplot2</pkg>, that does not look quite right. This is because we have multiple layers that each plot separate groups in different ways. To make it all fall into line, we need to add a constant value to offset the elements. We start off by defining a position dodge value as an object. This way, we can use the object name later, and we only need to edit it in one place if we wanted to change the value.
```{r}
# specify as an object, so we only change it in one place
dodge_value <- 0.9
zhang_data_long %>%
ggplot(aes(y = Interest, x = Condition, fill = Time)) +
geom_violin(alpha = 0.5) +
geom_boxplot(width = 0.2,
fatten = NULL,
position = position_dodge(dodge_value)) +
stat_summary(fun = "mean",
geom = "point",
position = position_dodge(dodge_value)) +
stat_summary(fun.data = "mean_cl_boot",
geom = "errorbar",
width = .1,
position = position_dodge(dodge_value)) +
scale_fill_viridis_d(option = "E") +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7))
```
This looks much better! Remember, if you want to change the legend labels, the easiest way is recoding the data before piping to <pkg>ggplot2</pkg>.
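As a reminder of what that recoding might look like, here is a minimal sketch. The `toy_long` tibble is hypothetical data that borrows the column names from `zhang_data_long`, and the new labels "Time 1" and "Time 2" are just one possible choice.

```{r eval=FALSE}
library(tidyverse)

# Hypothetical long-format data with the same column names as zhang_data_long
toy_long <- tibble(
  Condition = rep(c("Ordinary", "Extraordinary"), each = 4),
  Time = rep(c("time1_interest", "time2_interest"), times = 4),
  Interest = c(4, 5, 3.5, 4.5, 5, 6, 4.5, 5.5)
)

# Recode Time so the legend shows reader-friendly labels
toy_recoded <- toy_long %>% 
  mutate(Time = case_match(Time,
                           "time1_interest" ~ "Time 1",
                           "time2_interest" ~ "Time 2"))

# The legend picks up the recoded values without any extra ggplot2 code
toy_recoded %>% 
  ggplot(aes(y = Interest, x = Condition, fill = Time)) + 
  geom_boxplot() + 
  scale_fill_viridis_d(option = "E")
```

Because the labels live in the data, every plot you create from the recoded object will use them consistently.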
Finally, we might want to add a third variable to group the data by. The `facet_wrap()` function produces a separate panel for each level of a grouping variable, which can be very useful when you have more than two factors. The following code shows interest ratings for all three variables we have worked with: Condition, Time, and Gender.
```{r facet1, message = FALSE}
# specify as an object, so we only change it in one place
dodge_value <- 0.9
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Condition, fill = Time)) +
geom_violin(alpha = 0.5) +
geom_boxplot(width = 0.2,
fatten = NULL,
position = position_dodge(dodge_value)) +
stat_summary(fun = "mean",
geom = "point",
position = position_dodge(dodge_value)) +
stat_summary(fun.data = "mean_cl_boot",
geom = "errorbar",
width = .1,
position = position_dodge(dodge_value)) +
facet_wrap(~ Gender) + # facet by Gender
scale_fill_viridis_d(option = "E") +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7))
```
Facets work in the same way as adding a variable to `fill`. It is not easy to change the labels within <pkg>ggplot2</pkg>, so you are better off editing the values in your data first.
## Test yourself
To end the chapter, we have some knowledge check questions to test your understanding of the concepts we covered in the chapter. We then have some error mode tasks to see if you can find the solution to some common errors in the concepts we covered in this chapter.
### Knowledge check
**Question 1**. You want to plot several summary statistics, including the median, for your outcome. Which <pkg>ggplot2</pkg> layer could you use?
`r longmcq(sample(c(answer = "geom_boxplot()", "geom_point()", "geom_violin()")))`
**Question 2**. You want to create a scatterplot to show the correlation between two continuous variables. Which <pkg>ggplot2</pkg> layer could you use?
`r longmcq(sample(c(answer = "geom_point()", "geom_violin()", "geom_boxplot()")))`
**Question 3**. You want to show the density of values in your outcome. Which <pkg>ggplot2</pkg> layer could you use?
`r longmcq(sample(c(answer = "geom_violin()", "geom_point()", "geom_boxplot()")))`
**Question 4**. To separate a scatterplot into different groups, you could specify a grouping variable using the `fill` argument to change the colour of the points? `r torf(FALSE)`
::: {.callout-caution collapse="true"}
#### Explain this answer
This was a sneaky one, but relates to the error mode warning within the chapter. There are two ways to add a grouping variable for separate colours: `colour` and `fill`. In this scenario, `colour` would change the colour of the points, whereas `fill` would leave the points as they are and only change elements that have a fill, such as the 95% confidence interval ribbon around a regression line. Sometimes you need to play around with the settings to produce the effects you want.
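You can check this for yourself with a small sketch. The `toy_data` tibble and its column names are made up for illustration, and the plots use the default point shape.

```{r eval=FALSE}
library(tidyverse)

# Hypothetical data: two continuous variables and one grouping variable
toy_data <- tibble(
  x_score = c(1, 2, 3, 4, 5, 6),
  y_score = c(2, 3, 3, 5, 4, 6),
  group = rep(c("A", "B"), times = 3)
)

# colour changes the colour of the points for each group
colour_plot <- toy_data %>% 
  ggplot(aes(x = x_score, y = y_score, colour = group)) + 
  geom_point()

# fill leaves the default points unchanged as they have no fill to colour in
fill_plot <- toy_data %>% 
  ggplot(aes(x = x_score, y = y_score, fill = group)) + 
  geom_point()

colour_plot
fill_plot
```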
:::
**Question 5**. The order of layers is important in <pkg>ggplot2</pkg>. Which order of layers would show individual data points on top of a boxplot?
`r longmcq(sample(c(answer = "data %>% ggplot() + geom_boxplot() + geom_jitter()", "data + ggplot() + geom_boxplot() + geom_jitter()", "data + ggplot() + geom_jitter() + geom_boxplot()", "data %>% ggplot() + geom_jitter() + geom_boxplot()")))`
::: {.callout-caution collapse="true"}
#### Explain this answer
In addition to the layer order, we also added an error mode feature to recognise when you need to use the pipe `%>%` vs the `+`.
- `data %>% ggplot() + geom_boxplot() + geom_jitter()` was the correct answer as we add the data points after the boxplot.
- `data + ggplot() + geom_boxplot() + geom_jitter()` had the right order, but we used `+` instead of the pipe between the data and the initial `ggplot()` function.
- `data + ggplot() + geom_jitter() + geom_boxplot()` and `data %>% ggplot() + geom_jitter() + geom_boxplot()` both had the wrong layer order as the boxplot would overlay the points.
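
The correct order can be sketched with a small example. The `toy_data` tibble here is hypothetical and only exists to give the layers something to draw.

```{r eval=FALSE}
library(tidyverse)

# Hypothetical data: one grouping variable and one continuous outcome
toy_data <- tibble(
  group = rep(c("A", "B"), each = 5),
  score = c(2, 3, 4, 5, 6, 3, 4, 5, 6, 7)
)

points_on_top <- toy_data %>% 
  ggplot(aes(x = group, y = score)) + 
  geom_boxplot() +           # drawn first, so it sits underneath
  geom_jitter(width = 0.1)   # drawn second, so the points sit on top

points_on_top
```

If you swap the two geoms around, the boxplot is drawn last and covers some of the points.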
:::
### Error mode
The following questions are designed to introduce you to making and fixing errors. For this topic, we focus on the new types of data visualisation. Remember to keep a note of what kind of error messages you receive and how you fixed them, so you have a bank of solutions when you tackle errors independently.
Create and save a new R Markdown file for these activities. Delete the example code, so your file is blank from line 10. Create a new code chunk to load <pkg>tidyverse</pkg> and wrangle the data files:
```{r eval=FALSE}
# Load the tidyverse package below
library(tidyverse)
# Load the data file
# This should be the Zhang_2014.csv file
zhang_data <- read_csv("data/Zhang_2014.csv")
# Wrangle the data for plotting.
# select and rename key variables
# mutate to add participant ID and recode
zhang_data <- zhang_data %>%
select(Gender,
Age,
Condition,
time1_interest = T1_Predicted_Interest_Composite,
time2_interest = T2_Actual_Interest_Composite) %>%
mutate(participant_ID = row_number(),
Condition = case_match(Condition,
1 ~ "Ordinary",
2 ~ "Extraordinary"),
Gender = case_match(Gender,
1 ~ "Male",
2 ~ "Female"))
# gather the data to convert to long format
zhang_data_long <- zhang_data %>%
pivot_longer(cols = time1_interest:time2_interest,
names_to = "Time",
values_to = "Interest")
```
Below, we have several variations of a code chunk error or misspecification. Copy and paste them into your R Markdown file below the code chunk to load <pkg>tidyverse</pkg> and the data files. Once you have copied the activities, click knit and look at the error message you receive. See if you can fix the error and get it working before checking the answer.
**Question 6**. Copy the following code chunk into your R Markdown file and press knit. This code... works, but it does not look quite right? Why are the tick marks not displaying properly?
````{verbatim, lang = "markdown"}
```{r}
zhang_data %>%
ggplot(aes(x = time1_interest, y = time2_interest)) +
geom_point() +
theme_classic() +
scale_x_discrete(name = "Time 1 interest score (1-7)",
breaks = c(1:7)) + # tick marks from 1 to 7
scale_y_discrete(name = "Time 2 interest score (1-7)",
breaks = c(1:7)) + # tick marks from 1 to 7
geom_smooth(method = "lm")
```
````
::: {.callout-caution collapse="true"}
#### Explain the solution
In this example, we used the wrong function for continuous variables. We used `scale_x_discrete` and `scale_y_discrete`, instead of `scale_x_continuous` and `scale_y_continuous`. We must honour the variable type when we customise the plot, so think about what type of variable is on each axis and which function lets you edit it.
```{r eval = FALSE}
zhang_data %>%
ggplot(aes(x = time1_interest, y = time2_interest)) +
geom_point() +
theme_classic() +
scale_x_continuous(name = "Time 1 interest score (1-7)",
breaks = c(1:7)) + # tick marks from 1 to 7
scale_y_continuous(name = "Time 2 interest score (1-7)",
breaks = c(1:7)) + # tick marks from 1 to 7
geom_smooth(method = "lm")
```
:::
**Question 7**. Copy the following code chunk into your R Markdown file and press knit. You should receive an error like `Error in "fortify()":! "data" must be a <data.frame>, or an object coercible by "fortify()"` which is a little cryptic.
````{verbatim, lang = "markdown"}
```{r}
zhang_data_long +
ggplot(aes(y = Interest, x = Condition, fill = Condition)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
scale_fill_viridis_d(option = "E",
alpha = 0.6) +
guides(fill = "none")
```
````
::: {.callout-caution collapse="true"}
#### Explain the solution
Once we start using a mixture of <pkg>tidyverse</pkg> functions, it is important to remember which uses a pipe `%>%` between layers, and which uses `+`. Here, we tried using the `+` between the data object and the initial `ggplot()` layer. We need a pipe here or it thinks you are trying to set the data argument using `aes()`.
```{r eval = FALSE}
zhang_data_long %>%
ggplot(aes(y = Interest, x = Condition, fill = Condition)) +
geom_boxplot() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
scale_fill_viridis_d(option = "E",
alpha = 0.6) +
guides(fill = "none")
```
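It can help to remember what the pipe is doing here: it passes the object on its left as the first argument of `ggplot()`, which is the `data` argument. The sketch below uses a hypothetical `toy_data` tibble to show that piping the data in and naming the `data` argument yourself should give you the same plot.

```{r eval=FALSE}
library(tidyverse)

# Hypothetical data borrowing two column names from zhang_data_long
toy_data <- tibble(
  Condition = rep(c("Ordinary", "Extraordinary"), each = 3),
  Interest = c(3, 4, 5, 4, 5, 6)
)

# Piping the data in...
piped_plot <- toy_data %>% 
  ggplot(aes(y = Interest, x = Condition)) + 
  geom_boxplot()

# ...is the same as naming the data argument yourself
named_plot <- ggplot(data = toy_data, aes(y = Interest, x = Condition)) + 
  geom_boxplot()

piped_plot
named_plot
```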
:::
**Question 8**. Copy the following code chunk into your R Markdown file and press knit. We want to change the order of the categories to present male then female. This code... works, but is it doing what we think it is doing?
````{verbatim, lang = "markdown"}
```{r}
zhang_data_long %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender)) +
geom_violin() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7)) +
scale_x_discrete(labels = c("Male", "Female"))
```
````
::: {.callout-caution collapse="true"}
#### Explain the solution
We have introduced this error several times, but we see it so often it is worth reinforcing. When we change the labels, this is really just to tidy things up. The underlying data does not change; we are just trying to communicate it more clearly. If we want to change the order of categories, we must change the underlying order of the data as a factor, or R will default to alphabetical/numerical order. So, we mutate Gender as a factor, then pipe to <pkg>ggplot2</pkg>.
```{r eval = FALSE}
zhang_data_long %>%
mutate(Gender = factor(Gender,
levels = c("Male", "Female"))) %>%
drop_na(Gender) %>%
ggplot(aes(y = Interest, x = Gender)) +
geom_violin() +
scale_y_continuous(name = "Interest score (1-7)",
breaks = c(1:7))
```
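If you want to confirm the order R is using before you plot, you can check the factor levels directly. This is a minimal sketch using a made-up `toy_gender` vector rather than the real data.

```{r eval=FALSE}
# Hypothetical vector of gender labels
toy_gender <- c("Male", "Female", "Female", "Male")

# Without specifying levels, R orders the categories alphabetically
default_levels <- levels(factor(toy_gender))
default_levels

# Specifying levels lets you set the order yourself
custom_levels <- levels(factor(toy_gender, levels = c("Male", "Female")))
custom_levels
```

By default, "Female" comes first, so the labels in the original code were applied to the wrong categories: the first category (Female) was relabelled as "Male".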
:::
## Words from this Chapter
Below is a list of words used in this chapter that might be new to you, in case it helps to have somewhere to check what they mean. The links in this table take you to the entry for the words in the [PsyTeachR Glossary](https://psyteachr.github.io/glossary/){target="_blank"}. Note that the Glossary is written by numerous members of the team and as such may use slightly different terminology from that shown in the chapter.
```{r gloss, echo=FALSE, results='asis'}
glossary_table()
```
## End of chapter
Well done, you have completed the second chapter dedicated to data visualisation! This is a key area of psychology research that helps you communicate your findings to your audience. Data visualisation also comes with a lot of responsibility. There are lots of design choices to make, and each one should help you communicate your findings as effectively and transparently as possible. We could dedicate a whole book to data visualisation possibilities in R and <pkg>ggplot2</pkg>, so we have added a range of further reading sources in the [Additional Resources](#additional-resources) appendix.
In the next chapter, we start on inferential statistics, introducing you to the concept of regression by focusing on one continuous predictor variable.