# Data Wrangling 3: Pivots and Pipes {#06-wrangling-03}
```{r setup, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Load the tidyverse package below
library(tidyverse)
# Load the data files
# This should be the Alter_2024_demographics.csv file
demog <- read_csv("data/Alter_2024_demographics.csv")
# This should be the Alter_2024_scales.csv file
scales <- read_csv("data/Alter_2024_scales.csv")
```
In the last chapter, we added two more functions to your data wrangling toolkit. You learnt how to filter data to retain or remove observations and summarise data to calculate different summary statistics.
In this chapter, we start with another recap of all your data wrangling skills so far on a new data set. There is no substitute for practice when it comes to data skills. By applying your skills to new data, you can transfer your knowledge to novel scenarios and develop independence. We then add two more sets of data wrangling functions. First, we demonstrate pivots to restructure your data from wide format into long format, and vice versa. Second, we show how you can string together multiple functions from the <pkg>tidyverse</pkg> using pipes. These help streamline your code to avoid creating lots of intermediary objects.
After this chapter, you will be ready for the first data analysis journey chapter: [Analysis Journey 1: Data Wrangling](#journey-01-wrangling). This is where you can test the skills you have developed so far on a more independent task as a bridge between the core chapters and your assessments.
**Chapter Intended Learning Outcomes (ILOs)**
By the end of this chapter, you will be able to:
- Apply all your data wrangling skills to a new data set.
- Restructure data into different formats, such as long and wide form.
- Understand the tidy data structure for how most of the <pkg>tidyverse</pkg> functions expect your data to look.
- String together multiple functions using pipes.
## Chapter preparation
### Introduction to the data set
For this chapter, we are using open data from @alter_vssl_2024. The abstract of their article is:
> The biggest difference in statistical training from previous decades is the increased use of software. However, little research examines how software impacts learning statistics. Assessing the value of software to statistical learning demands appropriate, valid, and reliable measures. The present study expands the arsenal of tools by reporting on the psychometric properties of the Value of Software to Statistical Learning (VSSL) scale in an undergraduate student sample. We propose a brief measure with strong psychometric support to assess students' perceived value of software in an educational setting. We provide data from a course using SPSS, given its wide use and popularity in the social sciences. However, the VSSL is adaptable to any statistical software, and we provide instructions for customizing it to suit alternative packages. Recommendations for administering, scoring, and interpreting the VSSL are provided to aid statistics instructors and education researchers understand how software influences students' statistical learning.
To summarise, they developed a new scale to measure students' perceived value of software to learning statistics - Value of Software to Statistical Learning (VSSL). The authors wanted to develop this scale in a way that could be adapted to different software, from SPSS in their article (which some of you may have used in the past), to perhaps R in future. Alongside data from their new scale, they collected data from other scales measuring a similar kind of construct (e.g., Students' Attitudes toward Statistics and Technology) and related constructs (e.g., Quantitative Attitudes).
In this chapter, we will wrangle their data to reinforce skills from Chapters 4 and 5. Scale data is extremely common in psychology and there is a high likelihood you will use one or more scales in your dissertation or future career. After recapping skills from the past two chapters on this new data set, we will add more data wrangling functions to your toolkit.
### Organising your files and project for the chapter
Before we can get started, you need to organise your files and project for the chapter, so your working directory is in order.
1. In your folder for research methods and the book `ResearchMethods1_2/Quant_Fundamentals`, you should have a folder from chapter 4 called `Chapter_04_06_datawrangling` where you created an R Project.
2. Create a new R Markdown document and give it a sensible title describing the chapter, such as `06 Data Wrangling 3`. Delete everything below line 10 so you have a blank file to work with and save the file in your `Chapter_04_06_datawrangling` folder.
3. We are working with a new data set separated into two files. The links are data file one ([Alter_2024_demographics.csv](data/Alter_2024_demographics.csv)) and data file two ([Alter_2024_scales.csv](data/Alter_2024_scales.csv)). Right-click each link and select "save link as", or click the links to save the files to your Downloads. Make sure that both files are saved as ".csv". Save or copy the files to your `data/` folder within `Chapter_04_06_datawrangling`.
You are now ready to start working on the chapter!
## Recapping all the previous <pkg>dplyr</pkg> functions
In this first section, we will prepare the data for later analysis by practising the data wrangling skills you learnt in Chapters 4 and 5 on this new data set.
### Activity 1 - Load <pkg>tidyverse</pkg> and read the data files
As the first activity, load <pkg>tidyverse</pkg> and read the two data files. As a prompt, save the data files to the object names below so they are consistent with the later activities. You can check your answer in the solution if you are stuck.
```{r eval=FALSE}
# Load the tidyverse package below
?
# Load the data files
# This should be the Alter_2024_demographics.csv file
demog <- ?
# This should be the Alter_2024_scales.csv file
scales <- ?
```
::: {.callout-tip collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r eval=FALSE}
# Load the tidyverse package below
library(tidyverse)
# Load the data files
# This should be the Alter_2024_demographics.csv file
demog <- read_csv("data/Alter_2024_demographics.csv")
# This should be the Alter_2024_scales.csv file
scales <- read_csv("data/Alter_2024_scales.csv")
```
:::
### Activity 2 - Explore `demog` and `scales`
The data from @alter_vssl_2024 is split into two data files. In `demog`, we have the participant ID (`StudentIDE`) and several demographic variables. The columns (variables) we have in the data set are:
| Variable | Type | Description |
|:--------------:|:---------------------------------|:-------------------------------|
| StudentIDE | `r typeof(demog$StudentIDE)`| Participant number |
| GenderE | `r typeof(demog$GenderE)`| Gender: 1 = Female, 2 = Male, 3 = Non-Binary |
| RaceEthE | `r typeof(demog$RaceEthE)`| Race: 1 = Black/African American, 2 = Hispanic/Other Latinx, 3 = White, 4 = Multiracial, 5 = Asian/Pacific Islander, 6 = Native American/Alaska Native, 7 = South/Central American |
| GradeE | `r typeof(demog$GradeE)`| Expected grade: 1 = A, 2 = B, 3 = C, 4 = D, 5 = F |
| StuStaE | `r typeof(demog$StuStaE)`| Student status: 1 = Freshman, 2 = Sophomore, 3 = Junior, 4 = Senior or Higher |
| GPAE | `r typeof(demog$GPAE)`| Expected Grade Point Average (GPA) |
| MajorE | `r typeof(demog$MajorE)`| Degree major |
| AgeE | `r typeof(demog$AgeE)`| Age in years |
In `scales`, we then have the participant ID (`StudentIDE`) and all the individual scale items. The columns (variables) we have in the data set are:
| Variable | Type | Description |
|:--------------:|:---------------------------------|:-------------------------------|
| StudentIDE | `r typeof(scales$StudentIDE)`| Participant number |
| MA1E to MA8E | `r typeof(scales$MA1E)`| **Enjoyment of Mathematics and statistics**, not analysed in this study. |
| QANX1E to QANX4E | `r typeof(scales$QANX1E)`| **Quantitative anxiety**: four items scored on a 5-point Likert scale ranging from 1 (Not at all Anxious) to 5 (Extremely Anxious) |
| QINFL1E to QINFL7E | `r typeof(scales$QINFL1E)`| **Quantitative attitudes**: seven items scored on a 5-point Likert scale ranging from 1 (Strongly Disagree) to 5 (Strongly Agree) |
| QSF1E to QSF4E | `r typeof(scales$QSF1E)`| **Study motivation**, not analysed in this study. |
| QHIND1E to QHIND5E | `r typeof(scales$QHIND1E)`| **Quantitative hindrances**: five items scored on a 5-point Likert scale ranging from 1 (Strongly Disagree) to 5 (Strongly Agree) |
| QSC1E to QSC4E | `r typeof(scales$QSC1E)`| **Mathematical self-efficacy**, not analysed in this study. |
| QSE1E to QSE6E | `r typeof(scales$QSE1E)`| **Mathematical ability**, not analysed in this study. |
| SPSS1E to SPSS10E | `r typeof(scales$SPSS1E)`| **VSSL scale on SPSS**: 10 items scored on a 5-point Likert scale ranging from 1 (Never True) to 5 (Always True) |
::: {.callout-tip}
#### Try this
Now that we have introduced the two data sets, explore them using the different methods we have covered. For example, open the data objects in a tab to scroll around, explore them with `glimpse()`, or even try plotting some of the variables using the visualisation skills from Chapter 3.
:::
### Activity 3 - Joining the two data sets using `inner_join()`
At the moment, we have two separate data sets, but it will make things easier to join them together so we have both demographic information and the participants' responses to the scales.
We did not recap joining data sets in the last chapter, so you might need to revisit Chapter 4 - [Joining two data frames](#04-joins) - for a recap.
Create a new data object called `full_data` and see if you can spot a common variable between both data sets that you can use as an identifier.
```{r eval=FALSE}
# join demog and scales by a common identifier
full_data <- ?
```
::: {.callout-tip collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r}
# join demog and scales by a common identifier
full_data <- inner_join(x = demog,
y = scales,
by = "StudentIDE")
```
:::
### Activity 4 - Selecting a range of columns using `select()`
There are some scales in the data that @alter_vssl_2024 did not analyse, so we can get rid of them to declutter. Furthermore, the purpose of their study was to validate the new VSSL scale and they found some items did not make the cut. Create a new object called `full_data_select` and retain the following variables from your new `full_data` object:
- `StudentIDE`
- `GenderE`
- `RaceEthE`
- `AgeE`
- `QANX1E` to `QINFL7E`
- `QHIND1E` to `QHIND5E`
- `SPSS1E`, `SPSS4E`, `SPSS5E`, `SPSS6E`, `SPSS7E`, `SPSS8E`, `SPSS9E`.
Remember: you can select variables either by **retaining** the variables you want to keep, or **removing** the variables you want to remove. You should have **27** columns remaining.
```{r eval=FALSE}
# select the key variables listed above
full_data_select <- ?
```
::: {.callout-tip collapse="true"}
#### Show me the solution
You should have the following in a code chunk if you chose to retain:
```{r}
# select the key variables listed above
full_data_select <- select(full_data,
StudentIDE,
GenderE,
RaceEthE,
AgeE,
QANX1E:QINFL7E,
QHIND1E:QHIND5E,
SPSS1E, SPSS4E, SPSS5E, SPSS6E, SPSS7E, SPSS8E, SPSS9E)
```
or the following if you chose to remove:
```{r eval=FALSE}
# select the key variables listed above
full_data_select <- select(full_data,
-GradeE,
-StuStaE,
-GPAE,
-MajorE,
-MA1E:-MA8E,
-QSF1E:-QSF4E,
-QSC1E:-QSE6E,
-SPSS2E, -SPSS3E, -SPSS10E)
```
There is a similar number of variables to retain or remove, so there is no real time saving either way.
:::
### Activity 5 - Reorder observations using `arrange()`
For a quick check of the data, order the values of `AgeE` using the object `full_data_select` and answer the following questions:
1. The youngest participant is `r fitb(18)` years old.
2. The oldest participant is `r fitb(39)` years old.
```{r eval=FALSE}
# youngest participants
?
# oldest participants
?
```
::: {.callout-tip collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r eval=FALSE}
# youngest participants
arrange(full_data_select,
AgeE)
# oldest participants
arrange(full_data_select,
desc(AgeE))
```
:::
### Activity 6 - Modifying or creating variables using `mutate()`
At the moment, we have categorical variables such as gender (`GenderE`) and race (`RaceEthE`) which use numerical codes. When it comes to summarising or plotting later, these codes are not the easiest to understand.
Using the `full_data_select` object, use `mutate()` to recode these two existing variables and replace the numbers with labels and create a new object `full_data_mutate`. As a reminder of what each number refers to:
`GenderE`
- 1 = Female
- 2 = Male
- 3 = Non-binary
`RaceEthE`
- 1 = Black/African American
- 2 = Hispanic/Other Latinx
- 3 = White
- 4 = Multiracial
- 5 = Asian/Pacific Islander
- 6 = Native American/Alaska Native
- 7 = South/Central American
```{r eval=FALSE}
# recode gender and race to labels
full_data_mutate <- ?
```
::: {.callout-tip collapse="true"}
#### Show me the solution
You should have the following in a code chunk (some lines wrap due to being quite long, but it will look right if you copy and paste it to your RStudio):
```{r}
# recode gender and race to labels
full_data_mutate <- mutate(full_data_select,
GenderE = case_match(GenderE,
1 ~ "Female",
2 ~ "Male",
3 ~ "Non-binary"),
RaceEthE = case_match(RaceEthE,
1 ~ "Black/African American",
2 ~ "Hispanic/Other Latinx",
3 ~ "White",
4 ~ "Multiracial",
5 ~ "Asian/Pacific Islander",
6 ~ "Native American/Alaska Native",
7 ~ "South/Central American"))
```
:::
#### Bonus activity - reverse coding scales {#06-reverse-coding}
For a bonus activity, we want to demonstrate a super common task when working with scale data. Often, scales will **`r glossary("reverse code", def = "Having two similar questions, one expressed in a positive way, and another expressed in a negative way.")`** some items to express the same idea in opposite ways: one positive and one negative. If the scale is measuring a consistent construct, the responses should be more positive in one and more negative in the other. If you analysed this immediately, you would get two opposing answers, so a key data wrangling step is reverse coding some items so all the numbers mean a similar thing.
In @alter_vssl_2024, the three VSSL items we removed were the ones which needed to be reverse coded, but they are a good excuse to practise. Using the `scales` object, what function could you use to recode the existing responses? Hint: we want to recode 1 to 5, 2 to 4, and so on.
```{r eval=FALSE}
# recode items 2, 3, and 10
scales_reverse <- mutate(scales,
SPSS2_R = ?,
SPSS3_R = ?,
SPSS10_R = ?)
```
::: {.callout-tip collapse="true"}
#### Show me the solution
Based on what we covered before, we expect you will have completed a perfectly accurate but long process of recoding each item one by one:
```{r eval=FALSE}
# recode items 2, 3, and 10
scales_reverse <- mutate(scales,
SPSS2_R = case_match(SPSS2E,
1 ~ 5,
2 ~ 4,
3 ~ 3,
4 ~ 2,
5 ~ 1),
SPSS3_R = case_match(SPSS3E,
1 ~ 5,
2 ~ 4,
3 ~ 3,
4 ~ 2,
5 ~ 1),
SPSS10_R = case_match(SPSS10E,
1 ~ 5,
2 ~ 4,
3 ~ 3,
4 ~ 2,
5 ~ 1))
```
However, there is a neat shortcut: subtract each response from the biggest scale unit plus 1. For example, if you have a 5-point scale, you would subtract the response from 6; if you have a 7-point scale, from 8; and so on.
```{r eval=FALSE}
# Reverse code by subtracting responses from 6
scales_reverse <- mutate(scales,
SPSS2_R = 6 - SPSS2E,
SPSS3_R = 6 - SPSS3E,
SPSS10_R = 6 - SPSS10E)
```
Explore your new data object to see what the new reverse coded variables look like.
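You can also convince yourself the shortcut matches the item-by-item recode by applying it to every possible response on the scale:

```r
# 6 minus each point on a 5-point scale reverses the coding
6 - c(1, 2, 3, 4, 5)
# returns 5 4 3 2 1
```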
:::
### Activity 7 - Removing or retaining observations using `filter()`
To practice filtering data to retain specific participants, imagine we wanted to focus on two specific groups of people.
First, we just want to explore the data of "Non-binary" participants. Second, we want to explore the data of "Female", "Asian/Pacific Islander" participants. Use `filter()` on the `full_data_mutate` object to create two objects: `NB_participants` and `F_asian_participants`.
```{r eval=FALSE}
# non-binary participants
NB_participants <- ?
# female, Asian/Pacific Islander participants
F_asian_participants <- ?
```
After creating the objects, answer the following questions:
1. We have `r fitb(1)` non-binary participant(s) in the data set.
2. We have `r fitb(9)` female, Asian/Pacific Islander participant(s) in the data set.
::: {.callout-tip collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r}
# non-binary participants
NB_participants <- filter(full_data_mutate,
GenderE == "Non-binary")
# female, Asian/Pacific Islander participants
F_asian_participants <- filter(full_data_mutate,
GenderE == "Female",
RaceEthE == "Asian/Pacific Islander")
```
:::
#### Bonus activity - Removing NAs with `drop_na()` {#06-drop-NAs}
One concept we will spend more time on in [Chapter 11 - Screening Data]() - is removing participants who do not provide an answer. We delve more into the decision making in the course materials, but there is a handy function in <pkg>tidyr</pkg> called `drop_na()`. You could do this using `filter()`, but the standalone function streamlines things. If you run the function on your whole data set, it will remove any observation with an NA in one or more of its variables:
```{r}
# remove observations with any NAs
no_NAs <- drop_na(full_data_mutate)
```
However, often you do not want to remove every observation with an NA, as there might be valuable information elsewhere. You can add one or more variables to ask `drop_na()` to only remove observations with NAs in those specific variables:
```{r}
# remove observations with NAs in AgeE only
age_NAs <- drop_na(full_data_mutate,
AgeE)
```
This affects the number of participants we retain: we had **171** remaining when we removed observations with any NA, but **179** when we only removed NAs in age.
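One quick way to check counts like these yourself is base R's `nrow()`, which reports the number of rows (here, participants) in an object:

```r
# check how many participants remain in each object
nrow(full_data_mutate) # before removing any NAs
nrow(no_NAs)           # after removing observations with any NA
nrow(age_NAs)          # after removing NAs in AgeE only
```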
### Activity 8 - Summarising data using `count()` and `summarise()`
#### Counting observations
As the final recap activity, it is time to calculate some summary statistics to understand our data set. First, use `count()` on the `full_data_mutate` object to answer the following questions:
1. How many observations do we have of each gender? `r fitb(29)` males, `r fitb(149)` females, and `r fitb(1)` non-binary.
2. How many observations do we have of each race? `r fitb(131)` white, `r fitb(13)` Black/African American, and `r fitb(1)` NA with missing data.
```{r eval=FALSE}
# count each group in GenderE
?
# count each group in RaceEthE
?
```
::: {.callout-tip collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r eval=FALSE}
# count each group in GenderE
count(full_data_mutate,
GenderE)
# count each group in RaceEthE
count(full_data_mutate,
RaceEthE)
```
:::
#### Summarising observations
One useful demographic summary is the mean and standard deviation (*SD*) of participant ages. We have covered the function for the mean (`mean()`) several times, but a key part of coding is knowing what you want to do even when you do not yet know the function to do it. So, for the next answer, try to find the function for the standard deviation on your own. If you are really stuck, you can see the hint below.
::: {.callout-note collapse="true"}
#### Give me a hint for the SD function
```{r eval=FALSE}
# Function for the standard deviation
sd()
```
:::
```{r eval=FALSE}
# Mean and SD age
mean_age <- summarise(full_data_mutate,
mean_age = ?,
SD_age = ?)
```
::: {.callout-tip collapse="true"}
#### Show me the solution
You should have the following in a code chunk:
```{r eval=FALSE}
# Mean and SD age
mean_age <- summarise(full_data_mutate,
mean_age = mean(AgeE, na.rm = TRUE),
SD_age = sd(AgeE, na.rm = TRUE))
```
Remember, if there are NAs present in the data like this, you need to add `na.rm = TRUE` or handle NAs prior to applying the function.
:::
::: {.callout-important}
#### Error mode
As a transition point to restructuring data, imagine we wanted to sum the items to produce a single score for the whole scale per participant. Based on how we have used `mutate()` or `summarise()` before, you might try:
```{r}
sum_VSSL <- mutate(full_data_mutate,
VSSL = sum(c(SPSS1E, SPSS4E, SPSS5E, SPSS6E, SPSS7E, SPSS8E, SPSS9E), na.rm = TRUE))
```
However, if you look at the object, the VSSL column is the same for every participant (4413), which does not look right. This is due to how functions work within `mutate()`: `sum()` is applied to every value in all the selected columns at once, rather than summing each participant's responses separately.
We can fix this problem by restructuring the data.
:::
## Restructuring data using `pivot_longer()` and `pivot_wider()`
Apart from joining two data sets, we have mostly worked with data files as they come to us, where each row represents one observation/participant and each column represents one variable. That is great, but there are scenarios where you get data sets in messier formats that do not follow this pattern. Furthermore, you might need to restructure your data to perform certain functions, such as taking the mean or sum of many columns per participant, or visualising multiple elements. Before we work on the data wrangling side, we need a brief explanation of data formats.
### Tidy data
For most of this book, we use a type of data organisation known as `r glossary("tidy data")`. Any data in this format is easily processed through the <pkg>tidyverse</pkg> family of packages. However, the data you work with will not always be formatted in the most efficient way possible. If that happens, then our first step is to put it into a tidy data format. There are two fundamental principles defining tidy data:
1. Each variable must have its own column.
2. Each observation must have its own row.
@wickham_tidy_2014 adds the following principle:
3. Each type of observation unit forms a table.
[Grolemund and Wickham (2023)](https://r4ds.hadley.nz/data-tidy.html){target="_blank"} restate this third principle as: "Each value must have its own cell (i.e. no grouping two variables together, e.g. time/date in one cell)", where a cell is where any specific row and column meet; a single data point in a data frame / tibble is one cell, for example. The Grolemund and Wickham (2023) book is a very useful (and free) source for further reading, and browsing its chapter on tidy data will help you visualise how you want to arrange data.
::: {.callout-note}
If you have worked with any kind of data before, particularly if you have used Excel, it is likely that you will have used **wide format** or **long format** data. In wide format, each participant's data is all in one row with multiple columns for different data points. This means that the data set tends to be very wide and you will have as many rows as you have participants.
Long format is where each **row** is a single observation, typically a single trial in an experiment or a response to a single item on a questionnaire. When you have multiple trials per participant, you will have multiple rows for the same participant. To identify participants, you would need a variable with some kind of participant id, which can be as simple as a distinct integer value for each participant. In addition to the participant identifier, you would have any measurements taken during each observation (e.g., response time) and what experimental condition the observation was taken under.
In wide format data, each **row** corresponds to a single participant, with multiple observations for that participant spread across columns. So for instance, with survey data, you would have a separate column for each survey question.
Tidy data mixes both of these approaches, and most functions in the <pkg>tidyverse</pkg> assume the tidy format. Typically, the first thing to think about when you get data is what format you need in order to perform the functions and analyses you want. For some functions, you need your data in wide format; for others, you need it in long format. This means being able to quickly restructure your data is a key skill.
:::
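As a toy sketch of the difference (with made-up values, not from this chapter's data), here is the same small data set written first in wide format and then in long format:

```r
library(tidyverse)

# Wide format: one row per participant, one column per item
toy_wide <- tribble(
  ~id, ~item_1, ~item_2,
    1,       4,       5,
    2,       2,       3
)

# Long format: one row per observation (participant x item)
toy_long <- tribble(
  ~id, ~item,     ~response,
    1, "item_1",          4,
    1, "item_2",          5,
    2, "item_1",          2,
    2, "item_2",          3
)
```

Notice the wide version has 2 rows (one per participant) while the long version has 4 rows (one per participant-item combination), but both contain exactly the same information.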
### Activity 9: Gathering with `pivot_longer()`
In its current format, we have wide data where each row is a separate participant and each column is a separate variable. To restructure it into long format, we can use the function `r glossary("pivot_longer()", def = "Gather data by increasing the number of rows and decreasing the number of columns.")` from the <pkg>tidyr</pkg> package within <pkg>tidyverse</pkg>.
The pivot functions can be easier to show than explain first, so type and run the following code using the `full_data_mutate` object:
```{r}
full_data_long <- pivot_longer(data = full_data_mutate,
cols = SPSS1E:SPSS9E,
names_to = "Question",
values_to = "Response")
```
To break down the code:
- We create a new data object called `full_data_long` by applying the `pivot_longer()` function to `full_data_mutate`.
- In the `cols` argument, we specify the columns we want to gather. We use the colon method here like `select()` to choose the 7 columns for the VSSL items. If the columns are not in order, you could use the `c()` method instead (e.g., `cols = c(SPSS1E, SPSS9E)`).
- The `names_to` argument is what your first new column will be called. All the column names you selected in `cols` will be pivoted into this new column, so call it something sensible you will remember later. Here, we call the new column "Question".
- The `values_to` argument is what your second new column will be called. For all the columns you gather, the response of each participant will be in one column stacked on top of each other next to its label in "Question". You also need to call this something memorable, like "Response" here.
Now, explore the new `full_data_long` object you just created and compare it to `full_data_mutate`. Instead of **181** rows, we now have **1267** rows. Instead of 27 variables, we now have 22 variables. We have 181 participants who responded to 7 VSSL items, so we pivot the data into long format to get 181 * 7 = **1267** rows.
Visually, you can see the difference with a preview of just the participant ID and VSSL items here:
::: {.panel-tabset}
#### Original wide format
```{r echo=FALSE}
full_data_mutate %>%
select(StudentIDE, SPSS1E:SPSS9E) %>%
head(n = 10)
```
#### New long format
```{r echo=FALSE}
full_data_long %>%
select(StudentIDE, Question, Response) %>%
head(n = 10)
```
:::
Now we have our data in long form, we can calculate summary statistics for participants using `group_by()` and `summarise()`. First, we group the data by the participant ID, as we want one value per participant:
```{r}
# group full_data_long by StudentIDE
longdata_grouped <- group_by(full_data_long,
StudentIDE)
```
Second, we create a new variable using `summarise()` to take the sum of all the items. This will create the VSSL scale score consistent with @alter_vssl_2024:
```{r}
# Calculate the sum of VSSL items by taking the sum of Response
VSSL_sum <- summarise(longdata_grouped,
VSSL_sum = sum(Response))
```
Our new object goes from 1267 rows back to 181 as we grouped by the participant ID and took the sum of `Response`. This means the function we provide to `summarise()` is applied to all the rows within each group, in this case across each participant's 7 VSSL item responses. Your new object has just two columns, `StudentIDE` and `VSSL_sum`, and should look like the following extract:
```{r echo=FALSE}
head(VSSL_sum)
```
At this point, you could join the object to `full_data_mutate` to add the scale score to all the other variables.
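For example, using the joining skills from Activity 3, that could look something like the sketch below (the object name `full_data_scored` is just a suggestion):

```r
# add the VSSL scale score back to the demographic and item data
full_data_scored <- inner_join(x = full_data_mutate,
                               y = VSSL_sum,
                               by = "StudentIDE")
```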
::: {.callout-tip}
#### Try this
We calculated the VSSL scale score by pivoting longer, grouping the data, and taking the sum of the 7 items. To test your understanding, complete the same steps to calculate the scale score of **Quantitative anxiety** using the four columns `QANX1E` to `QANX4E`. The scale score here also involves taking the sum of the columns. Use the `full_data_mutate` object as your starting point for the data.
Check your attempt with the solution below when you have tried on your own.
```{r eval=FALSE}
# gather the four quant anxiety items to long form
quant_anxiety_long <- ?
# group the long data by participant ID
quant_anxiety_group <- ?
# calculate the sum quant anxiety per participant
sum_quant_anxiety <- ?
```
To check your answers:
1. Participant 1 has a sum quantitative anxiety score of `r fitb(6)`
2. Participant 5 has a sum quantitative anxiety score of `r fitb(16)`
:::
::: {.callout-caution collapse="true"}
#### Solution
We complete the task in three steps. First, we pivot longer using the four columns QANX1E to QANX4E to create the new `quant_anxiety_long` object. Second, we group that new long data object by the participant ID. Third, we calculate the sum quantitative anxiety score by taking the sum of the responses per participant ID.
```{r}
# gather the four quant anxiety items to long form
quant_anxiety_long <- pivot_longer(data = full_data_mutate,
cols = QANX1E:QANX4E,
names_to = "Question",
values_to = "Response")
# group the long data by participant ID
quant_anxiety_group <- group_by(quant_anxiety_long,
StudentIDE)
# calculate the sum quant anxiety per participant
sum_quant_anxiety <- summarise(quant_anxiety_group,
sum_quant_anxiety = sum(Response))
```
:::
### Spreading with `pivot_wider()`
You might also find yourself in situations where you must restructure data in the opposite direction: from long to wide. There is a complementary function called `r glossary("pivot_wider()", def = "Spread data by decreasing the number of rows and increasing the number of columns.")` with which you can spread values from one column across multiple columns. You need two columns in your long-form data set: one for the variable names, which will become your new column names, and one for the responses, which will become the values in each cell.
To demonstrate this function, we will transform `full_data_long` back to wide format so we have 7 columns of VSSL items:
```{r}
full_data_wide <- pivot_wider(data = full_data_long,
names_from = "Question",
values_from = "Response")
```
To break down the code:
- We create a new object `full_data_wide` by applying the function `pivot_wider()` to `full_data_long`.
- In the `names_from` argument, we add the column name "Question" which contains the names of the variables you want as your new column names.
- In the `values_from` argument, we add the column name "Response" which contains the values of the variables which will be the cells of your data frame.
The new object `full_data_wide` should now look exactly the same as the `full_data_mutate` object we started with.
::: {.callout-note}
#### Do I need to add quotes to the column names?
You might have noticed we added quotes around the column names to specify the `names_from` and `values_from` arguments. When we specify columns in <pkg>tidyverse</pkg> functions, we do not need to add the quotes; we can just type the name and it will work (`names_from = "Question"` and `names_from = Question` would both work here). However, in other functions outside the <pkg>tidyverse</pkg>, you normally need to add quotes around column names. Knowing when to add quotes can take a while to get used to, so just remember: if one method produces an error, try the other.
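As a quick illustration with a small made-up data frame (not part of the Alter data), both the quoted and unquoted forms produce identical output:
```{r}
# a tiny toy example: two participants answering two questions
toy_long <- tibble(id = c(1, 1, 2, 2),
                   Question = c("Q1", "Q2", "Q1", "Q2"),
                   Response = c(3, 4, 2, 5))
# quoted column names
pivot_wider(toy_long, names_from = "Question", values_from = "Response")
# unquoted column names work just the same
pivot_wider(toy_long, names_from = Question, values_from = Response)
```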
:::
::: {.callout-tip}
#### Try this
In the `pivot_longer()` section, you should have created a new object `quant_anxiety_long` if you completed the "Try this" activity. To test your understanding of `pivot_wider()`, spread the four items and responses of **Quantitative anxiety** back to wide format. Use the `quant_anxiety_long` object as your starting point and create a new object called `quant_anxiety_wide`.
Check your attempt with the solution below when you have tried on your own.
```{r eval=FALSE}
# spread the quant anxiety items back to wide form
quant_anxiety_wide <- ?
```
:::
::: {.callout-caution collapse="true"}
#### Solution
This task follows the exact same format as the `full_data_wide` example, assuming you named your variables the same as ours. It is just important that the `names_from` and `values_from` columns match those you created in `quant_anxiety_long`.
```{r}
# spread the quant anxiety items back to wide form
quant_anxiety_wide <- pivot_wider(quant_anxiety_long,
names_from = "Question",
values_from = "Response")
```
:::
## Combining several functions with pipes
In this final section on data wrangling, we are not covering new functions, but a new way of working. So far, we have created lots of new objects by applying individual <pkg>tidyverse</pkg> functions, but there is a way to string together several functions and streamline your code. We wanted to introduce you to the individual functions first to develop your fundamental skills and understanding of what each function does, but now we can be a little more efficient.
Instead of creating several objects, you can use `r glossary("pipe", display = "pipes")`. We write pipes as `%>%` and you can read them as "and then". Pipes allow you to string together 'sentences' of code into 'paragraphs' so that you do not need to create intermediary objects.
This is another one of those concepts that is initially easier to show than tell:
```{r}
# Create an object starting with demog
full_data_pipe <- demog %>%
# Join with scales
inner_join(y = scales,
by = "StudentIDE") %>%
# Select key columns
select(StudentIDE,
GenderE,
RaceEthE,
AgeE,
QANX1E:QINFL7E,
QHIND1E:QHIND5E,
SPSS1E, SPSS4E:SPSS9E) %>%
# Recode variables with labels
mutate(GenderE = case_match(GenderE,
1 ~ "Female",
2 ~ "Male",
3 ~ "Non-binary"),
RaceEthE = case_match(RaceEthE,
1 ~ "Black/African American",
2 ~ "Hispanic/Other Latinx",
3 ~ "White",
4 ~ "Multiracial",
5 ~ "Asian/Pacific Islander",
6 ~ "Native American/Alaska Native",
7 ~ "South/Central American"))
```
Instead of creating all the intermediary objects, we go straight from joining the two data sets to recoding the variables in mutate, all in one object. Side by side, you can see the difference in the process we had to go through:
::: {.panel-tabset}
#### Creating separate objects
```{r eval=FALSE}
# join demog and scales by a common identifier
full_data <- inner_join(x = demog,
y = scales,
by = "StudentIDE")
# select the key variables listed above
full_data_select <- select(full_data,
StudentIDE,
GenderE,
RaceEthE,
AgeE,
QANX1E:QINFL7E,
QHIND1E:QHIND5E,
SPSS1E, SPSS4E:SPSS9E)
# recode gender and race to labels
full_data_mutate <- mutate(full_data_select,
GenderE = case_match(GenderE,
1 ~ "Female",
2 ~ "Male",
3 ~ "Non-binary"),
RaceEthE = case_match(RaceEthE,
1 ~ "Black...",
2 ~ "Hispanic...",
3 ~ "White",
4 ~ "Multiracial",
5 ~ "Asian...",
6 ~ "Native American...",
7 ~ "South..."))
```
#### Combining functions using pipes
```{r eval=FALSE}
# Create an object starting with demog
full_data_pipe <- demog %>%
# Join with scales
inner_join(y = scales,
by = "StudentIDE") %>%
# Select key columns
select(StudentIDE,
GenderE,
RaceEthE,
AgeE,
QANX1E:QINFL7E,
QHIND1E:QHIND5E,
SPSS1E, SPSS4E:SPSS9E) %>%
# Recode variables with labels
mutate(GenderE = case_match(GenderE,
1 ~ "Female",
2 ~ "Male",
3 ~ "Non-binary"),
RaceEthE = case_match(RaceEthE,
1 ~ "Black...",
2 ~ "Hispanic...",
3 ~ "White",
4 ~ "Multiracial",
5 ~ "Asian...",
6 ~ "Native American...",
7 ~ "South..."))
```
:::
As you get used to using pipes, remember you can interpret them as "and then". So, we could explain the function of the code to ourselves as:
- Create `full_data_pipe` by starting with `demog` data, **and then**
- Join with the `scales` data using `StudentIDE` as an identifier, **and then**,
- Select our key columns, **and then**
- Mutate to recode gender and race.
It can be tricky at first to understand what pipes are doing from a conceptual point of view, but it is well worth learning to use them. When your code starts getting longer, they are much more efficient: you write less code, which makes it easier to debug and find errors. You also have fewer objects in your environment, as we created one object instead of three, tidying your workspace.
::: {.callout-important}
#### Error mode
One key difference that can trip people up is that we no longer specify the data object as the first argument of each function. The reason `%>%` is called a pipe is that it 'pipes' the data through to the next function. When you wrote the code previously, the first argument of each function was the data set you wanted to work on. When you use pipes, each function automatically takes the data from the previous line of code, so you do not need to specify it again.
For example, if we tried to specify `demog` again in `select()`, we would receive an error, because the piped data already fills the first argument.
```{r eval=FALSE}
# Create an object starting with demog
full_data_pipe <- demog %>%
# Join with scales
inner_join(y = scales,
by = "StudentIDE") %>%
# Select key columns
select(.data = demog,
StudentIDE,
GenderE,
RaceEthE,
AgeE,
QANX1E:QINFL7E,
QHIND1E:QHIND5E,
SPSS1E, SPSS4E:SPSS9E)
```
:::
::: {.callout-tip}
#### Try this
Pipes also work with other functions like `filter()`, `group_by()` and `summarise()`. If you start with the object `full_data_mutate`, try and express the following instructions in code:
1. Create a new object `age_groups` using `full_data_mutate` as your starting point, **and then**
2. Filter to only include "White" and "Black/African American" participants using `RaceEthE`, **and then**,
3. Group the observations by `RaceEthE`, **and then**,
4. Summarise the data to calculate the mean and standard deviation of `AgeE`.
Check your attempt with the solution below when you have tried on your own.
```{r eval=FALSE}
# create age_groups by filtering, grouping, and summarising
age_groups <- full_data_mutate %>%
?
```
:::
::: {.callout-caution collapse="true"}
#### Solution
We complete this task in three steps. First, we filter `full_data_mutate` to just focus on White and Black/African American participants. Second, we group the data by `RaceEthE` so our summary statistics are split into two groups. Third, we calculate the mean and SD age.
```{r}
# create age_groups by filtering, grouping, and summarising
age_groups <- full_data_mutate %>%
filter(RaceEthE %in% c("White", "Black/African American")) %>%
group_by(RaceEthE) %>%
summarise(mean_age = mean(AgeE, na.rm = TRUE),
SD_age = sd(AgeE, na.rm = TRUE))
```
:::
## Test yourself
To end the chapter, we have some knowledge check questions to test your understanding of the concepts we covered. We then have some error mode tasks to see if you can find the solution to some common errors in those concepts.
### Knowledge check
Which function(s) would you use to approach each of the following problems?
**Question 1**. We have a data set of 400 adults but we want to remove anyone with an age of 50 years or more. To do this, we could use:
`r longmcq(sample(c("select()", answer="filter()", "mutate()", "arrange()", "group_by()", "summarise()")))`
**Question 2**. We are interested in overall summary statistics for our data, such as the mean and total number of observations for a variable. To do this, we could use:
`r longmcq(sample(c("select()", "filter()", "mutate()", "arrange()", "group_by()", answer="summarise()")))`
**Question 3**. Our data set has a column with the number of cats a person has and a column with the number of dogs. We want to calculate a new column which contains the total number of pets each participant has. To do this, we could use:
`r longmcq(sample(c("select()", "filter()", answer="mutate()", "arrange()", "group_by()", "summarise()")))`
**Question 4**. We want to calculate the mean value of a column for several groups in our data set. To do this, we could use:
`r longmcq(sample(c(answer="group_by() and summarise()", "filter() and select()", "group_by() and arrange()", "arrange() and mutate()")))`
**Question 5**. If we wanted to apply the following wrangling steps with pipes, which series of functions would work? With the object `wide_data`, select several columns **and then**, pivot three columns longer **and then**, group by a participant ID **and then**, calculate the sum of responses.
`r longmcq(sample(c(answer = "wide_data %>% select() %>% pivot_longer() %>% group_by() %>% summarise()", "long_data %>% select() %>% pivot_wider() %>% group_by() %>% summarise()", "wide_data %>% pivot_longer() %>% select() %>% summarise() %>% group_by()", "select() %>% pivot_longer() %>% group_by() %>% summarise() %>% wide_data")))`
### Error mode
The following questions are designed to introduce you to making and fixing errors. For this topic, we focus on data wrangling using previously covered functions, `pivot_longer()`, and pipes (`%>%`). Remember to keep a note of what kind of error messages you receive and how you fixed them, so you have a bank of solutions when you tackle errors independently.
Create and save a new R Markdown file for these activities. Delete the example code, so your file is blank from line 10. Create a new code chunk to load <pkg>tidyverse</pkg> and the data files:
```{r eval=FALSE}
# Load the tidyverse package below
library(tidyverse)
# Load the data files
# This should be the Alter_2024_demographics.csv file
demog <- read_csv("data/Alter_2024_demographics.csv")
# This should be the Alter_2024_scales.csv file
scales <- read_csv("data/Alter_2024_scales.csv")
```
Below, we have several variations of a code chunk error or misspecification. Copy and paste them into your R Markdown file below the code chunk to load <pkg>tidyverse</pkg> and the data files. Once you have copied the activities, click knit and look at the error message you receive. See if you can fix the error and get it working before checking the answer.
**Question 6**. Copy the following code chunk into your R Markdown file and press knit. In this code chunk, we want to calculate the mean and SD age of all participants using `demog`. There are two errors/omissions here to try and fix:
1. One causes the document not to knit. You should receive an error like `Caused by error in "SD()": ! could not find function "SD"`.
2. The other does not stop the document knitting, but we just get NA values.
````{verbatim, lang = "markdown"}
```{r}
# calculate the mean and SD age
demog %>%
summarise(mean_age = mean(AgeE),
SD_age = SD(AgeE))
```
````
::: {.callout-caution collapse="true"}
#### Explain the solution
The first error is using the wrong function name for SD. Because we usually abbreviate standard deviation to SD, it is tempting to use that as the function name. However, the function is lowercase: `sd()`.
The second error is not including the `na.rm = TRUE` argument. There are NAs in the data, so you either need to address them before running the function, or ignore the NAs by adding `na.rm = TRUE`.
```{r eval = FALSE}
# calculate the mean and SD age
demog %>%
summarise(mean_age = mean(AgeE, na.rm = TRUE),
SD_age = sd(AgeE, na.rm = TRUE))
```
:::
**Question 7**. Copy the following code chunk into your R Markdown file and press knit. We want to calculate the sum of the five quantitative hindrance items per participant. This code... works, but do the values look like they fit in the possible 5-25 range?
````{verbatim, lang = "markdown"}
```{r}
# sum quant hindrances items per participant
sum_quant_hindrance <- scales %>%
mutate(sum_quant_hindrance = sum(c(QHIND1E, QHIND2E, QHIND3E, QHIND4E, QHIND5E), na.rm = TRUE))
```
````
::: {.callout-caution collapse="true"}
#### Explain the solution
This is the kind of mistake we warned about in the opening section on `pivot_longer()`. Intuitively, it seems right to calculate the sum in a new column. However, inside `mutate()`, `sum()` adds up every value across all rows of those columns, so every participant receives the same grand total rather than their own sum.
Instead, we can pivot longer focusing on the quantitative hindrance items, group by participant ID, and summarise.
```{r eval = FALSE}
# sum quant hindrances items per participant
sum_quant_hindrance <- scales %>%
# pivot longer on 5 quant hindrances items
pivot_longer(cols = QHIND1E:QHIND5E,
names_to = "Question",
values_to = "Response") %>%
# group by student ID
group_by(StudentIDE) %>%
# summarise for sum of new long column
summarise(sum_quant_hindrance = sum(Response, na.rm = TRUE))
```
:::
**Question 8**. Copy the following code chunk into your R Markdown file and press knit. We want to filter `demog` to focus on female participants and calculate their mean age. You should receive an error containing `Caused by error:! "..1$StudentIDE" must be a logical vector, not a double vector`, which is not the most helpful message for diagnosing the problem.
````{verbatim, lang = "markdown"}
```{r}
# filter for females then calculate mean age
demog %>%
filter(.data = demog,
GenderE == 2) %>%
summarise(.data = demog,
mean_age = mean(AgeE, na.rm = TRUE))