% !Rnw root = appendix.main.Rnw
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'container-chunk')
@
\chapter{Base \Rlang: ``Collective Nouns''}\label{chap:R:collective}
\begin{VF}
The information that is available to the computer consists of a selected set of \emph{data} about the real world, namely, that set which is considered relevant to the problem at hand, that set from which it is believed that the desired results can be derived. The data represent an abstraction of reality\ldots
\VA{Niklaus Wirth}{\emph{Algorithms $+$ Data Structures $=$ Programs}, 1976}\nocite{Wirth1976}
\end{VF}
\section{Aims of This Chapter}
Data set organisation and storage is one of the keys to efficient data analysis. The central question is how to keep together all the information that belongs together, say all measurements from an experiment and the corresponding metadata, such as treatments applied and/or dates. The title ``collective nouns'' is based on the idea that a data set is a collection of data objects.
In this chapter, you will become familiar with how data sets are usually managed in \Rlang. I use both abstract examples, to emphasise the general properties of data sets and the \Rlang classes available for their storage, and a few more specific examples, to exemplify their use in a more concrete way. While in chapter \ref{chap:R:as:calc} the focus was on atomic data types and objects, like vectors, useful for the storage of collections of values of a given type, like numbers, in the present chapter the focus is on the storage within a single object of heterogeneous data, such as a combination of factors, and character and numeric vectors. Broadly speaking, these are heterogeneous \emph{data containers}.
To describe the structure of \Rlang objects I use diagrams similar to those in the previous chapter.
\index{data sets!their storage|(}
\section{Data from Surveys and Experiments}
\index{data sets!origin}\index{data sets!characteristics}
The data we plot, summarise, and analyse in \Rlang, in most cases, originate from measurements done as part of experiments or surveys. Data collected mechanically from user interactions with websites or by crawling through internet content are, from a statistical perspective, survey data. The value of any data comes from knowing their origin, say the treatments applied to plants, or the country from which website users connect. Sometimes several properties are needed to describe the origin of the data, and in other cases observations consist of measurements of multiple properties on each subject under study. Consequently, all software designed for data analysis implements ways of dealing with data sets as a whole both during storage and when passing them as arguments to functions. A data set is a collection of related data, usually heterogeneous.
In \Rlang, lists are the most flexible type of objects useful for storing whole data sets. In most cases, we do not need this much flexibility, so rectangular collections of observations are most frequently stored in a variation upon lists called data frames. These objects can have as their members the vectors and factors described in chapter \ref{chap:R:as:calc}.
Any \Rlang object can have attributes, allowing objects to carry along additional bits of information. Some, like comments, are part of \Rlang and aimed at the storage of ancillary information or metadata by users. Other attributes are used internally by \Rlang and, finally, users can store arbitrary ancillary data using attributes created \emph{ad hoc}.
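As a minimal sketch of the two user-facing routes mentioned above, the chunk below attaches a comment and an \emph{ad hoc} attribute to a vector. The object name \code{vct.meta} and the attribute name \code{"my.units"} are made up for this illustration and are not used elsewhere.

<<sketch-attributes>>=
# a numeric vector of (hypothetical) plant heights
vct.meta <- c(10.2, 8.3, 12.0)
# 'comment' is an attribute with built-in support in R
comment(vct.meta) <- "plant heights, example data"
# an ad hoc attribute; the name "my.units" is an arbitrary choice
attr(vct.meta, "my.units") <- "cm"
comment(vct.meta)
attr(vct.meta, "my.units")
# printing shows the ad hoc attribute but not the comment
vct.meta
@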
\section{Lists}\label{sec:calc:lists}
\index{lists|(}\qRclass{list}
In \Rlang, \Rclass{list} objects are in several respects similar to the vectors described in chapter \ref{chap:R:as:calc} but, unlike vectors, the members they contain can be heterogeneous, i.e., different members of the same list can belong to different classes. In addition, while the member elements of a vector must be \emph{atomic} values like numbers or character strings, any \Rlang object can be a list member, including other lists.
In \Rlang, the members of a list can be considered as following a sequence, and are accessible through numerical indexes, the same as the members of vectors. Members of a list, as well as members of a vector, can be named and retrieved (indexed) through their names. In practice, named lists are more frequently used than named vectors. \Rlang lists are created, or constructed, with function \Rfunction{list()}, just as vectors are constructed with function \Rfunction{c()}.
\begin{explainbox}
\Rlang lists can have as members not only objects storing data on observations and categories, but also function definitions, model formulas, unevaluated expressions, matrices, arrays, and objects of user-defined classes.
\end{explainbox}
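The variety of possible members can be sketched with a small list built only for this purpose. The name \code{lst.mix} and its member names are hypothetical, and the extraction operators used to inspect it are described later in this section.

<<sketch-list-members>>=
# members: a function, a model formula, an unevaluated
# expression and a matrix (all names are illustrative)
lst.mix <- list(fun = mean,
                fm = y ~ x,
                ex = quote(a + b),
                mat = matrix(1:4, nrow = 2))
names(lst.mix)
class(lst.mix$fm)
# a function stored in a list can be called after extraction
lst.mix$fun(c(1, 2, 6))
dim(lst.mix[["mat"]])
@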
\begin{explainbox}
List and list-like objects are widely used in \Rlang because they make it possible to keep, for example, the data, instructions for operations, and results from operations together in a single \Rlang object that can be saved, copied, etc. as a unit. This avoids the proliferation of multiple disconnected objects with their interrelations being encoded only by their names, or even worse in separate notes or even in a person's memory---all approaches that are error-prone. Model fit functions described in chapter \ref{chap:R:statistics} are good examples of this approach. Objects used to store the instructions to build plots with multiple layers as described in chapter \ref{chap:R:plotting} are also good examples.
\end{explainbox}
Our first list has as its members three different vectors, each one belonging to a different class: \code{numeric}, \code{character} and \code{logical}. The three vectors also differ in their length: 3, 1, and 2, respectively.\qRfunction{list()}\qRfunction{names()}
<<lists-0>>=
lst1 <- list(x = 1:3, y = "ab", z = c(TRUE, FALSE))
@
<<lists-0a>>=
str(lst1)
names(lst1)
@
\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily, my shape/.style={
rectangle split, rectangle split parts=#1, draw, anchor=north, minimum size=12mm},
array/.style={matrix of nodes,nodes={draw, minimum size=1mm, fill=black},column sep=2pt, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=1mm}}}]
\matrix[array] (array) {
1 & 2 & 3 \\
\rule{10mm}{.1pt} & \rule{10mm}{.1pt} & \rule{10mm}{.1pt}\\};
\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-3.south east);
\end{scope}
\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut lst1}};
\draw (array-2-1.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (nameh) {\rotatebox{90}{x\strut}};
\draw (array-2-2.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (namec) {\rotatebox{90}{y\strut}};
\draw (array-2-3.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (namew) {\rotatebox{90}{z\strut}};
%\draw (array-1-2.north)--++(90:3mm) node [above] (first) {Index};
\draw (array-1-3.east)--++(0:12.5mm) node [right]{\code{integer} positional indices};
\draw (array-2-3.east)--++(0:8mm) node [right]{\textsl{heterogeneous} class, \textsl{varying} length};
\draw (namew)--++(0:15mm) node [right]{\code{character} member names};
%
\node [my shape=3, rectangle split, fill=blue!20] at (-1.3,-.25)
{1\strut\nodepart{two}2\strut\nodepart{three}3\strut};
\node [my shape=1, fill=red!20] at (0,-.25)
{``ab''\strut};
\node [my shape=2, fill=yellow!20] at (1.3,-.25)
{TRUE\strut\nodepart{two}FALSE\strut};
%\draw (-0.6,+0.65) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut lst1}};
\end{tikzpicture}
\end{footnotesize}
\end{center}
\begin{warningbox}
It is best to use informative names for accessing \code{list} members, as their members are heterogeneous, usually containing loosely related/connected data. Names make code easier to understand and mistakes more visible. Using names also makes code more robust to future changes in the position of list members in lists created upstream of our own \Rlang code. Below, we use both positional indices and names to highlight the similarities between lists and vectors.
\end{warningbox}
Lists can behave as vectors with heterogeneous elements as members, as we will describe next. Lists can also be nested, so tree-like structures are also possible (see section \ref{sec:calc:lists:nested} on page \pageref{sec:calc:lists:nested}).
%{ \tikzstyle{every node}=[draw=black,thick,anchor=west,fill=blue!10]
% \tikzstyle{root}=[dashed,fill=gray!50]
%\sffamily
%\centering
%\footnotesize
%\begin{tikzpicture}[%
% grow via three points={one child at (0.5,-0.55) and
% two children at (0.5,-0.55) and (0.5,-1.1)},
% edge from parent path={(\tikzparentnode.south) |- (\tikzchildnode.west)}]
% \node [root] {lst1}
% child { node {\$ x: int [1:6] 1 2 3 4 5 6}}
% child { node {\$ y: chr "a"}}
% child { node {\$ z: logi [1:2] TRUE FALSE}};
%\end{tikzpicture}
%}
\begin{faqbox}{How to create an empty list?}
In the same way as \code{numeric()} by default creates a \code{numeric} vector of length zero, \Rfunction{list()} by default creates a \code{list} object with no members.
<<list-empty-faq>>=
list()
@
\end{faqbox}
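A related case, sketched here under the assumption that the number of members is known in advance, is the creation of a list of a given length whose members are all \code{NULL}. Function \Rfunction{vector()} can be used for this; the name \code{lst.empty3} is only illustrative.

<<sketch-list-prealloc>>=
# an empty list has length zero
length(list())
# a list with three members, each initially NULL
lst.empty3 <- vector(mode = "list", length = 3)
length(lst.empty3)
str(lst.empty3)
@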
\subsection{Member extraction, deletion and insertion}
In\index{lists!member extraction|(}\index{lists!member indexing|see{lists, member extraction}}\index{lists!deletion and addition of members|(} section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}, we saw that the extraction operator \Roperator{[ ]} applied to a vector returns a vector, longer or shorter, possibly of length one, or even length zero. Similarly, applying operator \Roperator{[ ]} to a list returns a list, possibly of different length: \code{lst1["x"]} or \code{lst1[1]} return a list containing only one member, the numeric vector stored at the first position of \code{lst1}. In the last statement in the chunk below, \code{lst1[c(1, 3)]} returns a list of length two as expected.
<<lists-1a>>=
lst1["x"]
lst1[1]
lst1[c(1, 3)]
@
As with vectors, negative positional indices remove members instead of extracting them. See page \pageref{par:calc:lists:rm} for a safer approach to the deletion of list members.
<<lists-1ay>>=
lst1[-1]
lst1[c(-1, -3)]
@
Using operator \Roperator{[[ ]]} (double square brackets) for indexing a list extracts the element stored in the list, in its original mode. In the example below, \code{lst1[["x"]]} and \code{lst1[[1]]} return a numeric vector. We might say that extraction operator \Roperator{[[ ]]} reaches ``deeper'' into the list than operator \Roperator{[ ]}. Operator \Roperator{\$}, used in the first statement below, provides a shorthand notation, equivalent to calling \Roperator{[[ ]]} with a single constant \code{character} value as argument.
<<lists-1>>=
lst1$x
lst1[["x"]]
lst1[[1]]
@
\begin{explainbox}\label{box:extraction:opers}
We mentioned above that indexing by name can be done either with double square brackets, \Roperator{[[ ]]}, or with \Roperator{\$}. Operators \Roperator{[ ]} and \Roperator{[[ ]]} work like normal \Rlang functions, accepting both constant values and variables as arguments for indexing. In contrast, \Roperator{\$}, mainly intended for use when typing at the console, accepts only bare member names on its \emph{rhs}. With \Roperator{[[ ]]}, the name of the variable or column is given as a character string, enclosed in quotation marks, or as a variable with mode \code{character}. A number as a positional index is also accepted.
<<index-partial-1>>=
lst1a <- list(abcd = 123, xyzw = 789)
lst1a[[1]]
lst1a[["abcd"]]
vct1 <- "abcd"
lst1a[[vct1]]
@
When using \Roperator{\$}, the name is entered as a constant, without quotation marks, and cannot be a variable or a number.
<<index-partial-1a>>=
lst1a$abcd
lst1a$ab
lst1a$a
@
Both in the case of lists and data frames (see section \ref{sec:R:data:frames} on page \pageref{sec:R:data:frames}), when using double square brackets, by default an exact match is required between the name in the object and the name used for indexing. In contrast, with \Roperator{\$}, an unambiguous partial match is silently accepted. For interactive use, partial matching reduces the amount of text typed at the console. However, in scripts, and especially in \Rlang code in packages, it is best to avoid the use of \Roperator{\$}, as a partial match to the wrong member, one added or renamed at a later time, e.g., when someone else revises the script, can lead to difficult-to-diagnose errors.
In addition, as \Roperator{\$} is implemented by first attempting a match to the name and then calling \Roperator{[[ ]]}, using \Roperator{\$} for indexing can result in slightly slower performance compared to using \Roperator{[[ ]]}. It is possible to set \Rlang option \code{warnPartialMatchDollar} so that partial matching triggers a warning when using \Roperator{\$} to extract a member, which can be very useful when debugging.
\end{explainbox}
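The difference in matching between the two operators, and the option mentioned in the box above, can be sketched as follows. The list \code{lst.pm} and the variable \code{old.opts} are hypothetical names used only here. With \Roperator{[[ ]]}, partial matching has to be requested by passing \code{exact = FALSE}.

<<sketch-partial-match>>=
lst.pm <- list(abcd = 123, xyzw = 789)
# exact matching is the default with [[ ]]: no member is named "ab"
lst.pm[["ab"]]
# partial matching has to be requested explicitly
lst.pm[["ab", exact = FALSE]]
# ask for a warning on partial matches with $, saving the old setting
old.opts <- options(warnPartialMatchDollar = TRUE)
lst.pm$ab
# restore the previous setting
options(old.opts)
@

Saving and restoring the option, as done here, keeps the change local to the chunk.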
<<lists-1az>>=
is.vector(lst1[1])
is.list(lst1[1])
is.vector(lst1[[1]])
is.list(lst1[[1]])
@
The two extraction operators can be used together as shown below, with \code{lst1[[1]]} extracting the vector from \code{lst1} and \code{[3]} extracting the member at position 3 of the vector.
<<lists-1ax>>=
lst1[[1]][3]
@
Extraction\label{par:calc:list:member:assign} operators can be used on the \emph{lhs} as well as on the \emph{rhs} of an assignment, and lists can be empty, i.e., be of length zero. The example below makes use of this to build a list step by step.
<<lists-pg-01, eval=eval_playground>>=
lst2 <- list()
lst2[["x"]] <- 1:3
lst2[["y"]] <- "ab"
lst2[["z"]] <- c(TRUE, FALSE)
@
\begin{playground}
Compare \code{lst2} to \code{lst1}, used for the examples above. Then run the code below and compare them again. Try to understand why \code{lst2} has changed as it did. Pay also attention to possible changes to the members' names.
<<lists-pg-02, eval=eval_playground>>=
lst2[["y"]] <- lst2[["x"]]
@
\end{playground}
\begin{explainbox}
\emph{Lists}, as usually defined in languages like \Clang, are based on pointers to memory locations, with pointers stored at each node. These pointers chain or link the different member nodes (this allows, for example, sorting of lists in place by modifying the pointers). In such implementations, indexing by position is not possible, or at least requires ``walking'' down the list, node by node. \Rlang does not implement pointers to ``addresses'', or locations, in memory. In \Rlang, \code{list} members can be accessed through positional indexes or member names, similarly to vector members. Of course, as with vectors, insertions and deletions in the middle of a list shift the position of members, and change which member is pointed at by indexes for positions past the modified location. The names, in contrast, remain valid.
<<lists-eb-xx>>=
list(a = 1, b = 2, c = 3)[-2]
@
\end{explainbox}
Three frequent operations on lists are concatenation, insertions, and deletions.\index{lists!insert into}\index{lists!append to} The same functions as with vectors are used: \Rfunction{c()}, to concatenate, and \Rfunction{append()}, to append and insert. Lists can be combined only with other lists, otherwise, these operations work as with vectors (see pages \pageref{par:calc:concatenate}--\pageref{par:calc:append:end}).
<<lists-1b>>=
lst3 <- append(lst1, list(yy = 1:10, zz = letters[5:1]), after = 2)
lst3
@
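Concatenation with \Rfunction{c()}, mentioned above but not shown, can be sketched with two small lists. The names \code{lst.a}, \code{lst.b} and \code{lst.ab} are made up for this illustration.

<<sketch-list-concat>>=
lst.a <- list(a = 1:2, b = "x")
lst.b <- list(c = TRUE)
# c() returns a single list with the members of both lists
lst.ab <- c(lst.a, lst.b)
names(lst.ab)
length(lst.ab)
# in contrast, list() nests the two lists instead of joining them
length(list(lst.a, lst.b))
@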
To\label{par:calc:lists:rm} delete a member from a list, we assign \code{NULL} to it.
<<lists-1c>>=
lst1$y <- NULL
lst1
@
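Because assigning \code{NULL} deletes a member, storing \code{NULL} as the value of a member requires a different statement. A minimal sketch, using a hypothetical list \code{lst.del}, is to assign \code{list(NULL)} through the \Roperator{[ ]} operator.

<<sketch-list-delete>>=
lst.del <- list(x = 1:3, y = "ab", z = TRUE)
# assigning NULL with [[ ]] deletes member y
lst.del[["y"]] <- NULL
names(lst.del)
# assigning list(NULL) with [ ] keeps member z, now holding NULL
lst.del["z"] <- list(NULL)
names(lst.del)
str(lst.del)
@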
To investigate the members contained in a list, function \Rfunction{str()} (\emph{structure}), used above, is convenient, especially when lists have many members. Function \Rfunction{str()} formats lists more compactly than \code{print()} applied directly to a list.\label{par:calc:str}
<<lists-1aa>>=
print(lst1)
str(lst1)
@
\index{lists!deletion and addition of members|)}\index{lists!member extraction|)}
\subsection{Nested lists}\label{sec:calc:lists:nested}
Lists\index{lists!nested} can be nested, i.e., lists of lists can be constructed to an arbitrary depth. In the example below, \code{lst4} and \code{lst5} are members of \code{lst6}, i.e., \code{lst4} and \code{lst5} are nested within \code{lst6}.
<<lists-2>>=
lst4 <- list("a", "aa", 10)
lst5 <- list("b", TRUE)
lst6 <- list(A = lst4, B = lst5) # nested
str(lst6)
@
A nested\index{lists!nested} list can alternatively be constructed within a single statement in which several member lists are created. Here we combine the first three statements in the earlier chunk into a single one.
<<lists-3>>=
lst7 <- list(A = list("a", "aa", 10), B = list("b", TRUE))
str(lst7)
@
A list can contain a combination of \code{list} and \code{vector} members.
<<lists-3s>>=
lst8 <- list(A = list("a", "aa", 10),
B = list("b", TRUE),
C = c(1, 3, 9),
D = 4321)
str(lst8)
@
\begin{explainbox}
The logic behind the extraction of members of nested lists using indexing is the same as for simple lists, but applied recursively; e.g., \code{lst7[[2]]} extracts the second member of the outermost list, which is another list. As this is a list, its members can be extracted by applying the extraction operator again: \code{lst7[[2]][[1]]}. It is important to remember that these concatenated extraction operations are written so that the leftmost operator is applied to the outermost list.
The example above uses the \Roperator{[[ ]]} operator, but the left-to-right precedence also applies to concatenated calls to \Roperator{[ ]} and to calls combining both operators.
\end{explainbox}
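The recursive logic described in the box can be followed step by step in the sketch below. The list \code{lst.nest} is a hypothetical copy of the structure used above, created here so that the chunk is self-contained. The last statement relies on \Roperator{[[ ]]} accepting a vector of indices, which is applied recursively to nested lists.

<<sketch-nested-extract>>=
lst.nest <- list(A = list("a", "aa", 10), B = list("b", TRUE))
# the second member of the outer list is itself a list
class(lst.nest[[2]])
# the first member of that inner list
lst.nest[[2]][[1]]
# names and positions can be combined
lst.nest[["B"]][[2]]
# a vector of indices passed to [[ ]] is applied recursively
lst.nest[[c(2, 1)]]
@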
\begin{playground}
What\index{lists!nested} do you expect each of the statements below to return? \emph{Before running the code}, predict what value and of which mode each statement will return. You may use implicit or explicit calls to \Rfunction{print()}, or calls to \Rfunction{str()} to visualise the structure of the different objects.
% not handled correctly by knitr, works at console.
<<lists-PG4, eval=eval_playground>>=
LST9 <- list(A = list("a", "aa", "aaa"), B = list("b", "bb"))
# str(LST9)
LST9[2:1]
LST9[1]
LST9[[1]][2]
LST9[[1]][[2]]
LST9[2]
LST9[2][[1]]
@
\end{playground}
\begin{explainbox}\index{lists!structure}
When dealing with deep lists, it is sometimes useful to limit the number of levels of nesting returned by \Rfunction{str()} by passing a \code{numeric} argument to parameter \code{max.level}.
<<lists-EB1b>>=
str(lst8, max.level = 1)
@
\end{explainbox}
Sometimes we need to flatten a list\index{lists!flattening}\index{lists!nested}, or a nested structure of lists within lists. Function \Rfunction{unlist()} is what should be normally used in such cases.
The list \code{lst10} is a nested system of lists, but all the ``terminal'' members are character strings. In other words, terminal nodes are all of the same \code{mode}, allowing the list to be ``flattened'' into a character vector.\qRfunction{is.list()}
<<lists-5>>=
lst10 <- list(A = list("a", "aa", "aaa"), B = list("b", "bb"))
vct1 <- unlist(lst10)
vct1
is.list(lst10)
is.list(vct1)
mode(lst10)
mode(vct1)
names(lst10)
names(vct1)
@
The returned value is a vector with named member elements. We use function \Rfunction{str()} to figure out how this vector relates to the original list. The names, always of mode character, are based on the names of list elements when available, while characters depicting positions as numbers are used for anonymous nodes. We can access the members of the vector either through numeric indexes or names.
<<lists-6>>=
str(vct1)
vct1[2]
vct1["A2"]
@
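When the terminal members are not all of the same mode, \Rfunction{unlist()} still returns a vector, but the values are coerced to a common mode. The sketch below uses a hypothetical list, \code{lst.mixed}, with \code{character}, \code{numeric} and \code{logical} terminal members.

<<sketch-unlist-mixed>>=
lst.mixed <- list(A = list("a", 10), B = TRUE)
vct.mixed <- unlist(lst.mixed)
# all values are converted to character strings
mode(vct.mixed)
vct.mixed
@

As this conversion is silent, checking the mode of the returned vector is a simple safeguard.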
\begin{playground}
Function \Rfunction{unlist()}\index{lists!convert into vector} has two additional parameters, with default argument values, which we did not modify in the example above. These parameters are \code{recursive} and \code{use.names}, both of them expecting a \code{logical} value as an argument. Modify the statement \code{vct1 <- unlist(lst10)}, by passing \code{FALSE} as an argument to these two parameters, in turn, and in each case, study the value returned and how it differs with respect to the one obtained above.
\end{playground}
Function \Rfunction{unname()} can be used to remove names safely---i.e., without risk of altering the mode or class of the object.
<<lists-7>>=
unname(vct1)
unname(lst10)
@
\index{lists|)}
<<lists-cleanup, include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@
\section{Data Frames}\label{sec:R:data:frames}
\index{data frames|(}\qRclass{data.frame}
\index{worksheet@`worksheet'|see{data frame}}
Data frames are a special type of list, in which all members have the same length, giving rise to a matrix-like object in which columns can belong to different classes. Most commonly the member ``columns'' are vectors or factors, but they can also be matrices with the same number of rows as the enclosing data frame, or lists with the same number of members as rows in the enclosing data frame.
Data frames are central to most data manipulation and analysis procedures in \Rlang. They are commonly used to store observations, with \code{numeric} columns holding data for continuous variables and \code{factor} columns data for categorical variables. Binary variables can be stored in \code{logical} columns. Text data can be stored in \code{character} columns. Date and time can be stored in columns of specific classes, such as \code{POSIXct}. In the diagram below, column \code{treatment} is a factor with two levels encoding two conditions, \code{hot} and \code{cold}. Columns \code{height} and \code{weight} are numeric vectors containing measurements.
\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily, my shape/.style={
rectangle split, rectangle split parts=#1, draw, anchor=north, minimum size=12mm},
array/.style={matrix of nodes,nodes={draw, minimum size=1mm, fill=black},column sep=2pt, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=1mm}}}]
\matrix[array] (array) {
1 & 2 & 3 \\
\rule{10mm}{.1pt} & \rule{10mm}{.1pt} & \rule{10mm}{.1pt}\\};
\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-3.south east);
\end{scope}
\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut df1}};
\draw (array-2-1.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-17mm, yshift=-3mm, above] (nameh) {\rotatebox{180}{treatment\strut}};
\draw (array-2-2.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-14.4mm, yshift=-3mm, above] (namec) {\rotatebox{180}{height\strut}};
\draw (array-2-3.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-14.5mm, yshift=-3mm, above] (namew) {\rotatebox{180}{weight\strut}};
%\draw (array-1-2.north)--++(90:3mm) node [above] (first) {Index};
\draw (array-1-3.east)--++(0:12.5mm) node [right]{\code{integer} positional indices};
\draw (array-2-3.east)--++(0:8mm) node [right]{\textsl{heterogeneous} class, \textsl{same} length};
\draw (namew)--++(0:15mm) node [right]{\code{character} column names};
%
\node [my shape=4, rectangle split, fill=green!20] at (-1.3,-.25)
{hot\strut\nodepart{two}cold\strut\nodepart{three}hot\strut\nodepart{four}\ldots\strut};
\node [my shape=4, fill=blue!20] at (0,-.25)
{10.2\strut\nodepart{two}\phantom{1}8.3\strut\nodepart{three}12.0\strut\nodepart{four}\ldots\strut};
\node [my shape=4, fill=blue!20] at (1.3,-.25)
{2.2\strut\nodepart{two}3.3\strut\nodepart{three}2.5\strut\nodepart{four}\ldots\strut};
%\draw (-0.6,+0.65) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut a.list}};
\end{tikzpicture}
\end{footnotesize}
\end{center}
Data frames are created with constructor function \Rfunction{data.frame()} with a syntax similar to that used for lists.\qRfunction{colnames()}\qRfunction{rownames()}\qRfunction{is.data.frame()}
<<data-frames-0>>=
df1 <- data.frame(treatment = factor(rep(c("hot", "cold"), 3)),
height = c(10.2, 8.3, 12.0, 9.0, 11.2, 8.7),
weight = c(2.2, 3.3, 2.5, 2.8, 2.4, 3.0))
df1
colnames(df1)
rownames(df1)
str(df1)
class(df1)
mode(df1)
is.data.frame(df1)
is.list(df1)
@
We can see above that when printed each row of a \code{data.frame} is preceded by a row name. Row names are character strings, just like column names. The \Rfunction{data.frame()} constructor adds by default row names representing running numbers. Default row names are rarely of much use, except to track insertions and deletions of rows during debugging.
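Row names need not be running numbers. As a sketch, assuming that each row describes one named subject, they can be set when calling the constructor through its parameter \code{row.names} and later used for extraction. The data frame \code{df.rn} and the names of the plants are invented for this example.

<<sketch-row-names>>=
df.rn <- data.frame(height = c(10.2, 8.3),
                    weight = c(2.2, 3.3),
                    row.names = c("plant.a", "plant.b"))
rownames(df.rn)
# a row can be extracted by its name
df.rn["plant.b", ]
@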
\begin{playground}
As the expectation is that all member variables (or ``columns'') have equal length, if vectors of different lengths are supplied as arguments, the shorter vector(s) is/are recycled, possibly several times, until the required full length is reached, as shown below for \code{treatment}.
<<data-frames-0a>>=
df2 <- data.frame(treatment = factor(c("hot", "cold")),
height = c(10.2, 8.3, 12.0, 9.0, 11.2, 8.7),
weight = c(2.2, 3.3, 2.5, 2.8, 2.4, 3.0))
@
Are \code{df1} created above and \code{df2} created here equal?
\end{playground}
With function \Rfunction{class()} we can query the class of an \Rlang object (see section \ref{sec:rlang:mode} on page \pageref{sec:rlang:mode}). As we saw in the previous chunk, \code{list} and \code{data.frame} objects belong to two different classes. However, their \code{mode} is the same. Consequently, data frames inherit the methods and characteristics of lists, as long as they have not been hidden by new ones defined for data frames (for an explanation of \emph{methods}, see section \ref{sec:methods} on page \pageref{sec:methods}).
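The relationship between the two classes can be checked with a few queries, sketched here on a small hypothetical data frame, \code{df.cl}. The list-like behaviour shows, for example, in \Rfunction{length()} returning the number of member columns rather than the number of rows.

<<sketch-df-class>>=
df.cl <- data.frame(g = factor(c("a", "b", "a")),
                    v = c(1.5, 2.5, 3.5))
class(df.cl)
mode(df.cl)
inherits(df.cl, "data.frame")
is.list(df.cl)
# as for a list, length() counts members, i.e., columns
length(df.cl)
nrow(df.cl)
@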
Extraction of individual member variables or ``columns'' can be done as in a list, with operators \Roperator{[[ ]]} and \Roperator{\$} (see the call-out on page \pageref{box:extraction:opers}).
<<data-frames-1>>=
df1$height
df1[["height"]]
df1[[2]]
class(df1[["height"]])
@
In the same way as with lists, we can add member variables to data frames. Recycling takes place if needed.
<<data-frames-2>>=
df1$x2 <- 6:1
df1[["x3"]] <- "b"
str(df1)
@
\begin{playground}
We have added two columns to the data frame, and in the case of column \code{x3} recycling took place. This is where lists and data frames differ substantially in their behaviour. In a data frame, although class and mode can be different for different member variables (columns), they are required to be vectors or factors of the same length (or a matrix with the same number of rows, or a list with the same number of members). In the case of lists, there is no such requirement, and recycling never takes place when adding a member. Compare the values returned below for \code{LST1}, to those in the example above for \code{df1}.
<<data-frames-2a>>=
LST1 <- list(x = 1:6, y = "a", z = c(TRUE, FALSE))
str(LST1)
LST1$x2 <- 6:1
LST1$x3 <- "b"
str(LST1)
@
\end{playground}
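Recycling in data frames has a limit, which the sketch below makes visible: a shorter vector is recycled only when the number of rows is an exact multiple of its length, and otherwise the assignment fails. The data frame \code{df.rc} is hypothetical, and \Rfunction{try()} is used only so that the error does not stop the script.

<<sketch-df-recycle>>=
df.rc <- data.frame(a = 1:6)
# length 2 divides 6 exactly: the values are recycled
df.rc$b <- c("x", "y")
df.rc$b
# length 4 does not divide 6: an error is triggered
res <- try(df.rc$c <- 1:4, silent = TRUE)
class(res)
# the failed assignment leaves the data frame unchanged
names(df.rc)
@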
\begin{faqbox}{How to create an empty data frame?}
In the same way as \code{numeric()} creates a \code{numeric} vector of length zero, \Rfunction{data.frame()} by default creates a \code{data.frame} with zero rows and no columns.
<<data-frame-empty-faq>>=
data.frame()
@
\end{faqbox}
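A variation that can be useful when rows are to be added later is a data frame with no rows but with columns of known classes. A minimal sketch, with invented column names, passes zero-length vectors to the constructor.

<<sketch-df-empty>>=
df.empty <- data.frame(treatment = character(),
                       height = numeric())
# zero rows, two columns
dim(df.empty)
str(df.empty)
@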
\begin{faqbox}{How to make a list of data frames?}
We create a list of data frames in the same way as we create a nested list of lists or, in fact, a list of any other \Rlang objects. See section \ref{sec:calc:lists:nested} on page \pageref{sec:calc:lists:nested}.
<<data-frame-listof-faq>>=
list(df1, df2)
@
\end{faqbox}
\begin{faqbox}{How to add a new column to a data frame (to the front and end)?}
In the same way as we can assign a new member to a list using the extraction operator \Roperator{[[ ]]}, we can add a new column to a data frame (see page \pageref{par:calc:list:member:assign}). In this case, if the column name does not already exist, the assigned vector or factor is appended as the last column (a shorter vector or factor is recycled only if the number of rows is an exact multiple of its length; otherwise an error is triggered).
<<data-frame-add-co1l-faq>>=
DF1 <- data.frame(A = 1:5, B = factor(5:1))
DF1[["C"]] <- 11:15
DF1
@
To add a column at the front, we can use function \Rfunction{cbind()} (column bind).
<<data-frame-add-col2-faq>>=
DF2 <- data.frame(A = 1:5, B = factor(5:1))
cbind(C = 11:15, DF2)
@
\end{faqbox}
Being two-dimensional and rectangular in shape, data frames behave similarly to matrices in relation to indexing and dimensions. They have two margins, rows and columns, and thus two indices are used to indicate the location of a member ``cell''. We provide some examples here, but please consult section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing} and section \ref{sec:matrix:array} on page \pageref{sec:matrix:array} for additional details.
Matrix-like notation allows simultaneous extraction from multiple columns, which is not possible with lists. The value returned is in most cases a ``smaller'' data frame as in this example.
<<data-frames-bx-03>>=
df1[2:3, 1:2]
@
<<data-frames-3>>=
# first column, df1[[1]] preferred
df1[ , 1]
# column "treatment", df1[["treatment"]] or df1$treatment preferred
df1[ , "treatment"]
# first row
df1[1, ]
# first two rows of the third and fourth columns
df1[1:2, c(FALSE, FALSE, TRUE, TRUE, FALSE)]
# the rows for which comparison is true
df1[df1$treatment == "hot" , ]
# the heights > 8
df1[df1$height > 8, "height"]
@
As explained earlier for vectors (see section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}), indexing can be present both on the right- and left-hand sides of an assignment, allowing the replacement of both individual values and rectangular regions.
The next few examples do assignments to ``cells'' of \code{df1}, either to one whole column, or individual values. The last statement in the chunk below copies a number from one location to another by using indexing of the same data frame both on the right side and left side of the assignment.\qRoperator{[[ ]]}\qRoperator{[ ]}
<<data-frames-3a>>=
df1[1, 2] <- 99
df1
df1[ , 2] <- -99
df1
df1[["height"]] <- c(10, 12)
df1
df1[1, 2] <- df1[6, 3]
df1
df1[3:6, 2] <- df1[6, 3]
df1
@
As with matrices, if we extract a single column from a data frame using matrix-like indexing, it is by default simplified into a vector or factor, i.e., the column-dimension is dropped. By passing \code{drop = FALSE}, we can prevent this. Contrary to matrices, rows are not simplified in the case of data frames.
<<data-frames-2b>>=
is.data.frame(df1[1, ])
is.data.frame(df1[ , 2])
is.data.frame(df1[ , "treatment"])
is.data.frame(df1[1:2, 2:3])
is.vector(df1[1, ])
is.vector(df1[ , 2])
is.factor(df1[ , "treatment"])
is.vector(df1[1:2, 2:3])
@
<<data-frames-2bb>>=
is.data.frame(df1[ , 1, drop = FALSE])
is.data.frame(df1[ , "treatment", drop = FALSE])
@
\begin{warningbox}
In contrast to matrices and data frames, the extraction operator \Roperator{[ ]} of tibbles---defined in package \pkgname{tibble}---never simplifies returned one-column tibbles into vectors (see section \ref{sec:data:tibble} on page \pageref{sec:data:tibble} for details on the differences between data frames and tibbles).
\end{warningbox}
Usually data frames are created from lists or by passing individual vectors and factors to the constructors. It is also possible to construct data frames starting from matrices, other data frames and named vectors, in which case the ``as is'' function \Rfunction{I()} can be used to protect them from interpretation by the \Rfunction{data.frame()} constructor. In these cases, additional nuances become important. The details are well described in \code{help(data.frame)}.
With a named numeric vector, and a factor as arguments, the names are moved from the vector to the rows of the data frame!
<<data-frames-bx-constr-01>>=
vct1 <- c(one = 1, two = 2, three = 3, four = 4)
fct1 <- as.factor(c(1, 2, 3, 2))
df1 <- data.frame(fct1, vct1)
df1
df1$vct1
@
If the vector is protected with \Rlang's ``as is'' function \Rfunction{I()}, the names are not moved, as can be seen by extracting the column \code{vct1} from data frame \code{df2}.
<<data-frames-bx-constr-02>>=
df2 <- data.frame(fct1, I(vct1))
df2
df2$vct1
@
\begin{explainbox}
With a matrix instead of a vector, the matrix is split into separate columns in the data frame. If the matrix has no column names, new ones are created.
<<data-frames-bx-constr-04>>=
mat1 <- matrix(1:12, ncol = 3)
df4 <- data.frame(fct1, mat1)
@
<<data-frames-bx-constr-04a>>=
df4
@
If the matrix is protected with function \Rfunction{I()}, it is not split, and the whole matrix becomes a column in the data frame.
<<data-frames-bx-constr-05>>=
df5 <- data.frame(fct1, I(mat1))
df5
df5$mat1
@
\end{explainbox}
\begin{explainbox}
With a list whose members are vectors, each member of the list becomes a column in the data frame. In the case of too short members, recycling is applied.
<<data-frames-bx-constr-06>>=
lst1 <- list(a = 4:1, b = letters[4:1], c = "n", d = "z")
df6 <- data.frame(fct1, lst1)
df6
@
If the list is protected with \Rfunction{I()}, the list is added in whole as a variable or column in the data frame. In this case, the length of the list must match the number of rows in the data frame, while the length and class of the individual members of the list can vary. The names of the list members are used to set the \code{rownames} of the data frame.
This is similar to the default behaviour of tibbles, while \Rlang data frames require explicit use of \Rfunction{I()} for lists not to be split (see chapter \ref{chap:R:data} on page \pageref{chap:R:data} for details about package \pkgname{tibble}).
<<data-frames-bx-constr-07>>=
df7 <- data.frame(fct1, I(lst1))
df7
@
<<data-frames-bx-constr-07b>>=
df7$lst1
@
\end{explainbox}
\begin{advplayground}
What do we gain using \Rfunction{I()}? Check the documentation carefully and think of uses where the flexibility gained by the option to protect or not the arguments passed to the \Rfunction{data.frame()} constructor can be useful. In addition, write \Rlang statements to extract individual members of embedded matrices or lists using indexing. Finally, test if the behaviour of \Rfunction{I()} is the same when assigning new member variables (or ``columns'') to an existing data frame.
\end{advplayground}
\subsection{Sub-setting data frames}\label{sec:calc:df:subset}
When\index{data frames!subsetting}\index{data frames!``filtering rows''} the names of data frames are long, complex conditions become awkward to write using indexing---i.e., subscripts. In such cases, \Rfunction{subset()} is handy because it evaluates the condition with the data frame as the ``environment'', i.e., the names of the columns are recognised if entered directly when writing the condition. Function \Rfunction{subset()} ``filters'' rows, usually corresponding to observations or experimental units. The condition is computed for each row, and if it returns \code{TRUE}, the row is included in the returned data frame, and excluded if \code{FALSE}.
We create a data frame with six rows and three columns. For column \code{y}, we rely on \Rlang automatically extending \code{"a"} by repeating it six times, while for column \code{z}, we rely on \Rlang automatically extending \code{c(TRUE, FALSE)} by repeating it three times.
<<data-frames-4>>=
df8 <- data.frame(x = 1:6, y = "a", z = c(TRUE, FALSE))
subset(df8, x > 3)
@
\begin{advplayground}
What is the behaviour of \code{subset()} when the condition is \code{NA}? Find the answer by writing code to test this, for a case where tests for different rows return \code{NA}, \code{TRUE} and \code{FALSE}.
\end{advplayground}
When calling functions that return a vector, data frame, or other structure, the extraction operators \Roperator{[ ]}, \Roperator{[[ ]]}, or \Roperator{\$} can be appended to the rightmost parenthesis of the function call, in the same way as to the name of a variable holding the same data.
<<data-frames-5>>=
subset(df8, x > 3)[ , -3]
subset(df8, x > 3)[ , "x", drop = FALSE]
subset(df8, x > 3)[ , "x"]
@
\begin{advplayground}
When do extraction operators applied to data frames return a vector or factor, and when do they return a data frame? Please experiment with your own code examples to work out the answer.
\end{advplayground}
\begin{explainbox}
In the case of \Rfunction{subset()}, we can select columns directly as shown below, while for most other functions, extraction using operators \Roperator{[ ]}, \Roperator{[[ ]]}, or \Roperator{\$} is needed.
<<data-frames-5aa>>=
subset(df8, x > 3, select = 2)
@
<<data-frames-5ab>>=
subset(df8, x > 3, select = x)
@
<<data-frames-5ac>>=
subset(df8, x > 3, select = "x")
@
\end{explainbox}
None of the examples in the last four code chunks alters the original data frame \code{df8}. We can store the returned value using a new name if we want to preserve \code{df8} unchanged, or we can assign the result to \code{df8}, deleting in the process the previously stored value.
\begin{warningbox}
In the examples above, the names in the expression passed as the second argument to \code{subset()} were searched within \code{df8} and found. However, if not found in the data frame, objects with matching names are searched for in the global environment (outside the data frame, and visible in the user's workspace or enclosing environment). With no variable \code{A} present in data frame \code{df8}, vector \code{A} from the environment is silently used in the chunk below resulting in a returned data frame with no rows as \code{A > 3} returns \code{FALSE}.
<<data-frames-5b>>=
A <- 1
subset(df8, A > 3)
@
This also applies to the expression passed as argument to parameter \code{select}, here shown as a way of selecting columns based on names stored in a character vector.
<<data-frames-5c>>=
columns <- c("x", "z")
subset(df8, select = columns)
@
The use of \Rfunction{subset()} is convenient, but more prone to bugs compared to directly using the extraction operator \code{[ ]}. This same ``cost'' to achieving convenience applies to functions like \Rfunction{attach()} and \Rfunction{with()} described below. The longer a script is expected to be used, adapted, and reused, the more careful we should be when using any of these functions. An alternative way of avoiding excessive verbosity is to keep the names of data frames short.
\end{warningbox}
A frequently used way of deleting a column by name from a data frame is to assign \code{NULL} to it---i.e., in the same way as members are usually deleted from \code{list}s. This approach modifies \code{df9} in place, rather than returning a modified copy of \code{df9}.
<<data-frames-6>>=
df9 <- df8
head(df9)
df9[["y"]] <- NULL
head(df9)
@
Alternatively, negative indexing can be used to remove columns from a copy of a data frame. In this example, a single column is removed. As base \Rlang does not support negative indexing by name with the extraction operator, the numerical index of the column to delete needs to be obtained first. (In contrast, \Rfunction{subset()} accepts negated bare names, as in \code{select = -y}, to delete columns.)
<<data-frames-6a>>=
df8[ , -which(colnames(df8) == "y")]
@%
\pagebreak
Instead of using the equality test, we can use the operator \code{\%in\%} or function \code{grepl()} to create a \code{logical} vector useful for deleting or selecting multiple columns in a single statement.
\begin{playground}
In an earlier code chunk, we deleted column \code{y} of the data frame \code{df8}, but as we used the extraction operator only on the right-hand side, we modified only the returned copy of \code{df8}, leaving \code{df8} unchanged. Thus we reuse it here for a surprising trick. You should first untangle how it changes the positions of columns and rows, and afterwards think how and why indexing with the extraction operator \Roperator{[ ]} on both sides of the assignment operator \Roperator{<-} can be useful when working with data.
<<data-frames-7, eval=eval_playground>>=
df8[1:6, c(1,3)] <- df8[6:1, c(3,1)]
df8
@
\end{playground}
\begin{warningbox}
Although in this last example we used numeric indexes to make it more interesting, in practice, especially in scripts or other code that will be reused, do use column or member names instead of positional indexes whenever possible. This makes code much more reliable, as changes elsewhere in the script could alter the order of columns and \emph{invalidate} numerical indexes. In addition, using meaningful names makes programmers' intentions easier to understand.
\end{warningbox}
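Returning to the use of \code{\%in\%} and \code{grepl()} to select or delete several columns by name, the chunk below is a minimal sketch. The data frame \code{df.sel} and its column names are made up for this illustration. In both statements only the returned copy differs from \code{df.sel}.
<<data-frames-sk-colsel>>=
df.sel <- data.frame(x1 = 1:3, x2 = 4:6, y = letters[1:3], z = c(TRUE, FALSE, NA))
# delete columns y and z, matched by name
df.sel[ , !colnames(df.sel) %in% c("y", "z")]
# keep the columns with names starting with "x"
df.sel[ , grepl("^x", colnames(df.sel))]
@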
\subsection{Summarising and splitting data frames}\label{sec:calc:df:split}\label{sec:calc:df:aggregate}
Function\index{data frames!summarising} \Rfunction{summary()} can be used to obtain a summary from objects of most \Rlang classes, including data frames. It is also possible to use \Rloop{sapply()}, \Rloop{lapply()} or \Rloop{vapply()} to apply any suitable function to data by columns (see section \ref{sec:data:apply} on page \pageref{sec:data:apply} for a description of these functions and their use).
<<data-frames-7aaa>>=
summary(df8)
@
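As a brief sketch of the second approach mentioned above, the chunk below applies a function to each column of a small data frame. The data frame \code{df.sm} is created here only for illustration, and all its columns are numeric so that the functions applied are valid for each of them.
<<data-frames-sk-colapply>>=
df.sm <- data.frame(x = 1:6, y = c(2.5, 3, 1, 4, 8, 0.5))
# one mean per column, returned as a named numeric vector
sapply(df.sm, mean)
# as above for the median, also checking that each value is a single number
vapply(df.sm, FUN = median, FUN.VALUE = numeric(1))
@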
\index{data frames!splitting}
\Rlang function \Rfunction{split()} makes it possible to split a data frame into a list of data frames, based on the levels of a factor, even if the rows are not ordered according to factor levels.
We create a data frame with six rows and three columns. In the case of column \code{z}, we rely on \Rlang to automatically extend \code{c("a", "b")} by repeating it three times so as to fill the six rows.
<<data-frames-7aa>>=
df10 <- data.frame(x1 = 1:6, x2 = c(1, 5, 4, 2, 6, 3), z = c("a", "b"))
@
<<data-frames-7a>>=
split(df10, df10$z)
@
\begin{explainbox}
The same operation can be specified using a one-sided formula \code{\textasciitilde z} to indicate the grouping.
<<data-frames-7c>>=
split(df10, ~ z)
@
\end{explainbox}
Function \Rfunction{unsplit()} can be used to reverse splitting done by \Rfunction{split()}.
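A minimal sketch of such a round trip is shown below, using a small data frame, \code{df.us}, created only for this illustration. The same grouping vector must be passed to both functions.
<<data-frames-sk-unsplit>>=
df.us <- data.frame(x = 1:6, g = c("a", "b"))
pieces <- split(df.us, df.us$g)        # list of two data frames
restored <- unsplit(pieces, df.us$g)   # rows back in their original order
restored
all(restored$x == df.us$x)
@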
\begin{explainbox}
\Rfunction{split()} is sometimes used in combination with apply functions (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}) to compute group or treatment summaries. However, in most cases it is simpler to use \Rfunction{aggregate()} for computing such summaries.
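The two approaches can be compared in a small sketch. The data frame \code{df.gs} is hypothetical and contains a single response \code{y} and a grouping variable \code{g}. The group means are the same, but they are returned as a named vector in one case and as a data frame in the other.
<<data-frames-sk-split-apply>>=
df.gs <- data.frame(y = c(1, 5, 4, 2, 6, 3), g = c("a", "b"))
# split the response by group and apply mean() to each piece
sapply(split(df.gs$y, df.gs$g), mean)
# the same summary computed with aggregate()
aggregate(y ~ g, data = df.gs, FUN = mean)
@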
\end{explainbox}
Related to splitting a data frame is the calculation of summaries based on a subset of cases, or more commonly summaries for all observations but after grouping them based on the values in a column or the levels of a factor.
\begin{faqbox}{How to summarise one variable from a data frame by group?}
To summarise a single variable by group, we can use \Rfunction{aggregate()}.
<<faq-aggregate-01>>=
aggregate(x = iris$Petal.Length,
by = list(iris$Species), FUN = mean)
@
\end{faqbox}
\begin{faqbox}{How to summarise numeric variables from a data frame by group?}
To summarise variables, we can use \Rfunction{aggregate()} (see section \ref{sec:dplyr:group:wise} on page \pageref{sec:dplyr:group:wise} for an alternative approach using package \pkgnameNI{dplyr}).
<<faq-aggregate-02>>=
aggregate(x = iris[ , sapply(iris, is.numeric)],
by = list(iris$Species), FUN = mean)
@
For these data, as the only non-numeric variable is \code{Species}, we could have also used formula notation as shown below.
\end{faqbox}
\begin{explainbox}
There\index{data frames!summarising} is also a formula-based \Rfunction{aggregate()} method (or ``variant'') available (\Rlang \emph{formulas} are described in depth in section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas}). In \Rfunction{aggregate()}, the left-hand side (\emph{lhs}) of the formula indicates the variable to summarise and its right-hand side (\emph{rhs}) the factor used to split or group the data before summarising them.
<<data-frames-7d>>=
aggregate(x1 ~ z, FUN = mean, data = df10)
@
We can summarise more than one column at a time.
<<data-frames-7e>>=
aggregate(cbind(x1, x2) ~ z, FUN = mean, data = df10)
@
If all the columns not used for grouping are valid input to the function passed as the argument to \code{FUN}, the formula can be simplified using a dot (\code{.}), meaning ``all columns except those on the \emph{rhs} of the formula''.
<<data-frames-7f>>=
aggregate(. ~ z, FUN = mean, data = df10)
@
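The same notation can be applied to the \code{iris} data set, in which \code{Species} is the only non-numeric variable. This sketch computes the mean of the four numeric variables for each species.
<<data-frames-sk-aggr-iris>>=
aggregate(. ~ Species, FUN = mean, data = iris)
@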
\end{explainbox}
Function \Rfunction{aggregate()} can also be used to aggregate time series data based on time intervals (see \code{help(aggregate)}).
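As a sketch of this use, the chunk below builds a made-up monthly time series, \code{monthly.ts}, covering two years, and aggregates it into quarterly totals. The series and its start date are assumptions chosen only for illustration.
<<data-frames-sk-aggr-ts>>=
# hypothetical monthly series, January 2020 to December 2021
monthly.ts <- ts(1:24, start = c(2020, 1), frequency = 12)
# totals for each quarter (four values per year)
aggregate(monthly.ts, nfrequency = 4, FUN = sum)
@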
\subsection{Re-arranging columns and rows}
\index{data frames!ordering rows}\index{data frames!ordering columns}
As with members of vectors and lists, to change the position of columns or rows in a data frame we use the extraction operator and indexing by name or position. In a matrix-like object, such as a data frame, the first index corresponds to rows and the second to columns.
The most direct way of changing the order of columns and/or rows in data frames (as for matrices and arrays) is to use subscripting. Once we know the original position and target position we can use column names or positions as indexes on the right-hand side, listing all columns to be retained, even those remaining at their original position.
<<data-frames-8>>=
df11 <- data.frame(A = 1:10, B = 3, C = c("A", "B"))
head(df11, 2)
df11 <- df11[ , c("B", "A", "C")]
head(df11, 2)
@
\begin{warningbox}
When using the extraction operator \Roperator{[ ]} on both the left- and right-hand-sides, with a \code{numeric} vector as an argument to swap two columns, the vectors or factors are swapped, while the names of the columns are not!
To retain the correspondence between column naming and column contents after swapping or rearranging the columns \emph{using numeric indices}, we need to separately move the names of the columns. This may seem counter-intuitive, unless we think in terms of positions being named rather than the contents of the columns being linked to the names.\qRfunction{colnames()}\qRfunction{colnames()<-}
<<data-frames-8ax>>=
df11 <- data.frame(A = 1:10, B = 3, C = c("A", "B"))
head(df11, 2)
df11[ , 1:2] <- df11[ , 2:1]
head(df11, 2)
colnames(df11)[1:2] <- colnames(df11)[2:1]
head(df11, 2)
@
\end{warningbox}
Taking into account that \Rfunction{order()} returns the indexes needed to sort a vector (see page \pageref{box:vec:sort}), we can use \Rfunction{order()} to generate the indexes needed to sort the rows of a data frame. In this case, the argument to \Rfunction{order()} is usually a column of the data frame being arranged. However, any vector of suitable length, including the result of applying a function to one or more columns, can be passed as an argument to \Rfunction{order()}. Function \Rfunction{order()} is not useful for sorting columns of data frames \emph{based on data from the columns} as it requires a vector across columns as input, which is possible only when all columns are of the same class. (In the case of \Rclass{matrix} and \Rclass{array} this approach can be applied to any of their dimensions as all their elements homogeneously belong to one class.)
\begin{faqbox}{How to order columns or rows in a data frame?}
We use column names or numeric indexes with the extraction operator \Roperator{[ ]} only on the \emph{rhs} of the assignment. For example, to arrange the columns of data set \code{iris} in decreasing alphabetical order, we use \Rfunction{sort()} as shown, or \Rfunction{order()} (see page \pageref{box:vec:sort}).
<<faq-data-frames-01>>=
sorted_cols_iris <- iris[ , sort(colnames(iris), decreasing = TRUE)]
head(sorted_cols_iris, 5)
@
Similarly, we can use values in a column as argument to \Rfunction{order()} to obtain the \code{numeric} indices to sort rows.
<<faq-data-frames-02>>=
sorted_rows_iris <- iris[order(iris$Petal.Length), ]
head(sorted_rows_iris, 5)
@
\end{faqbox}
\begin{advplayground}\index{data frames!ordering rows}
Create a new data frame containing three numeric columns with three different haphazard sequences of values and a factor with two levels. Call these columns \code{A}, \code{B}, \code{C} and \code{F}. 1) Sort the rows of the data frame so that the values in \code{A} are in decreasing order. 2) Sort the rows of the data frame according to increasing values of the sum of \code{A} and \code{B} without adding a new column to the data frame or storing the vector of sums in a variable. In other words, do the sorting based on sums calculated on-the-fly. 3) Sort the rows by level of factor \code{F}, and 4) by level of factor \code{F} and by values in \code{B} within each factor level. Hint: revisit the exercise on page \pageref{calc:ADVPG:order:sort} where the use of \Rfunction{order()} on factors is described.
\end{advplayground}
\subsection{Re-encoding or adding variables}
It is common that some variables need to be added to an existing data frame based on existing variables, either as a computed value or based on mapping, for example, treatments to sample codes already in a data frame. In the second case, named\index{named vectors!mapping with} vectors can be used to replace values in a variable or to add a variable to a data frame.
Mapping is possible because the length of the value returned by the extraction operator \Roperator{[ ]} is given by the length of the indexing vector (see section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}). Although we show toy-like examples, this approach is most useful with data frames containing many rows.
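The property that makes mapping possible can be seen in a minimal sketch with a two-member named vector. The names \code{codes} and \code{keys} are made up for this illustration. The value returned has as many members as the indexing vector, with repeated names in the index returning repeated values.
<<data-frames-sk-mapping>>=
codes <- c(a = "low", b = "high")    # hypothetical mapping
keys <- c("b", "a", "a", "b", "b")   # existing values used as index
codes[keys]
length(codes[keys]) == length(keys)
@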
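If the existing variable is a character vector, we need to create a named vector with the new values as data and the existing values as names. A factor should first be converted with \Rfunction{as.character()}, as a factor used as an index is treated as an \code{integer} vector of level codes rather than as names.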
<<data-frames-9>>=
df12 <-
data.frame(genotype = rep(c("WT", "mutant1", "mutant2"), 2),
value = c(1.5, 3.2, 4.5, 8.2, 7.4, 6.2))
mutant <- c(WT = FALSE, mutant1 = TRUE, mutant2 = TRUE)
df12$mutant <- mutant[df12$genotype]
df12
@
If the existing variable is an \code{integer} vector, we can use a vector without names, being careful that the positions in the \emph{mapping} vector match the values of the existing variable.
<<data-frames-10>>=
df13 <- data.frame(individual = rep(1:3, 2),
value = c(1.5, 3.2, 4.5, 8.2, 7.4, 6.2))
genotype <- c("WT", "mutant1", "mutant2")
df13$genotype <- genotype[df13$individual]
df13
@
\begin{advplayground}
Add a variable named \code{genotype} to the data frame below so that for individual \code{4} its value is \code{"WT"}, for individual \code{1} its value is \code{"mutant1"}, and for individual \code{2} its value is \code{"mutant2"}.
<<data-frames-11, eval=eval_playground>>=
DF1 <- data.frame(individual = rep(c(2, 4, 1), 2),
value = c(1.5, 3.2, 4.5, 8.2, 7.4, 6.2))
@
\end{advplayground}
\subsection{Operating within data frames}\label{sec:calc:df:with}
In the case of computing new values from existing variables, named vectors are of limited use. Instead, variables in a data frame can be added or modified with \Rlang functions \Rscoping{transform()}, \Rscoping{with()} and \Rscoping{within()}. These functions can be thought of as convenience functions as the same computations can be done using the extraction operators to access individual variables, in the lhs, rhs, or both lhs and rhs (see section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}).
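As a minimal sketch of the first of these functions, \Rscoping{transform()} takes new or modified columns as named arguments and returns a modified copy of the data frame. The data frame \code{df.tr} is created here only for illustration. Note that each expression can refer only to columns already present in the data frame, not to columns created in the same call.
<<data-frames-sk-transform>>=
df.tr <- data.frame(A = 1:10, B = 3)
# add two columns computed from existing ones
df.tr <- transform(df.tr, C = (A + B) / A, D = A * B)
head(df.tr, 3)
@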
In the case of \Rscoping{with()}, only one, possibly compound, code statement is affected, and this statement is passed as an argument. The value returned is the one returned by the statement passed as an argument or, in the case of compound statements, the value returned by the last contained simple code statement to be executed. Consequently, if the intent is to modify the container, the returned value must be assigned to an individual member variable (a column in this case), with the left-hand side of the assignment fully specified.
In this example, column \code{A} of \code{df14} takes precedence, and the returned value is the expected one.
<<data-frames-EB-12>>=
df14 <- data.frame(A = 1:10, B = 3)
df14$C <- with(df14, (A + B) / A) # add column
head(df14, 3)
@
In the case of \Rscoping{within()}, assignments in the argument to its second parameter affect the object returned, which is a modified copy of the container (in this case, a whole data frame), which still needs to be saved through assignment. Here we assign it to a new name, \code{df15}, so as not to overwrite the original data frame, but it could have been assigned back to the same name if the intention were to modify \code{df14}.
<<data-frames-EB-13>>=
df14$C <- NULL
df15 <- within(df14, C <- (A + B) / A) # modified copy
head(df15, 3)
@
In the example above, using \code{within()} instead of \Rscoping{with()} makes little difference to the amount of typing or clarity of the code, but with multiple member variables being operated upon, as shown below, using \Rscoping{within()} results in more concise and easier to understand code.
<<data-frames-EB-14>>=
df16 <- within(df14,
{C <- (A + B) / A
D <- A * B
E <- A / B + 1}
)
head(df16, 3)
@
\begin{explainbox}
Repeatedly pre-pending the name of a \emph{container}, such as a list or data frame, to the name of each member variable being accessed can make \Rlang code verbose and difficult to understand. Functions \Rscoping{attach()} and its matching \Rscoping{detach()} allow us to change where \Rlang first looks for the names of objects mentioned in a code statement. When using a long name for a data frame, entering a simple calculation can easily result in a difficult-to-read statement. Here even with a very short name for the data frame, the verbosity compared to the last chunk above is clear.
<<data-frames-EB-10>>=
df14$C <- (df14$A + df14$B) / df14$A
df14$D <- df14$A * df14$B
df14$E <- df14$A / df14$B + 1
head(df14, 3)
@
Using\index{data frames!attaching}\label{par:calc:attach} \Rscoping{attach()} we can alter where \Rlang looks up names and consequently simplify the statement. With \Rscoping{detach()} we can restore the original state. It is important to remember that here we can only simplify the right-hand side of the assignment, while the ``destination'' of the result of the computation still needs to be fully specified on the left-hand side of the assignment operator. We include below only one statement between \Rscoping{attach()} and \Rscoping{detach()}, but multiple statements are allowed. Furthermore, if variables with the same names as the columns exist earlier in the search path, such as in the global environment, these take precedence, something that can result in bugs or crashes. As seen below, a message warns that variable \code{A} from the global environment masks column \code{A} of the attached \code{df17}, and the value returned is, of course, not the desired one.
<<data-frames-EB-11a>>=
df17 <- data.frame(A = 1:10, B = 3)
A
attach(df17)
A
detach(df17)
A
@
<<data-frames-EB-11>>=
attach(df17)
df17$C <- (A + B) / A
detach(df17)
head(df17, 2)
@
Use of \Rscoping{attach()} and \Rscoping{detach()}, which work as a pair of ON and OFF switches, can result in an undesired after-effect on name lookup if the script terminates after \Rscoping{attach()} is executed but before \Rscoping{detach()} is called, as the attached object is not detached. In contrast, \Rscoping{with()} and \Rscoping{within()}, being self-contained, guarantee that cleanup takes place. Consequently, the usual recommendation is to give preference to the use of \Rscoping{with()} and \Rscoping{within()} over \Rscoping{attach()} and \Rscoping{detach()}.
\end{explainbox}
<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@
\section{Reshaping and Editing Data Frames}\label{sec:calc:reshape}
\index{data frames!long vs.\ wide shape}
As mentioned above, in most cases, the rows of \Rlang data frames represent measurement events or observations, possibly on multiple response variables and factors describing groupings, i.e., a ``long'' shape. However, when measurements are repeated in time, columns rather frequently represent observations of the same response variable at different times, i.e., a ``wide'' shape. Other cases exist where reshaping is needed. Function \Rfunction{reshape()} can convert wide data frames into long data frames and vice versa. See section \ref{sec:data:reshape} on page \pageref{sec:data:reshape} on package \pkgnameNI{tidyr} for an alternative approach to reshaping data with a friendlier user interface.
We start by creating a data frame of hypothetical data measured on two occasions. With these data, for example, if we wish to compute the growth of each subject as the difference in \code{Weight} and \code{Height} between the two time points, one approach is to reshape the data frame into a wider shape and subsequently subtract the columns.
<<data-frames-reshape-01>>=
# artificial data
df1 <- data.frame(id = rep(1:4, rep(2,4)),
Time = factor(rep(c("Before","After"), 4)),
Weight = rnorm(n = 8, mean = c(20.1, 30.8)),
Height = rnorm(n = 8, mean = c(9.5, 14.2)))
df1
# make it wider
df2 <- reshape(df1, timevar = "Time", idvar = "id", direction = "wide")
df2
# possible further calculation
within(df2,
{
Height.growth <- Height.After - Height.Before
Weight.growth <- Weight.After - Weight.Before
})
@
Alternatively, we may want to convert \code{df1} into a longer shape, with a single column with measurements, and a new column indicating whether the measured variable was \code{Height} or \code{Weight}. For this operation to succeed, we need to add a column with a unique value for each row in \code{df1}, and one easy way is to copy row names into a column. The names of the parameters of function \Rfunction{reshape()} are meaningful only when dealing with time series. Thus, reading the code below becomes rather difficult. It is also to be noted that the user is responsible for passing the values to \code{times} in the correct order.
<<data-frames-reshape-02>>=
df1$ID <- rownames(df1) # unique ID for each row
# make it longer
reshape(df1,
idvar = "ID",
timevar = "Quantity",
times = c("Weight", "Height"),
v.names = "Value",
direction = "long",
varying = c("Weight", "Height"))
@
To edit a data frame programmatically, one can use the approaches already discussed, using the extraction operators \Roperator{[ ]} or \Roperator{[[ ]]} on the \emph{lhs} of \Roperator{<-} to replace member elements. This in combination with functions like \Rfunction{gsub()} makes it possible to ``edit'' the contents of data frames.
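A minimal sketch of such programmatic ``editing'' is shown below. The data frame \code{df.ed}, its columns and the replacement rules are made up for this illustration. The first assignment rewrites a whole column of character strings, and the second replaces only the values that fulfil a condition.
<<data-frames-sk-edit>>=
df.ed <- data.frame(code = c("trt_01", "trt_02", "ctl_01"),
                    value = c(2.1, 3.4, 1.9))
# replace the underscore by a hyphen in every code
df.ed[["code"]] <- gsub("_", "-", df.ed[["code"]], fixed = TRUE)
# mark values larger than 3 as missing
df.ed[df.ed$value > 3, "value"] <- NA
df.ed
@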
Functions \Rfunction{View()}, \Rfunction{edit()} and \Rfunction{fix()} can be used interactively to display and edit \Rlang objects. When using \Rpgrm from within IDEs like \RStudio, calling these functions with a data frame as argument opens in most cases the IDE's own worksheet-like data editors, and for other types of objects a text editor pane. Output is not included for this chunk, as the use of these functions requires user interaction. Please run these examples in \Rpgrm and in an IDE like \RStudio.
<<exploring-dfs-0a, eval=FALSE>>=
View(cars)
edit(cars)
@
\begin{explainbox}
These functions can be used at the \Rlang console also when \Rpgrm is used on its own, but the editors activated are different ones. In any case, the use of scripts has made the interactive use of \Rpgrm at the console less frequent and the need to edit \Rlang objects previously saved in the user's current workspace nearly disappear. \Rfunction{View()}, \Rfunction{edit()} and \Rfunction{fix()} are unusual in that their definitions are dependent on system variables that at least when using \Rpgrm on its own, can be modified by the user.
\end{explainbox}
\index{data frames|)}
<<echo=FALSE,cache=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@
\section{Attributes of \Rlang Objects}\label{sec:calc:attributes}
\index{attributes|(}
\Rlang objects can have attributes. Attributes are named \emph{slots} normally used to store ancillary data, such as object properties; they function as additional fields in which to store information in any \Rlang object. There are no restrictions on the class of what is assigned to an attribute. They can be used to store metadata accompanying the data stored in an object, which is important for reproducible research and data sharing. They can be set and read by user code, and they are also used internally by \Rlang, among other things, to store the class an object belongs to, column and row names in data frames and matrices, and the labels of levels in factors. Although most \Rlang objects have attributes, they are rarely displayed explicitly when an object is printed, while the structure of objects as displayed by function \Rfunction{str()} includes them.
Although we rarely need to set or extract values stored in attributes explicitly, many of the features of \Rlang that we take for granted are implemented using attributes: column names in data frames are stored in an attribute, and matrices are vectors with additional attributes.
<<attributes-00>>=
df1 <- data.frame(x = 1:6, y = c("a", "b"), z = c(TRUE, FALSE, NA))
df1
attributes(df1)
str(df1)
@