
% !Rnw root = appendix.main.Rnw

<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'container-chunk')
@

\chapter{Base \Rlang: ``Collective Nouns''}\label{chap:R:collective}

\begin{VF}
The information that is available to the computer consists of a selected set of \emph{data} about the real world, namely, that set which is considered relevant to the problem at hand, that set from which it is believed that the desired results can be derived. The data represent an abstraction of reality\ldots

\VA{Niklaus Wirth}{\emph{Algorithms $+$ Data Structures $=$ Programs}, 1976}\nocite{Wirth1976}
\end{VF}

\section{Aims of This Chapter}

Data set organisation and storage is one of the keys to efficient data analysis. The central question is how to keep together all the information that belongs together, say, all measurements from an experiment and the corresponding metadata, such as treatments applied and/or dates. The title ``collective nouns'' is based on the idea that a data set is a collection of data objects.

In this chapter, you will become familiar with how data sets are usually managed in \Rlang. I use abstract examples to emphasise the general properties of data sets and of the \Rlang classes available for their storage, and a few more specific examples to illustrate their use in a more concrete way. While in chapter \ref{chap:R:as:calc} the focus was on atomic data types and on objects, like vectors, useful for the storage of collections of values of a given type, like numbers, in the present chapter the focus is on the storage within a single object of heterogeneous data, such as a combination of factors, and character and numeric vectors. Broadly speaking, the focus is on heterogeneous \emph{data containers}.

To describe the structure of \Rlang objects I use diagrams similar to those in the previous chapter.

\index{data sets!their storage|(}

\section{Data from Surveys and Experiments}
\index{data sets!origin}\index{data sets!characteristics}
The data we plot, summarise, and analyse in \Rlang, in most cases, originate from measurements done as part of experiments or surveys. Data collected mechanically from user interactions with websites or by crawling through internet content originate, from a statistical perspective, from surveys. The value of any data comes from knowing their origin, say treatments applied to plants, or the country from where website users connect; sometimes several properties are of interest to describe the origin of the data, and in other cases observations consist of measurements of multiple properties on each subject under study. Consequently, all software designed for data analysis implements ways of dealing with data sets as a whole, both during storage and when passing them as arguments to functions. A data set is a collection, usually heterogeneous, of data and related information.

In \Rlang, lists are the most flexible type of objects useful for storing whole data sets. In most cases, we do not need this much flexibility, so rectangular collections of observations are most frequently stored in a variation upon lists called data frames. These objects can have as their members the vectors and factors described in chapter \ref{chap:R:as:calc}.

Any \Rlang object can have attributes, allowing objects to carry along additional bits of information. Some, like comments, are part of \Rlang and are aimed at the storage of ancillary information or metadata by users. Other attributes are used internally by \Rlang. Finally, users can store arbitrary ancillary data using attributes created \emph{ad hoc}.

\section{Lists}\label{sec:calc:lists}
\index{lists|(}\qRclass{list}
In \Rlang, \Rclass{list} objects are in several respects similar to the vectors described in chapter \ref{chap:R:as:calc} but, unlike vectors, the members they contain can be heterogeneous, i.e., different members of the same list can belong to different classes. In addition, while the member elements of a vector must be \emph{atomic} values like numbers or character strings, any \Rlang object can be a list member, including other lists.

In \Rlang, the members of a list can be considered as following a sequence, and are accessible through numerical indexes, the same as the members of vectors. Members of a list, as well as members of a vector, can be named and retrieved (indexed) through their names. In practice, named lists are more frequently used than named vectors. \Rlang lists are created, or constructed, with function \Rfunction{list()}, similarly to how vectors are constructed with function \Rfunction{c()}.

\begin{explainbox}
  \Rlang lists can have as members not only objects storing data on observations and categories, but also function definitions, model formulas, unevaluated expressions, matrices, arrays, and objects of user-defined classes.
\end{explainbox}
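
As a minimal sketch of this flexibility, the chunk below builds a list whose members are a function, a model formula, a matrix, and an unevaluated expression. The object name \code{ex.mixed} and the member names are hypothetical, chosen only for this illustration.

<<lists-sketch-mixed>>=
# hypothetical list mixing different kinds of R objects
ex.mixed <- list(fun = mean,                  # a function definition
                 fm = y ~ x,                  # a model formula
                 mat = matrix(1:4, ncol = 2), # a matrix
                 expr = quote(a + b))         # an unevaluated expression
str(ex.mixed, max.level = 1)
# a function stored in a list can be extracted and called
ex.mixed[["fun"]](c(1, 2, 3))
@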

\begin{explainbox}
List and list-like objects are widely used in \Rlang because they make it possible to keep, for example, the data, instructions for operations, and results from operations together in a single \Rlang object that can be saved, copied, etc. as a unit. This avoids the proliferation of multiple disconnected objects with their interrelations being encoded only by their names, or even worse in separate notes or even in a person's memory---all approaches that are error-prone. Model fit functions described in chapter \ref{chap:R:statistics} are good examples of this approach. Objects used to store the instructions to build plots with multiple layers as described in chapter \ref{chap:R:plotting} are also good examples.
\end{explainbox}

Our first list has as its members three different vectors, each one belonging to a different class: \code{numeric}, \code{character} and \code{logical}. The three vectors also differ in their length: 3, 1, and 2, respectively.\qRfunction{list()}\qRfunction{names()}

<<lists-0>>=
lst1 <- list(x = 1:3, y = "ab", z = c(TRUE, FALSE))
@

<<lists-0a>>=
str(lst1)
names(lst1)
@

\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily, my shape/.style={
  rectangle split, rectangle split parts=#1, draw, anchor=north, minimum size=12mm},
array/.style={matrix of nodes,nodes={draw, minimum size=1mm, fill=black},column sep=2pt, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=1mm}}}]

\matrix[array] (array) {
1 & 2 & 3 \\
\rule{10mm}{.1pt} & \rule{10mm}{.1pt} & \rule{10mm}{.1pt}\\};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-3.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut lst1}};
\draw (array-2-1.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (nameh) {\rotatebox{90}{x\strut}};
\draw (array-2-2.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (namec) {\rotatebox{90}{y\strut}};
\draw (array-2-3.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-11.5mm, yshift=-2.7mm, above] (namew) {\rotatebox{90}{z\strut}};
%\draw (array-1-2.north)--++(90:3mm) node [above] (first) {Index};
\draw (array-1-3.east)--++(0:12.5mm) node [right]{\code{integer} positional indices};
\draw (array-2-3.east)--++(0:8mm) node [right]{\textsl{heterogeneous} class, \textsl{varying} length};
\draw (namew)--++(0:15mm) node [right]{\code{character} member names};
%
  \node [my shape=3, rectangle split, fill=blue!20] at (-1.3,-.25)
    {1\strut\nodepart{two}2\strut\nodepart{three}3\strut};
  \node [my shape=1, fill=red!20] at (0,-.25)
    {``ab''\strut};
  \node [my shape=2, fill=yellow!20] at (1.3,-.25)
    {TRUE\strut\nodepart{two}FALSE\strut};
%\draw (-0.6,+0.65) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut lst1}};
\end{tikzpicture}
\end{footnotesize}
\end{center}

\begin{warningbox}
  It is best to use informative names for accessing \code{list} members, as their members are heterogeneous, usually containing loosely related/connected data. Names make code easier to understand and mistakes more visible. Using names also makes code more robust to future changes in the position of list members in lists created upstream of our own \Rlang code. Below, we use both positional indices and names to highlight the similarities between lists and vectors.
\end{warningbox}

Lists can behave as vectors with heterogeneous elements as members, as we will describe next. Lists can also be nested, so tree-like structures are also possible (see section \ref{sec:calc:lists:nested} on page \pageref{sec:calc:lists:nested}).

%{ \tikzstyle{every node}=[draw=black,thick,anchor=west,fill=blue!10]
% \tikzstyle{root}=[dashed,fill=gray!50]
%\sffamily
%\centering
%\footnotesize
%\begin{tikzpicture}[%
%  grow via three points={one child at (0.5,-0.55) and
%  two children at (0.5,-0.55) and (0.5,-1.1)},
%  edge from parent path={(\tikzparentnode.south) |- (\tikzchildnode.west)}]
%  \node [root] {lst1}
%      child { node {\$ x: int [1:6] 1 2 3 4 5 6}}
%      child { node {\$ y: chr "a"}}
%      child { node {\$ z: logi [1:2] TRUE FALSE}};
%\end{tikzpicture}
%}

\begin{faqbox}{How to create an empty list?}
  In the same way as \code{numeric()} by default creates a \code{numeric} vector of length zero, \Rfunction{list()} by default creates a \code{list} object with no members.

<<list-empty-faq>>=
list()
@
\end{faqbox}

\subsection{Member extraction, deletion and insertion}

In\index{lists!member extraction|(}\index{lists!member indexing|see{lists, member extraction}}\index{lists!deletion and addition of members|(} section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}, we saw that the extraction operator \Roperator{[ ]}, applied to a vector, returns a vector, longer or shorter, possibly of length one, or even length zero. Similarly, applying operator \Roperator{[ ]} to a list returns a list, possibly of different length: \code{lst1["x"]} or \code{lst1[1]} return a list containing only one member, the numeric vector stored at the first position of \code{lst1}. In the last statement in the chunk below, \code{lst1[c(1, 3)]} returns a list of length two, as expected.

<<lists-1a>>=
lst1["x"]
lst1[1]
lst1[c(1, 3)]
@

As with vectors, negative positional indices remove members instead of extracting them. See page \pageref{par:calc:lists:rm} for a safer approach to the deletion of list members.

<<lists-1ay>>=
lst1[-1]
lst1[c(-1, -3)]
@

Using operator \Roperator{[[ ]]} (double square brackets) for indexing a list extracts the element stored in the list, in its original mode. In the example below, \code{lst1[["x"]]} and \code{lst1[[1]]} return a numeric vector. We might say that extraction operator \Roperator{[[ ]]} reaches ``deeper'' into the list than operator \Roperator{[ ]}. Operator \Roperator{\$}, used in the first statement below, provides a shorthand notation, equivalent to calling \Roperator{[[ ]]} with a single constant \code{character} value as argument.

<<lists-1>>=
lst1$x
lst1[["x"]]
lst1[[1]]
@

\begin{explainbox}\label{box:extraction:opers}
We mentioned above that indexing by name can be done either with double square brackets, \Roperator{[[ ]]}, or with \Roperator{\$}. Operators \Roperator{[ ]} and \Roperator{[[ ]]} work like normal \Rlang functions, accepting both constant values and variables as indexing arguments. In contrast, \Roperator{\$}, mainly intended for use when typing at the console, accepts only bare member names on its \emph{rhs}. With \Roperator{[[ ]]}, the name of the variable or column is given as a character string, enclosed in quotation marks, or as a variable with mode \code{character}. A number as a positional index is also accepted.

<<index-partial-1>>=
lst1a <- list(abcd = 123, xyzw = 789)
lst1a[[1]]
lst1a[["abcd"]]
vct1 <- "abcd"
lst1a[[vct1]]
@

When using \Roperator{\$}, the name is entered as a constant, without quotation marks, and cannot be a variable or a number.

<<index-partial-1a>>=
lst1a$abcd
lst1a$ab
lst1a$a
@

Both in the case of lists and data frames (see section \ref{sec:R:data:frames} on page \pageref{sec:R:data:frames}), when using double square brackets, by default an exact match is required between the name in the object and the name used for indexing. In contrast, with \Roperator{\$}, an unambiguous partial match is silently accepted. For interactive use, partial matching reduces the amount of text typed at the console. However, in scripts, and especially in \Rlang code in packages, it is best to avoid relying on \Roperator{\$}: a partial match that is unambiguous today can silently point to a different member at a later time, e.g., after someone else revises the script and adds or renames members. Such misdirected partial matching can lead to difficult-to-diagnose errors.

In addition, as \Roperator{\$} is implemented by first attempting a match to the name and then calling \Roperator{[[ ]]}, using \Roperator{\$} for indexing can result in slightly slower performance compared to using \Roperator{[[ ]]}. It is possible to set \Rlang option \code{warnPartialMatchDollar} so that partial matching triggers a warning when using \Roperator{\$} to extract a member, which can be very useful when debugging.
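
The difference in matching between the two operators can be sketched as follows. The list \code{ex.pm} is a hypothetical example; parameter \code{exact} of \Roperator{[[ ]]} and option \code{warnPartialMatchDollar} are part of base \Rlang. The option is restored to its previous value at the end of the chunk, so that later examples are not affected.

<<lists-sketch-partial-match>>=
ex.pm <- list(abcd = 123, xyzw = 789) # hypothetical example list
ex.pm[["ab"]]                 # exact matching is the default: no match
ex.pm[["ab", exact = FALSE]]  # partial matching explicitly requested
# make partial matching by $ visible through a warning
old.opts <- options(warnPartialMatchDollar = TRUE)
ex.pm$ab
options(old.opts)             # restore the previous setting
@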
\end{explainbox}

<<lists-1az>>=
is.vector(lst1[1])
is.list(lst1[1])
is.vector(lst1[[1]])
is.list(lst1[[1]])
@

The two extraction operators can be used together as shown below, with \code{lst1[[1]]} extracting the vector from \code{lst1} and \code{[3]} extracting the member at position 3 of the vector.

<<lists-1ax>>=
lst1[[1]][3]
@

Extraction\label{par:calc:list:member:assign} operators can be used on the \emph{lhs} as well as on the \emph{rhs} of an assignment, and lists can be empty, i.e., be of length zero. The example below makes use of this to build a list step by step.

<<lists-pg-01, eval=eval_playground>>=
lst2 <- list()
lst2[["x"]] <- 1:3
lst2[["y"]] <- "ab"
lst2[["z"]] <- c(TRUE, FALSE)
@

\begin{playground}
Compare \code{lst2} to \code{lst1}, used for the examples above. Then run the code below and compare them again. Try to understand why \code{lst2} has changed as it did. Pay also attention to possible changes to the members' names.

<<lists-pg-02, eval=eval_playground>>=
lst2[["y"]] <- lst2[["x"]]
@
\end{playground}

\begin{explainbox}
\emph{Lists}, as usually defined in languages like \Clang, are based on pointers to memory locations, with pointers stored at each node. These pointers chain or link the different member nodes (this allows, for example, sorting of lists in place by modifying the pointers). In such implementations, indexing by position is not possible, or at least requires ``walking'' down the list, node by node. \Rlang does not implement pointers to ``addresses'', or locations, in memory. In \Rlang, \code{list} members can be accessed through positional indexes or member names, similarly to vector members. Of course, as with vectors, insertions and deletions in the middle of a list shift the position of members, and change which member is pointed at by indexes for positions past the modified location. The names, in contrast, remain valid.

<<lists-eb-xx>>=
list(a = 1, b = 2, c = 3)[-2]
@
\end{explainbox}

Three frequent operations on lists are concatenation, insertions, and deletions.\index{lists!insert into}\index{lists!append to} The same functions as with vectors are used: \Rfunction{c()}, to concatenate, and \Rfunction{append()}, to append and insert. Lists can be combined only with other lists, otherwise, these operations work as with vectors (see pages \pageref{par:calc:concatenate}--\pageref{par:calc:append:end}).

<<lists-1b>>=
lst3 <- append(lst1, list(yy = 1:10, zz = letters[5:1]), after = 2)
lst3
@
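
Concatenation with \Rfunction{c()} is not shown above, so a minimal sketch follows. The lists \code{ex.a} and \code{ex.b} are hypothetical. Wrapping the second list in a call to \Rfunction{list()} adds it as a single nested member instead of concatenating its members.

<<lists-sketch-concat>>=
ex.a <- list(a = 1, b = "two") # hypothetical example lists
ex.b <- list(c = TRUE)
ex.ab <- c(ex.a, ex.b)         # members of both lists, in sequence
str(ex.ab)
ex.nest <- c(ex.a, list(ex.b)) # ex.b added whole, as a nested list
str(ex.nest)
@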

To\label{par:calc:lists:rm} delete a member from a list, we assign \code{NULL} to it.

<<lists-1c>>=
lst1$y <- NULL
lst1
@
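
Because assigning \code{NULL} deletes a member, storing \code{NULL} as the value of a member needs a different idiom. The sketch below, using a hypothetical list \code{ex.del}, contrasts the two cases: assigning \code{list(NULL)} through the single-bracket operator keeps the member.

<<lists-sketch-delete>>=
ex.del <- list(a = 1, b = 2, c = 3) # hypothetical example list
ex.del[["b"]] <- NULL      # deletes member b
ex.del["c"] <- list(NULL)  # keeps member c, with NULL as its value
str(ex.del)
@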

To investigate the members contained in a list, function \Rfunction{str()} (\emph{structure}), used above, is convenient, especially when lists have many members. Function \Rfunction{str()} formats lists more compactly than \code{print()} applied directly to a list.\label{par:calc:str}

<<lists-1aa>>=
print(lst1)
str(lst1)
@

\index{lists!deletion and addition of members|)}\index{lists!member extraction|)}

\subsection{Nested lists}\label{sec:calc:lists:nested}

Lists\index{lists!nested} can be nested, i.e., lists of lists can be constructed to an arbitrary depth. In the example below, \code{lst4} and \code{lst5} are members of \code{lst6}, i.e., \code{lst4} and \code{lst5} are nested within \code{lst6}.

<<lists-2>>=
lst4 <- list("a", "aa", 10)
lst5 <- list("b", TRUE)
lst6 <- list(A = lst4, B = lst5) # nested
str(lst6)
@

A nested\index{lists!nested} list can alternatively be constructed within a single statement in which several member lists are created. Here we combine the first three statements in the earlier chunk into a single one.

<<lists-3>>=
lst7 <- list(A = list("a", "aa", 10), B = list("b", TRUE))
str(lst7)
@

A list can contain a combination of \code{list} and \code{vector} members.

<<lists-3s>>=
lst8 <- list(A = list("a", "aa", 10),
             B = list("b", TRUE),
             C = c(1, 3, 9),
             D = 4321)
str(lst8)
@

\begin{explainbox}
The logic behind the extraction of members of nested lists using indexing is the same as for simple lists, but applied recursively. For example, \code{lst7[[2]]} extracts the second member of the outermost list, which is another list. As this is a list, its members can be extracted by applying the extraction operator again: \code{lst7[[2]][[1]]}. It is important to remember that these concatenated extraction operations are written so that the leftmost operator is applied to the outermost list.

The example above uses the \Roperator{[[ ]]} operator, but the left-to-right precedence also applies to concatenated calls to \Roperator{[ ]} and to calls combining both operators.
\end{explainbox}
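
The same logic applies when members are accessed by name. The chunk below is a small sketch using a hypothetical nested list \code{ex.nested}. The third statement passes a vector of indexes to \Roperator{[[ ]]}, which base \Rlang interprets recursively, as an alternative way of writing concatenated extractions.

<<lists-sketch-nested-extract>>=
ex.nested <- list(A = list("a", "aa", 10),
                  B = list(b1 = "b", b2 = TRUE)) # hypothetical
ex.nested[["B"]][["b2"]] # by name, outermost list first
ex.nested$B$b2           # shorthand, for interactive use
ex.nested[[c(2, 2)]]     # recursive indexing by position
ex.nested[["B"]]["b2"]   # single brackets last: a list of length one
@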

\begin{playground}
What\index{lists!nested} do you expect each of the statements below to return? \emph{Before running the code}, predict what value and of which mode each statement will return. You may use implicit or explicit calls to \Rfunction{print()}, or calls to \Rfunction{str()} to visualise the structure of the different objects.

% not handled correctly by knitr, works at console.
<<lists-PG4, eval=eval_playground>>=
LST9 <- list(A = list("a", "aa", "aaa"), B = list("b", "bb"))
# str(LST9)
LST9[2:1]
LST9[1]
LST9[[1]][2]
LST9[[1]][[2]]
LST9[2]
LST9[2][[1]]
@

\end{playground}

\begin{explainbox}\index{lists!structure}
When dealing with deep lists, it is sometimes useful to limit the number of levels of nesting returned by \Rfunction{str()} by passing a \code{numeric} argument to parameter \code{max.level}.

<<lists-EB1b>>=
str(lst8, max.level = 1)
@

\end{explainbox}

Sometimes we need to flatten a list\index{lists!flattening}\index{lists!nested}, or a nested structure of lists within lists. Function \Rfunction{unlist()} is what should be normally used in such cases.

The list \code{lst10} is a nested system of lists, but all the ``terminal'' members are character strings. In other words, terminal nodes are all of the same \code{mode}, allowing the list to be ``flattened'' into a character vector.\qRfunction{is.list()}

<<lists-5>>=
lst10 <- list(A = list("a", "aa", "aaa"), B = list("b", "bb"))
vct1 <- unlist(lst10)
vct1
is.list(lst10)
is.list(vct1)
mode(lst10)
mode(vct1)
names(lst10)
names(vct1)
@

The returned value is a vector with named member elements. We use function \Rfunction{str()} to figure out how this vector relates to the original list. The names, always of mode character, are based on the names of list elements when available, while characters depicting positions as numbers are used for anonymous nodes. We can access the members of the vector either through numeric indexes or names.

<<lists-6>>=
str(vct1)
vct1[2]
vct1["A2"]
@

\begin{playground}
Function \Rfunction{unlist()}\index{lists!convert into vector} has two additional parameters, with default argument values, which we did not modify in the example above. These parameters are \code{recursive} and \code{use.names}, both of them expecting a \code{logical} value as an argument. Modify the statement \code{vct1 <- unlist(lst10)}, by passing \code{FALSE} as an argument to these two parameters, in turn, and in each case, study the value returned and how it differs from the one obtained above.
\end{playground}

Function \Rfunction{unname()} can be used to remove names safely---i.e., without risk of altering the mode or class of the object.

<<lists-7>>=
unname(vct1)
unname(lst10)
@
\index{lists|)}

<<lists-cleanup, include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Data Frames}\label{sec:R:data:frames}
\index{data frames|(}\qRclass{data.frame}
\index{worksheet@`worksheet'|see{data frame}}
Data frames are a special type of list, in which all members have the same length, giving origin to a matrix-like object, in which columns can belong to different classes. Most commonly the member ``columns'' are vectors or factors, but they can also be matrices with the same number of rows as the enclosing data frame, or lists with the same number of members as rows in the enclosing data frame.

Data frames are central to most data manipulation and analysis procedures in \Rlang. They are commonly used to store observations, with \code{numeric} columns holding data for continuous variables and \code{factor} columns data for categorical variables. Binary variables can be stored in \code{logical} columns. Text data can be stored in \code{character} columns. Date and time can be stored in columns of specific classes, such as \code{POSIXct}. In the diagram below, column \code{treatment} is a factor with two levels encoding two conditions, \code{hot} and \code{cold}. Columns \code{height} and \code{weight} are numeric vectors containing measurements.

\begin{center}
\begin{footnotesize}
\begin{tikzpicture}[font=\sffamily, my shape/.style={
  rectangle split, rectangle split parts=#1, draw, anchor=north, minimum size=12mm},
array/.style={matrix of nodes,nodes={draw, minimum size=1mm, fill=black},column sep=2pt, row sep=0.5mm, nodes in empty cells,
row 1/.style={nodes={draw=none, fill=none, minimum size=1mm}}}]

\matrix[array] (array) {
1 & 2 & 3 \\
\rule{10mm}{.1pt} & \rule{10mm}{.1pt} & \rule{10mm}{.1pt}\\};

\begin{scope}[on background layer]
\fill[blue!10] (array-1-1.north west) rectangle (array-1-3.south east);
\end{scope}

\draw (array-2-1.west) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut df1}};
\draw (array-2-1.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-17mm, yshift=-3mm, above] (nameh) {\rotatebox{180}{treatment\strut}};
\draw (array-2-2.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-14.4mm, yshift=-3mm, above] (namec) {\rotatebox{180}{height\strut}};
\draw (array-2-3.north) node [signal, draw, fill=codeshadecolor, minimum size=5mm, rotate=-90, xshift=-14.5mm, yshift=-3mm, above] (namew) {\rotatebox{180}{weight\strut}};
%\draw (array-1-2.north)--++(90:3mm) node [above] (first) {Index};
\draw (array-1-3.east)--++(0:12.5mm) node [right]{\code{integer} positional indices};
\draw (array-2-3.east)--++(0:8mm) node [right]{\textsl{heterogeneous} class, \textsl{same} length};
\draw (namew)--++(0:15mm) node [right]{\code{character} column names};
%
  \node [my shape=4, rectangle split, fill=green!20] at (-1.3,-.25)
    {hot\strut\nodepart{two}cold\strut\nodepart{three}hot\strut\nodepart{four}\ldots\strut};
  \node [my shape=4, fill=blue!20] at (0,-.25)
    {10.2\strut\nodepart{two}\phantom{1}8.3\strut\nodepart{three}12.0\strut\nodepart{four}\ldots\strut};
  \node [my shape=4, fill=blue!20] at (1.3,-.25)
    {2.2\strut\nodepart{two}3.3\strut\nodepart{three}2.5\strut\nodepart{four}\ldots\strut};
%\draw (-0.6,+0.65) node [signal, draw, fill=codeshadecolor, minimum size=6mm, line width=1.5pt, left] (first) {\code{\strut a.list}};
\end{tikzpicture}
\end{footnotesize}
\end{center}

Data frames are created with constructor function \Rfunction{data.frame()} with a syntax similar to that used for lists.\qRfunction{colnames()}\qRfunction{rownames()}\qRfunction{is.data.frame()}

<<data-frames-0>>=
df1 <- data.frame(treatment = factor(rep(c("hot", "cold"), 3)),
                  height = c(10.2, 8.3, 12.0, 9.0, 11.2, 8.7),
                  weight = c(2.2, 3.3, 2.5, 2.8, 2.4, 3.0))
df1
colnames(df1)
rownames(df1)
str(df1)
class(df1)
mode(df1)
is.data.frame(df1)
is.list(df1)
@

We can see above that when printed each row of a \code{data.frame} is preceded by a row name. Row names are character strings, just like column names. The \Rfunction{data.frame()} constructor adds by default row names representing running numbers. Default row names are rarely of much use, except to track insertions and deletions of rows during debugging.

\begin{playground}
As the expectation is that all member variables (or ``columns'') have equal length, if vectors of different lengths are supplied as arguments, the shorter vector(s) is/are recycled, possibly several times, until the required full length is reached, as shown below for \code{treatment}.

<<data-frames-0a>>=
df2 <- data.frame(treatment = factor(c("hot", "cold")),
                  height = c(10.2, 8.3, 12.0, 9.0, 11.2, 8.7),
                  weight = c(2.2, 3.3, 2.5, 2.8, 2.4, 3.0))
@

Are \code{df1} created above and \code{df2} created here equal?

\end{playground}

With function \Rfunction{class()} we can query the class of an \Rlang object (see section \ref{sec:rlang:mode} on page \pageref{sec:rlang:mode}). As we saw in the previous chunk, \code{list} and \code{data.frame} objects belong to two different classes. However, their \code{mode} is the same. Consequently, data frames inherit the methods and characteristics of lists, as long as they have not been hidden by new ones defined for data frames (for an explanation of \emph{methods}, see section \ref{sec:methods} on page \pageref{sec:methods}).
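
As a brief sketch of this list-like behaviour, functions that work on lists treat each column of a data frame as one member. The data frame \code{ex.df} is a hypothetical example; all the functions called are part of base \Rlang.

<<data-frames-sketch-aslist>>=
ex.df <- data.frame(treatment = factor(c("hot", "cold", "hot")),
                    height = c(10.2, 8.3, 12.0)) # hypothetical
length(ex.df)           # number of members, i.e., columns
names(ex.df)            # member names are the column names
sapply(ex.df, class)    # applied to each column in turn
is.list(ex.df)          # TRUE, as the mode is list
inherits(ex.df, "list") # FALSE, as the class is data.frame
@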

Extraction of individual member variables or ``columns'' can be done as in a list, with operators \Roperator{[[ ]]} and \Roperator{\$} (see the call-out on page \pageref{box:extraction:opers}).

<<data-frames-1>>=
df1$height
df1[["height"]]
df1[[2]]
class(df1[["height"]])
@

In the same way as with lists, we can add member variables to data frames. Recycling takes place if needed.

<<data-frames-2>>=
df1$x2 <- 6:1
df1[["x3"]] <- "b"
str(df1)
@

\begin{playground}
We have added two columns to the data frame, and in the case of column \code{x3} recycling took place. This is where lists and data frames differ substantially in their behaviour. In a data frame, although class and mode can be different for different member variables (columns), they are required to be vectors or factors of the same length (or a matrix with the same number of rows, or a list with the same number of members). In the case of lists, there is no such requirement, and recycling never takes place when adding a member. Compare the values returned below for \code{LST1}, to those in the example above for \code{df1}.

<<data-frames-2a>>=
LST1 <- list(x = 1:6, y = "a", z = c(TRUE, FALSE))
str(LST1)
LST1$x2 <- 6:1
LST1$x3 <- "b"
str(LST1)
@
\end{playground}

\begin{faqbox}{How to create an empty data frame?}
In the same way as \code{numeric()} creates a \code{numeric} vector of length zero, \Rfunction{data.frame()} by default creates a \code{data.frame} with zero rows and no columns.

<<data-frame-empty-faq>>=
data.frame()
@
\end{faqbox}

\begin{faqbox}{How to make a list of data frames?}
We create a list of data frames in the same way as we create a nested list of lists, or in fact of a list of any other \Rlang objects. See section \ref{sec:calc:lists:nested} on page \pageref{sec:calc:lists:nested}.

<<data-frame-listof-faq>>=
list(df1, df2)
@
\end{faqbox}

\begin{faqbox}{How to add a new column to a data frame (to the front and end)?}
In the same way as we can assign a new member to a list using the extraction operator \Roperator{[[ ]]}, we can add a new column to a data frame (see page \pageref{par:calc:list:member:assign}). In this case, if the column name does not already exist, the assigned vector or factor is appended as the last column (no recycling applied to short vectors or factors unless of length one).

<<data-frame-add-co1l-faq>>=
DF1 <- data.frame(A = 1:5, B = factor(5:1))
DF1[["C"]] <- 11:15
DF1
@

To add a column at the front, we can use function \Rfunction{cbind()} (column bind).

<<data-frame-add-col2-faq>>=
DF2 <- data.frame(A = 1:5, B = factor(5:1))
cbind(C = 11:15, DF2)
@
\end{faqbox}

Being two-dimensional and rectangular in shape, data frames, in relation to indexing and dimensions, behave similarly to a matrix. They have two margins, rows, and columns, and, thus, two indices are used to indicate the location of a member ``cell''. We provide some examples here, but please consult section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing} and section \ref{sec:matrix:array} on page \pageref{sec:matrix:array} for additional details.
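
The two margins can be queried with the same base \Rlang functions used for matrices. The chunk below is a minimal sketch using a hypothetical data frame \code{ex.dim}.

<<data-frames-sketch-dim>>=
ex.dim <- data.frame(a = 1:4,
                     b = letters[1:4],
                     c = c(TRUE, FALSE, TRUE, FALSE)) # hypothetical
dim(ex.dim)     # number of rows, then number of columns
nrow(ex.dim)
ncol(ex.dim)
ex.dim[3, "b"]  # one "cell": row index first, column index second
@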

Matrix-like notation allows simultaneous extraction from multiple columns, which is not possible with lists. The value returned is in most cases a ``smaller'' data frame as in this example.

<<data-frames-bx-03>>=
df1[2:3, 1:2]
@

<<data-frames-3>>=
# first column, df1[[1]] preferred
df1[ , 1]
# first column, df1[["treatment"]] or df1$treatment preferred
df1[ , "treatment"]
# first row
df1[1, ]
# first two rows of the third and fourth columns
df1[1:2, c(FALSE, FALSE, TRUE, TRUE, FALSE)]
# the rows for which comparison is true
df1[df1$treatment == "hot" , ]
# the heights > 8
df1[df1$height > 8, "height"]
@

As explained earlier for vectors (see section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}), indexing can be present both on the right- and left-hand sides of an assignment, allowing the replacement of both individual values and rectangular regions.

The next few examples do assignments to ``cells'' of \code{df1}, either to one whole column, or individual values. The last statement in the chunk below copies a number from one location to another by using indexing of the same data frame both on the right side and left side of the assignment.\qRoperator{[[ ]]}\qRoperator{[ ]}

<<data-frames-3a>>=
df1[1, 2] <- 99
df1
df1[ , 2] <- -99
df1
df1[["height"]] <- c(10, 12)
df1
df1[1, 2] <- df1[6, 3]
df1
df1[3:6, 2] <- df1[6, 3]
df1
@
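
Replacement of a rectangular region spanning several rows and columns follows the same pattern. This is a minimal sketch with a hypothetical data frame \code{ex.rect}; the single value on the right-hand side is recycled to fill the whole region.

<<data-frames-sketch-rect-assign>>=
ex.rect <- data.frame(a = 1:4, b = 5:8, c = 9:12) # hypothetical
# rows 1 and 2 of columns b and c are replaced in one statement
ex.rect[1:2, c("b", "c")] <- 0
ex.rect
@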

As with matrices, if we extract a single column from a data frame using matrix-like indexing, it is by default simplified into a vector or factor, i.e., the column dimension is dropped. By passing \code{drop = FALSE}, we can prevent this. Unlike with matrices, rows are not simplified in the case of data frames.

<<data-frames-2b>>=
is.data.frame(df1[1, ])
is.data.frame(df1[ , 2])
is.data.frame(df1[ , "treatment"])
is.data.frame(df1[1:2, 2:3])
is.vector(df1[1, ])
is.vector(df1[ , 2])
is.factor(df1[ , "treatment"])
is.vector(df1[1:2, 2:3])
@

<<data-frames-2bb>>=
is.data.frame(df1[ , 1, drop = FALSE])
is.data.frame(df1[ , "treatment", drop = FALSE])
@

\begin{warningbox}
In contrast to matrices and data frames, the extraction operator \Roperator{[ ]} of tibbles---defined in package \pkgname{tibble}---never simplifies returned one-column tibbles into vectors (see section \ref{sec:data:tibble} on page \pageref{sec:data:tibble} for details on the differences between data frames and tibbles).
\end{warningbox}

Usually data frames are created from lists or by passing individual vectors and factors to the constructors. It is also possible to construct data frames starting from matrices, other data frames and named vectors, in which case, the identity function \Rfunction{I()} can be used to protect them from interpretation by the \Rfunction{data.frame()} constructor. In these cases, additional nuances become important. The details are well described in \code{help(data.frame)}.

With a named numeric vector and a factor as arguments, the names are moved from the vector to the rows of the data frame!

<<data-frames-bx-constr-01>>=
vct1 <- c(one = 1, two = 2, three = 3, four = 4)
fct1 <- as.factor(c(1, 2, 3, 2))
df1 <- data.frame(fct1, vct1)
df1
df1$vct1
@

If the vector is protected with \Rlang's identity function \Rfunction{I()}, the names are not moved, as can be seen by extracting the column \code{vct1} from data frame \code{df2}.

<<data-frames-bx-constr-02>>=
df2 <- data.frame(fct1, I(vct1))
df2
df2$vct1
@

\begin{explainbox}
With a matrix instead of a vector, the matrix is split into separate columns in the data frame. If the matrix has no column names, new ones are created.

<<data-frames-bx-constr-04>>=
mat1 <- matrix(1:12, ncol = 3)
df4 <- data.frame(fct1, mat1)
@

<<data-frames-bx-constr-04a>>=
df4
@

If the matrix is protected with function \Rfunction{I()}, it is not split, and the whole matrix becomes a column in the data frame.

<<data-frames-bx-constr-05>>=
df5 <- data.frame(fct1, I(mat1))
df5
df5$mat1
@

\end{explainbox}

\begin{explainbox}
With a list whose members are vectors, each member of the list becomes a column in the data frame. Members that are too short are recycled.

<<data-frames-bx-constr-06>>=
lst1 <- list(a = 4:1, b = letters[4:1], c = "n", d = "z")
df6 <- data.frame(fct1, lst1)
df6
@

If the list is protected with \Rfunction{I()}, the list is added whole as a variable or column in the data frame. In this case, the length of the list must match the number of rows in the data frame, while the length and class of the individual members of the list can vary. The names of the list members are used to set the \code{rownames} of the data frame.
This is similar to the default behaviour of tibbles, while \Rlang data frames require explicit use of \Rfunction{I()} for lists not to be split (see chapter \ref{chap:R:data} on page \pageref{chap:R:data} for details about package \pkgname{tibble}).

<<data-frames-bx-constr-07>>=
df7 <- data.frame(fct1, I(lst1))
df7
@
<<data-frames-bx-constr-07b>>=
df7$lst1
@

\end{explainbox}

\begin{advplayground}
What do we gain using \Rfunction{I()}? Check the documentation carefully and think of uses where the flexibility gained by the option to protect or not the arguments passed to the \Rfunction{data.frame()} constructor can be useful. In addition, write \Rlang statements to extract individual members of embedded matrices or lists using indexing. Finally, test if the behaviour of \Rfunction{I()} is the same when assigning new member variables (or ``columns'') to an existing data frame.
\end{advplayground}

\subsection{Sub-setting data frames}\label{sec:calc:df:subset}
When\index{data frames!subsetting}\index{data frames!``filtering rows''} the names of data frames are long, complex conditions become awkward to write using indexing---i.e., subscripts. In such cases, \Rfunction{subset()} is handy because it evaluates the condition with the data frame as the ``environment'', i.e., the names of the columns are recognised if entered directly when writing the condition. Function  \Rfunction{subset()} ``filters'' rows, usually corresponding to observations or experimental units. The condition is computed for each row, and if it returns \code{TRUE}, the row is included in the returned data frame, and excluded if \code{FALSE}.

We create a data frame with six rows and three columns. For column \code{y}, we rely on \Rlang automatically extending \code{"a"} by repeating it six times, while for column \code{z}, we rely on \Rlang automatically extending \code{c(TRUE, FALSE)} by repeating it three times.

<<data-frames-4>>=
df8 <- data.frame(x = 1:6, y = "a", z = c(TRUE, FALSE))
subset(df8, x > 3)
@

\begin{advplayground}
What is the behaviour of \code{subset()} when the condition is \code{NA}? Find the answer by writing code to test this, for a case where tests for different rows return \code{NA}, \code{TRUE} and \code{FALSE}.
\end{advplayground}

When calling functions that return a vector, data frame, or other structure, the extraction operators \Roperator{[ ]}, \Roperator{[[ ]]}, or \Roperator{\$} can be appended to the rightmost parenthesis of the function call, in the same way as to the name of a variable holding the same data.

<<data-frames-5>>=
subset(df8, x > 3)[ , -3]
subset(df8, x > 3)[ , "x", drop = FALSE]
subset(df8, x > 3)[ , "x"]
@

\begin{advplayground}
When do extraction operators applied to data frames return a vector or factor, and when do they return a data frame? Please, experiment with your own code examples to work out the answer.
\end{advplayground}

\begin{explainbox}
In the case of \Rfunction{subset()}, we can select columns directly as shown below, while for most other functions, extraction using operators \Roperator{[ ]}, \Roperator{[[ ]]}, or \Roperator{\$} is needed.

<<data-frames-5aa>>=
subset(df8, x > 3, select = 2)
@

<<data-frames-5ab>>=
subset(df8, x > 3, select = x)
@

<<data-frames-5ac>>=
subset(df8, x > 3, select = "x")
@
\end{explainbox}

None of the examples in the last four code chunks alters the original data frame \code{df8}. We can store the returned value using a new name if we want to preserve \code{df8} unchanged, or we can assign the result to \code{df8}, deleting in the process the previously stored value.

\begin{warningbox}
In the examples above, the names in the expression passed as the second argument to \code{subset()} were searched within \code{df8} and found. However, if not found in the data frame, objects with matching names are searched for in the global environment (outside the data frame, and visible in the user's workspace or enclosing environment). With no variable \code{A} present in data frame \code{df8}, vector \code{A} from the environment is silently used in the chunk below resulting in a returned data frame with no rows as \code{A > 3} returns \code{FALSE}.

<<data-frames-5b>>=
A <- 1
subset(df8, A > 3)
@

This also applies to the expression passed as argument to parameter \code{select}, here shown as a way of selecting columns based on names stored in a character vector.

<<data-frames-5c>>=
columns <- c("x", "z")
subset(df8, select = columns)
@

The use of \Rfunction{subset()} is convenient, but more prone to bugs compared to directly using the extraction operator \code{[ ]}. This same ``cost'' of convenience applies to functions like \Rfunction{attach()} and \Rfunction{with()} described below. The longer a script is expected to be used, adapted, and reused, the more careful we should be when using any of these functions. An alternative way of avoiding excessive verbosity is to keep the names of data frames short.
\end{warningbox}
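
As a minimal sketch of the trade-off described in the box above, the same rows and columns can be extracted with \Roperator{[ ]} by referring to the column through the name of the data frame. The data frame \code{dfs1} is a hypothetical example created only for this illustration. As the column is accessed explicitly, no object from the workspace can be picked up by accident.

<<data-frames-sketch-subset-equiv>>=
# hypothetical data frame used only in this sketch
dfs1 <- data.frame(x = 1:6, y = "a", z = c(TRUE, FALSE))
# rows and columns selected with subset()
subset(dfs1, x > 3, select = c(x, z))
# same selection with the extraction operator
dfs1[dfs1$x > 3, c("x", "z")]
@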

A frequently used way of deleting a column by name from a data frame is to assign \code{NULL} to it---i.e., in the same way as members are usually deleted from \code{list}s. This approach modifies \code{df9} in place, rather than returning a modified copy of \code{df9}.

<<data-frames-6>>=
df9 <- df8
head(df9)
df9[["y"]] <- NULL
head(df9)
@

Alternatively, negative indexing can be used to remove columns from a copy of a data frame. In this example, a single column is removed. As base \Rlang does not support negative indexing by name with the extraction operator, the numerical index of the column to delete needs to be obtained first. (In contrast, \Rfunction{subset()} accepts negated bare names, as in \code{select = -y}, to delete columns.)

<<data-frames-6a>>=
df8[ , -which(colnames(df8) == "y")]
@%
\pagebreak

Instead of using the equality test, we can use the operator \code{\%in\%} or function \code{grepl()} to create a \code{logical} vector useful for deleting or selecting multiple columns in a single statement.
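
The approach just described can be sketched as follows, assuming a hypothetical data frame \code{dfs2} created only for this illustration. A \code{logical} index is negated with \code{!} rather than with a minus sign, and the pattern passed to \code{grepl()} is a regular expression matching names that start with \code{x}.

<<data-frames-sketch-cols-in>>=
# hypothetical data frame used only in this sketch
dfs2 <- data.frame(x1 = 1:3, x2 = 4:6, y = "a", z = TRUE)
# delete columns y and z: the logical vector is negated with !
dfs2[ , !colnames(dfs2) %in% c("y", "z")]
# select the columns whose names start with "x"
dfs2[ , grepl("^x", colnames(dfs2))]
@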

\begin{playground}
In the previous code chunk, we deleted column \code{y} of the data frame \code{df8}, but using the extraction operator, we modified only the returned copy of \code{df8}, leaving \code{df8} unchanged. Thus we reuse it here for a surprising trick. You should first untangle how it changes the positions of columns and rows, and afterwards think how and why indexing with the extraction operator \Roperator{[ ]} on both sides of the assignment operator \Roperator{<-} can be useful when working with data.

<<data-frames-7, eval=eval_playground>>=
df8[1:6, c(1,3)] <- df8[6:1, c(3,1)]
df8
@
\end{playground}

\begin{warningbox}
Although in this last example we used numeric indexes to make it more interesting, in practice, especially in scripts or other code that will be reused, do use column or member names instead of positional indexes whenever possible. This makes code much more reliable, as changes elsewhere in the script could alter the order of columns and \emph{invalidate} numerical indexes. In addition, using meaningful names makes programmers' intentions easier to understand.
\end{warningbox}

\subsection{Summarising and splitting data frames}\label{sec:calc:df:split}\label{sec:calc:df:aggregate}
Function\index{data frames!summarising} \Rfunction{summary()} can be used to obtain a summary from objects of most \Rlang classes, including data frames. It is also possible to use \Rloop{sapply()}, \Rloop{lapply()} or \Rloop{vapply()} to apply any suitable function to data by columns (see section \ref{sec:data:apply} on page \pageref{sec:data:apply} for a description of these functions and their use).

<<data-frames-7aaa>>=
summary(df8)
@

\index{data frames!splitting}
\Rlang function \Rfunction{split()} makes it possible to split a data frame into a list of data frames, based on the levels of a factor, even if the rows are not ordered according to factor levels.

We create a data frame with six rows and three columns. In the case of column \code{z}, we rely on \Rlang to automatically extend \code{c("a", "b")} by repeating it three times so as to fill the six rows.

<<data-frames-7aa>>=
df10 <- data.frame(x1 = 1:6, x2 = c(1, 5, 4, 2, 6, 3), z = c("a", "b"))
@

<<data-frames-7a>>=
split(df10, df10$z)
@

\begin{explainbox}
The same operation can be specified using a one-sided formula \code{\textasciitilde z} to indicate the grouping.

<<data-frames-7c>>=
split(df10, ~ z)
@

\end{explainbox}

Function \Rfunction{unsplit()} can be used to reverse splitting done by \Rfunction{split()}.
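
A minimal sketch of such a round trip, using a hypothetical data frame \code{dfs3} created only for this illustration: the grouping vector used for splitting is passed again to \Rfunction{unsplit()}, so that rows can be returned to their original positions.

<<data-frames-sketch-unsplit>>=
# hypothetical data frame used only in this sketch
dfs3 <- data.frame(x1 = 1:6, z = c("a", "b"))
lst.dfs3 <- split(dfs3, dfs3$z)
# the same grouping vector tells unsplit() where each row belongs
unsplit(lst.dfs3, dfs3$z)
@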

\begin{explainbox}
\Rfunction{split()} is sometimes used in combination with apply functions (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}) to compute group or treatment summaries. However, in most cases it is simpler to use \Rfunction{aggregate()} for computing such summaries.
\end{explainbox}
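
The two approaches mentioned in the box above can be compared in a short sketch, assuming a hypothetical data frame \code{dfs4} created only for this illustration. The first statement returns a named vector, while the second returns a data frame.

<<data-frames-sketch-split-apply>>=
# hypothetical data frame used only in this sketch
dfs4 <- data.frame(x1 = 1:6, z = c("a", "b"))
# group means with split() followed by sapply()
sapply(split(dfs4$x1, dfs4$z), mean)
# group means with aggregate()
aggregate(x1 ~ z, data = dfs4, FUN = mean)
@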

Related to splitting a data frame is the calculation of summaries based on a subset of cases, or more commonly summaries for all observations but after grouping them based on the values in a column or the levels of a factor.

\begin{faqbox}{How to summarise one variable from a data frame by group?}
To summarise a single variable by group, we can use \Rfunction{aggregate()}.

<<faq-aggregate-01>>=
aggregate(x = iris$Petal.Length,
          by = list(iris$Species), FUN = mean)
@

\end{faqbox}

\begin{faqbox}{How to summarise numeric variables from a data frame by group?}
To summarise variables, we can use \Rfunction{aggregate()} (see section \ref{sec:dplyr:group:wise} on page \pageref{sec:dplyr:group:wise} for an alternative approach using package \pkgnameNI{dplyr}).

<<faq-aggregate-02>>=
aggregate(x = iris[ , sapply(iris, is.numeric)],
          by = list(iris$Species), FUN = mean)
@

For these data, as the only non-numeric variable is \code{Species}, we could have also used formula notation as shown below.
\end{faqbox}

\begin{explainbox}
There\index{data frames!summarising} is also a formula-based \Rfunction{aggregate()} method (or ``variant'') available (\Rlang \emph{formulas} are described in depth in section \ref{sec:stat:formulas} on page \pageref{sec:stat:formulas}). In \Rfunction{aggregate()}, the left-hand side (\emph{lhs}) of the formula indicates the variable to summarise and its right-hand side (\emph{rhs}) the factor used to split or group the data before summarising them.

<<data-frames-7d>>=
aggregate(x1 ~ z, FUN = mean, data = df10)
@

We can summarise more than one column at a time.
<<data-frames-7e>>=
aggregate(cbind(x1, x2) ~ z, FUN = mean, data = df10)
@

If all the columns not used for grouping are valid input to the function passed as the argument to \code{FUN}, the formula can be simplified using a point (\code{.}) with the meaning ``all columns except those on the \emph{rhs} of the formula''.
<<data-frames-7f>>=
aggregate(. ~ z, FUN = mean, data = df10)
@

\end{explainbox}

Function \Rfunction{aggregate()} can also be used to aggregate time series data based on time intervals (see \code{help(aggregate)}).
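
A minimal sketch of the time series case, assuming a hypothetical monthly series \code{ts.monthly} created only for this illustration. With time series, the new number of observations per unit of time is passed through parameter \code{nfrequency}, here four per year, giving quarterly means.

<<data-frames-sketch-aggregate-ts>>=
# hypothetical monthly series covering two years
ts.monthly <- ts(1:24, start = c(2020, 1), frequency = 12)
# quarterly means computed from the monthly values
aggregate(ts.monthly, nfrequency = 4, FUN = mean)
@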

\subsection{Re-arranging columns and rows}
\index{data frames!ordering rows}\index{data frames!ordering columns}
As with members of vectors and lists, to change the position of columns or rows in a data frame we use the extraction operator and indexing by name or position. In a matrix-like object, such as a data frame, the first index corresponds to rows and the second to columns.

The most direct way of changing the order of columns and/or rows in data frames (as for matrices and arrays) is to use subscripting. Once we know the original position and target position we can use column names or positions as indexes on the right-hand side, listing all columns to be retained, even those remaining at their original position.

<<data-frames-8>>=
df11 <- data.frame(A = 1:10, B = 3, C = c("A", "B"))
head(df11, 2)
df11 <- df11[ , c("B", "A", "C")]
head(df11, 2)
@

\begin{warningbox}
When using the extraction operator \Roperator{[ ]} on both the left- and right-hand-sides, with a \code{numeric} vector as an argument to swap two columns, the vectors or factors are swapped, while the names of the columns are not!
To retain the correspondence between column naming and column contents after swapping or rearranging the columns \emph{using numeric indices}, we need to separately move the names of the columns. This may seem counter-intuitive, unless we think in terms of positions being named rather than the contents of the columns being linked to the names.\qRfunction{colnames()}\qRfunction{colnames()<-}

<<data-frames-8ax>>=
df11 <- data.frame(A = 1:10, B = 3, C = c("A", "B"))
head(df11, 2)
df11[ , 1:2] <- df11[ , 2:1]
head(df11, 2)
colnames(df11)[1:2] <- colnames(df11)[2:1]
head(df11, 2)
@

\end{warningbox}

Taking into account that \Rfunction{order()} returns the indexes needed to sort a vector (see page \pageref{box:vec:sort}), we can use \Rfunction{order()} to generate the indexes needed to sort the rows of a data frame. In this case, the argument to \Rfunction{order()} is usually a column of the data frame being arranged. However, any vector of suitable length, including the result of applying a function to one or more columns, can be passed as an argument to \Rfunction{order()}. Function \Rfunction{order()} is not useful for sorting columns of data frames \emph{based on data from the columns} as it requires a vector across columns as input, which is possible only when all columns are of the same class. (In the case of \Rclass{matrix} and \Rclass{array} this approach can be applied to any of their dimensions as all their elements homogeneously belong to one class.)

\begin{faqbox}{How to order columns or rows in a data frame?}
We use column names or numeric indexes with the extraction operator \Roperator{[ ]} only on the \emph{rhs} of the assignment. For example, to arrange the columns of data set \code{iris} in decreasing alphabetical order, we use \Rfunction{sort()} as shown, or \Rfunction{order()} (see page \pageref{box:vec:sort}).

<<faq-data-frames-01>>=
sorted_cols_iris <- iris[ , sort(colnames(iris), decreasing = TRUE)]
head(sorted_cols_iris, 5)
@

Similarly, we can use values in a column as argument to \Rfunction{order()} to obtain the \code{numeric} indices to sort rows.

<<faq-data-frames-02>>=
sorted_rows_iris <- iris[order(iris$Petal.Length), ]
head(sorted_rows_iris, 5)
@

\end{faqbox}

\begin{advplayground}\index{data frames!ordering rows}
Create a new data frame containing three numeric columns with three different haphazard sequences of values and a factor with two levels. Call these columns \code{A}, \code{B}, \code{C} and \code{F}. 1) Sort the rows of the data frame so that the values in \code{A} are in decreasing order. 2) Sort the rows of the data frame according to increasing values of the sum of \code{A} and \code{B} without adding a new column to the data frame or storing the vector of sums in a variable. In other words, do the sorting based on sums calculated on-the-fly. 3) Sort the rows by level of factor \code{F}, and 4) by level of factor \code{F} and by values in \code{B} within each factor level. Hint: revisit the exercise on page \pageref{calc:ADVPG:order:sort} where the use of \Rfunction{order()} on factors is described.
\end{advplayground}

\subsection{Re-encoding or adding variables}

It is common that some variables need to be added to an existing data frame based on existing variables, either as a computed value or based on mapping, for example, treatments to sample codes already in a data frame. In the second case, named\index{named vectors!mapping with} vectors can be used to replace values in a variable or to add a variable to a data frame.

Mapping is possible because the length of the value returned by the extraction operator \Roperator{[ ]} is given by the length of the indexing vector (see section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}). Although we show toy-like examples, this approach is most useful with data frames containing many rows.

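If the existing variable is a character vector, we need to create a named vector with the new values as data and the existing values as names. A factor should first be converted with \Rfunction{as.character()}, as indexing with a factor uses its integer codes rather than its labels.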

<<data-frames-9>>=
df12 <-
  data.frame(genotype = rep(c("WT", "mutant1", "mutant2"), 2),
             value = c(1.5, 3.2, 4.5, 8.2, 7.4, 6.2))
mutant <- c(WT = FALSE, mutant1 = TRUE, mutant2 = TRUE)
df12$mutant <- mutant[df12$genotype]
df12
@
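
When the existing variable is a factor, the mapping needs some care. The sketch below uses a hypothetical factor \code{gtype} and a hypothetical mapping vector \code{is.mutant}, with the order of the factor levels set explicitly so that it differs from the order of the names in the mapping vector. Indexing with the factor itself is equivalent to indexing with its integer codes, while converting it to \code{character} matches by name.

<<data-frames-sketch-map-factor>>=
# hypothetical data: genotype stored as a factor
gtype <- factor(c("WT", "mutant1", "mutant2", "WT"),
                levels = c("mutant1", "mutant2", "WT"))
is.mutant <- c(WT = FALSE, mutant1 = TRUE, mutant2 = TRUE)
# indexing with the factor uses its integer codes as positions
is.mutant[gtype]
# converting to character first matches values to names
is.mutant[as.character(gtype)]
@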

If the existing variable is an \code{integer} vector, we can use a vector without names, being careful that the positions in the \emph{mapping} vector match the values of the existing variable.

<<data-frames-10>>=
df13 <- data.frame(individual = rep(1:3, 2),
                   value = c(1.5, 3.2, 4.5, 8.2, 7.4, 6.2))
genotype <- c("WT", "mutant1", "mutant2")
df13$genotype <- genotype[df13$individual]
df13
@

\begin{advplayground}
Add a variable named \code{genotype} to the data frame below so that for individual \code{4} its value is \code{"WT"}, for individual \code{1} its value is \code{"mutant1"}, and for individual \code{2} its value is \code{"mutant2"}.

<<data-frames-11, eval=eval_playground>>=
DF1 <- data.frame(individual = rep(c(2, 4, 1), 2),
                  value = c(1.5, 3.2, 4.5, 8.2, 7.4, 6.2))
@
\end{advplayground}

\subsection{Operating within data frames}\label{sec:calc:df:with}

In the case of computing new values from existing variables, named vectors are of limited use. Instead, variables in a data frame can be added or modified with \Rlang functions \Rscoping{transform()}, \Rscoping{with()} and \Rscoping{within()}. These functions can be thought of as convenience functions, as the same computations can be done using the extraction operators to access individual variables, in the \emph{lhs}, \emph{rhs}, or both \emph{lhs} and \emph{rhs} (see section \ref{sec:calc:indexing} on page \pageref{sec:calc:indexing}).

In the case of \Rscoping{with()}, only one, possibly compound, code statement is affected, and this statement is passed as an argument. The left-hand side of the assignment still needs to be fully specified. The value returned is the one returned by the statement passed as an argument or, in the case of compound statements, the value returned by the last contained simple code statement to be executed. Consequently, if the intent is to modify the container, assignment to an individual member variable (column in this case) is required.

In this example, column \code{A} of \code{df14} takes precedence, and the returned value is the expected one.

<<data-frames-EB-12>>=
df14 <- data.frame(A = 1:10, B = 3)
df14$C <- with(df14, (A + B) / A) # add column
head(df14, 3)
@

In the case of \Rscoping{within()}, assignments in the argument to its second parameter affect the object returned, which is a copy of the container (in this case, a whole data frame), which still needs to be saved through assignment. Here we assign it to a new name, \code{df15}, so as not to overwrite the original data frame, but it could have been assigned back to the same name if the intention were to modify \code{df14}.

<<data-frames-EB-13>>=
df14$C <- NULL
df15 <- within(df14, C <- (A + B) / A) # modified copy
head(df15, 3)
@

In the example above, using \code{within()} instead of \Rscoping{with()} makes little difference to the amount of typing or clarity of the code, but with multiple member variables being operated upon, as shown below, using \Rscoping{within()} results in more concise and easier to understand code.

<<data-frames-EB-14>>=
df16 <- within(df14,
               {C <- (A + B) / A
                D <- A * B
                E <- A / B + 1}
               )
head(df16, 3)
@
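
Function \Rscoping{transform()}, mentioned at the start of this section, can be sketched in a similar way, here with hypothetical data frames \code{dfs5} and \code{dfs6} created only for this illustration. New columns are passed as named arguments and, like \Rscoping{within()}, the function returns a modified copy. The expressions are evaluated using the columns of the data frame passed as first argument, so a column created in a call is not available to other expressions in the same call.

<<data-frames-sketch-transform>>=
# hypothetical data frame used only in this sketch
dfs5 <- data.frame(A = 1:10, B = 3)
# each new column is given as a name = expression pair
dfs6 <- transform(dfs5, C = (A + B) / A, D = A * B)
head(dfs6, 3)
@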

\begin{explainbox}
Repeatedly pre-pending the name of a \emph{container}, such as a list or data frame, to the name of each member variable being accessed can make \Rlang code verbose and difficult to understand. Functions \Rscoping{attach()} and its matching \Rscoping{detach()} allow us to change where \Rlang first looks for the names of objects mentioned in a code statement. When using a long name for a data frame, entering a simple calculation can easily result in a difficult-to-read statement. Here even with a very short name for the data frame, the verbosity compared to the last chunk above is clear.

<<data-frames-EB-10>>=
df14$C <- (df14$A + df14$B) / df14$A
df14$D <- df14$A * df14$B
df14$E <- df14$A / df14$B + 1
head(df14, 3)
@

Using\index{data frames!attaching}\label{par:calc:attach} \Rscoping{attach()} we can alter where \Rlang looks up names and consequently simplify the statement. With \Rscoping{detach()} we can restore the original state. It is important to remember that here we can only simplify the right-hand side of the assignment, while the ``destination'' of the result of the computation still needs to be fully specified on the left-hand side of the assignment operator. We include below only one statement between \Rscoping{attach()} and \Rscoping{detach()} but multiple statements are allowed. Furthermore, if variables with the same names as the columns exist earlier in the search path, for example in the global environment, these take precedence, something that can result in bugs or crashes. As seen below, a message warns that variable \code{A} from the global environment will be used instead of column \code{A} of the attached \code{df17}. The returned value is, of course, not the desired one.

<<data-frames-EB-11a>>=
df17 <- data.frame(A = 1:10, B = 3)
A
attach(df17)
A
detach(df17)
A
@

<<data-frames-EB-11>>=
attach(df17)
df17$C <- (A + B) / A
detach(df17)
head(df17, 2)
@

Use of \Rscoping{attach()} and \Rscoping{detach()}, which work as a pair of ON and OFF switches, can result in an undesired after-effect on name lookup if the script terminates after \Rscoping{attach()} is executed but before \Rscoping{detach()} is called, as the attached object is not detached. In contrast, \Rscoping{with()} and \Rscoping{within()}, being self-contained, guarantee that cleanup takes place. Consequently, the usual recommendation is to give preference to the use of \Rscoping{with()} and \Rscoping{within()} over \Rscoping{attach()} and \Rscoping{detach()}.
\end{explainbox}

<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Reshaping and Editing Data Frames}\label{sec:calc:reshape}
\index{data frames!long vs.\ wide shape}

As mentioned above, in most cases in \Rlang, the rows of a data frame represent measurement events or observations, possibly on multiple response variables, together with factors describing groupings, i.e., a ``long'' shape. However, when measurements are repeated in time, columns rather frequently represent observations of the same response variable at different times, i.e., a ``wide'' shape. Other cases exist where reshaping is needed. Function \Rfunction{reshape()} can convert wide data frames into long data frames and vice versa. See section \ref{sec:data:reshape} on page \pageref{sec:data:reshape} on package \pkgnameNI{tidyr} for an alternative approach to reshaping data with a friendlier user interface.

We start by creating a data frame of hypothetical data measured on two occasions. With these data, for example, if we wish to compute the growth of each subject by computing the difference in \code{Weight} and \code{Height} between the two time points, one approach is to reshape the data frame into a wider shape and subsequently subtract the columns.

<<data-frames-reshape-01>>=
# artificial data
df1 <- data.frame(id = rep(1:4, rep(2,4)),
                  Time = factor(rep(c("Before","After"), 4)),
                  Weight = rnorm(n = 8, mean = c(20.1, 30.8)),
                  Height = rnorm(n = 8, mean = c(9.5, 14.2)))
df1
# make it wider
df2 <- reshape(df1, timevar = "Time", idvar = "id", direction = "wide")
df2
# possible further calculation
within(df2,
       {
        Height.growth <- Height.After - Height.Before
        Weight.growth <- Weight.After - Weight.Before
       })
@

Alternatively, we may want to convert \code{df1} into a longer shape, with a single column with measurements, and a new column indicating whether the measured variable was \code{Height} or \code{Weight}. For this operation to succeed, we need to add a column with a unique value for each row in \code{df1}, and one easy way is to copy row names into a column. The names of the parameters of function \Rfunction{reshape()} are meaningful only when dealing with time series. Thus, reading the code below becomes rather difficult. It is also to be noted that the user is responsible for passing the values to \code{times} in the correct order.

<<data-frames-reshape-02>>=
df1$ID <- rownames(df1) # unique ID for each row
# make it longer
reshape(df1,
        idvar = "ID",
        timevar = "Quantity",
        times = c("Weight", "Height"),
        v.names = "Value",
        direction = "long",
        varying = c("Weight", "Height"))
@

To edit a data frame programmatically, one can use the approaches already discussed, using the extraction operators \Roperator{[ ]} or \Roperator{[[ ]]} on the \emph{lhs} of \Roperator{<-} to replace member elements. This in combination with functions like \Rfunction{gsub()} makes it possible to ``edit'' the contents of data frames.

Functions \Rfunction{View()}, \Rfunction{edit()} and \Rfunction{fix()} can be used interactively to display and edit \Rlang objects. When using \Rpgrm from within IDEs like \RStudio, calling these functions with a data frame as argument opens in most cases the IDE's own worksheet-like data editors, and for other types of objects a text editor pane. Output is not included for this chunk, as the use of these functions requires user interaction. Please run these examples in \Rpgrm and in an IDE like \RStudio.

<<exploring-dfs-0a, eval=FALSE>>=
View(cars)
edit(cars)
@

\begin{explainbox}
These functions can be used at the \Rlang console also when \Rpgrm is used on its own, but the editors activated are different ones. In any case, the use of scripts has made the interactive use of \Rpgrm at the console less frequent and the need to edit \Rlang objects previously saved in the user's current workspace nearly disappear. \Rfunction{View()}, \Rfunction{edit()} and \Rfunction{fix()} are unusual in that their definitions are dependent on system variables that, at least when using \Rpgrm on its own, can be modified by the user.
\end{explainbox}

\index{data frames|)}

<<echo=FALSE,cache=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Attributes of \Rlang Objects}\label{sec:calc:attributes}
\index{attributes|(}

\Rlang objects can have attributes. Attributes are named \emph{slots} that function as additional fields in which ancillary data, such as object properties, can be stored in any \Rlang object. There are no restrictions on the class of what is assigned to an attribute. They can be used to store metadata accompanying the data stored in an object, which is important for reproducible research and data sharing. They can be set and read by user code, and they are also used internally by \Rlang, among other things, to store the class an object belongs to, column and row names in data frames and matrices, and the labels of levels in factors. Although most \Rlang objects have attributes, they are rarely displayed explicitly when an object is printed, while the structure of objects as displayed by function \Rfunction{str()} includes them.

Although we rarely need to set or extract values stored in attributes explicitly, many of the features of \Rlang that we take for granted are implemented using attributes: column names in data frames are stored in an attribute. Matrices are vectors with additional attributes.

<<attributes-00>>=
df1 <- data.frame(x = 1:6, y = c("a", "b"), z = c(TRUE, FALSE, NA))
df1
attributes(df1)
str(df1)
@

Attribute \code{"comment"} is meant to be set by users to store a character string---e.g., to store metadata as text together with data. As comments are frequently used, \Rlang has functions for accessing and setting comments. \qRfunction{comment()}\qRfunction{comment()<-}

<<attributes-01>>=
comment(df1)
comment(df1) <- "this is stored as a comment"
comment(df1)
@

Functions like \Rfunction{names()}, \Rfunction{dim()} or \Rfunction{levels()} return values retrieved from attributes stored in \Rlang objects, whereas \Rfunction{names()<-}, \Rfunction{dim()<-} or \Rfunction{levels()<-} set (or unset with \code{NULL}) the value of the respective attributes. Dedicated query and set functions do not exist for all attributes. Functions \Rfunction{attr()}, \Rfunction{attr()<-} and \Rfunction{attributes()} can be used with any attribute. With \Rfunction{attr()} we query, and with  \Rfunction{attr()<-} we set individual attributes by name. With \Rfunction{attributes()} we retrieve all attributes of an object as a named \code{list}. In addition, method \Rfunction{str()} displays the structure of an \Rlang object with all its components, including their attributes.

Continuing with the previous example, we can retrieve and set the value stored in the \code{"comment"}  attribute using these functions. In the second statement, we delete the value stored in the attribute by assigning \code{NULL} to it.

<<attributes-01a>>=
attr(df1, "comment")
attr(df1, "comment") <- NULL
attr(df1, "comment")
comment(df1) # same as previous line
@

The \code{"names"} attribute of \code{df1} was set by the \code{data.frame()} constructor when it was created above. In the next example, in the first two statements we retrieve the names and implicitly print them. In the third statement, read from right to left, we retrieve the names, convert them to upper case, and save them back to the same attribute. \qRfunction{colnames()}\qRfunction{colnames()<-}

<<attributes-02>>=
names(df1)
colnames(df1) # same as names()
colnames(df1) <- toupper(colnames(df1))
colnames(df1)
attr(df1, "names") # same as previous line
@

\begin{advplayground}
  In general, \Rlang objects do not have by default names assigned to members. As seen on page \pageref{par:calc:vector:map}, we can give names to vector members during construction with a call to \Rfunction{c()} or we can assign names (set attribute \code{names}) with function \Rfunction{names()<-} to existing vectors. Lists behave almost the same as vectors, although members of nested objects can also be named. Data frames have attributes \code{names} and \code{row.names}, that can be accessed with functions \Rfunction{names()} or \Rfunction{colnames()}, and function \Rfunction{rownames()}, respectively. The attributes can be set with functions \Rfunction{names()<-} or \Rfunction{colnames()<-}, and \Rfunction{rownames()<-}. The \Rfunction{data.frame()} constructor sets (column) names and row names by default. The \Rfunction{matrix()} constructor by default does not set \code{dimnames} or \code{names} attributes. When names are assigned to a \code{matrix} with \Rfunction{names()<-}, the matrix behaves like a vector, and the names are assigned to individual members. Functions \Rfunction{dimnames()<-}, \Rfunction{colnames()<-}, and \Rfunction{rownames()<-} are used to assign names to columns and rows. The matching functions \Rfunction{dimnames()}, \Rfunction{colnames()} and \Rfunction{rownames()} are used to access these values.

  When no names have been set, \Rfunction{names()}, \Rfunction{colnames()}, \Rfunction{rownames()}, and \Rfunction{dimnames()} return \code{NULL}. In contrast, \Rfunction{labels()}, intended to be used for printing, returns made-up names based on positions.

  Run the examples below and write similar examples for a \code{list} and a \code{data.frame}. For \code{matrix}, write an additional statement that uses \Rfunction{dimnames()<-} to set row and column names simultaneously.

<<attributes-names-ebx-01, eval=eval_playground>>=
VCT1 <- 5:10
names(VCT1)
labels(VCT1)
names(VCT1) <- letters[5:10]
names(VCT1)
labels(VCT1)
@

<<attibutes-names-ebx-02, eval=eval_playground>>=
MAT1 <- matrix(1:10, ncol = 2)
dimnames(MAT1)
labels(MAT1)
colnames(MAT1) <- c("a", "b")
colnames(MAT1)
dimnames(MAT1)
labels(MAT1)
@
\end{advplayground}

We can add a new attribute, under our own control, as long as its name does not clash with those of existing attributes.

<<attributes-02a>>=
attr(df1, "my.attribute") <- "this is stored in my attribute"
attributes(df1)
@

\begin{explainbox}
The attributes used internally by \Rlang can be directly modified by user code. In most cases, this is unnecessary as \Rlang provides pairs of functions to query and set the relevant attributes. This is true for the attributes \code{dim}, \code{names} and \code{levels}. In the example below, we read the attributes from a matrix.

<<attibutes-ebx-01a>>=
mat1 <- matrix(1:10, ncol = 2)
attributes(mat1)
dim(mat1)
dimnames(mat1)
@

<<attibutes-ebx-01aa>>=
labels(mat1)
mat1
@

<<attibutes-ebx-01b>>=
attr(mat1, "dim")
attr(mat1, "dim") <- c(2, 5)
mat1
@

<<attibutes-ebx-01c>>=
attr(mat1, "dim") <- NULL
is.vector(mat1)
mat1
@

In this case we can also use \Rfunction{dim()}.

<<attibutes-ebx-01d>>=
dim(mat1) <- NULL
is.vector(mat1)
@
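
The same pairing of query and set functions applies to \code{levels}. The chunk below is only a minimal sketch, using a small factor named \code{fct1} that I made up for this illustration: the levels are read with \Rfunction{levels()}, renamed with \Rfunction{levels()<-}, and then read back directly from the attribute.

<<attributes-levels-sketch>>=
fct1 <- factor(c("a", "b", "a")) # hypothetical example data
attributes(fct1)                 # a factor stores levels and class
levels(fct1)                     # query function
levels(fct1) <- c("A", "B")      # set function, renames the levels
attr(fct1, "levels")             # the same attribute, read directly
fct1
@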

\end{explainbox}

\begin{warningbox}
There is no restriction on the creation, setting, resetting, and reading of attributes, but not all functions and operators that can be used to modify objects will preserve non-standard attributes. This can be a problem when using some \Rlang packages, such as those in the \pkgname{tidyverse}. So, using private attributes is a double-edged sword that is usually worth considering only when designing a new class together with the corresponding methods for it. The values returned by model fitting functions like \Rfunction{lm()} are good examples of the extensive use of class-specific attributes (see section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}).
\end{warningbox}
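
The loss of non-standard attributes can be seen even with base \Rlang operators. The chunk below is a minimal sketch, with a named vector \code{vct2} and an attribute name chosen only for this illustration: arithmetic keeps the private attribute, while extraction with \code{[ ]} keeps only the names.

<<attributes-lost-sketch>>=
vct2 <- c(a = 1, b = 2, c = 3)                # hypothetical example data
attr(vct2, "my.attribute") <- "some metadata"
attributes(vct2)
attributes(vct2 * 2)  # arithmetic keeps the private attribute
attributes(vct2[1:2]) # extraction keeps only the names
@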

<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\index{attributes|)}

\section{Saving and Loading Data}

\subsection{Data sets in \Rlang and packages}
\index{data!loading data sets|(}\index{data!saving data sets|(}
To be able to present more meaningful examples, we need some real data. Here we use \code{cars}, one of the many data sets included in base \Rpgrm. Function \Rfunction{data()} is used to load data objects that are included in \Rlang or contained in packages (whether a call to \Rfunction{data()} is needed or not depends on how the package where the data objects are defined was configured). It is also possible to import data saved in files with \textit{foreign} formats, defined by other software or commonly used for data exchange. Package \pkgname{foreign}, included in the \Rlang distribution, as well as contributed packages make available functions capable of reading and decoding various foreign formats. How to read or import `foreign' data is discussed in the \Rlang documentation, in the manual \emph{R Data Import/Export}, and in this book, in chapter \ref{chap:R:data:io} on page \pageref{chap:R:data:io}. It is also good to keep in mind that in \Rlang, URLs (Uniform Resource Locators) are accepted as arguments to the \code{file} or \code{path} parameter of many functions (see section \ref{sec:files:remote} on page \pageref{sec:files:remote}).

In the next example, we load data available in \Rlang package \pkgname{datasets} as \Rlang objects by calling function \Rfunction{data()}. The loaded \Rlang object \code{cars} is a data frame. (Package \pkgname{datasets} is part of the \Rpgrm distribution and is always available.)

<<data-1>>=
data(cars)
@

%Once we have a data set available, the first step is usually to explore it, and we will do this with \code{cars} in section \ref{sec:calc:looking:at:data} on page \pageref{sec:calc:looking:at:data}.
%\index{data!loading data sets|)}

\subsection{.rda files}\label{sec:data:rda}

By\index{file formats!RDA ``R data, multiple objects''} default, at the end of a session, the current workspace containing the results of one's work is saved into a file called \code{.RData}. In addition to saving the whole workspace, it is possible to save one or more \Rlang objects present in the workspace to disk using the same file format (with file name tag \code{.rda} or \code{.Rda}). One or more objects, belonging to any mode or class, can be saved into a single file using function \Rfunction{save()}. Reading the file restores all the saved objects into the current workspace with their original names. These files are portable across most \Rlang versions; i.e., old formats can be read and written by newer versions of \Rpgrm, although the newer, default format may not be readable with earlier \Rpgrm versions. Whether compression is used, and whether the ``binary'' data are encoded into ASCII characters, allowing maximum portability at the expense of increased size, can be controlled by passing suitable arguments to \Rfunction{save()}.

We create a data frame object and then save it to a file. The file name can be any name valid in the operating system; however, to ensure compatibility with multiple operating systems, it is good to use only ASCII characters. Although not enforced, using the name tag \code{.rda} or \code{.Rda} is recommended.

<<rda-01>>=
df1 <- data.frame(x = 1:5, y = 5:1)
df1
save(df1, file = "df1.rda")
@

We delete the data frame object and confirm that it is no longer present in the workspace (see page \pageref{par:calc:remove} for details about \Rfunction{remove()} and \Rfunction{objects()}).

<<rda-02>>=
remove(df1)
objects(pattern = "df1")
@

We read the file we earlier saved to restore the object.\qRfunction{load()}

<<rda-03>>=
load(file = "df1.rda")
objects(pattern = "df1")
df1
@
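
Because \Rfunction{load()} restores objects under their original names, it silently overwrites any object of the same name already present. The chunk below is a minimal sketch of this behaviour and of one way to avoid it. The object and file names are made up for this illustration, and a temporary file is used so that nothing is left behind. \Rfunction{load()} invisibly returns the names of the restored objects, and its parameter \code{envir} makes it possible to restore them into a separate environment.

<<rda-load-sketch>>=
rda.file <- tempfile(fileext = ".rda")  # temporary file, an assumption
vct3 <- 1:3
save(vct3, file = rda.file)
vct3 <- "modified after saving"
restored.names <- load(file = rda.file) # silently overwrites vct3
restored.names
vct3
env1 <- new.env()
load(file = rda.file, envir = env1)     # restore into env1 instead
objects(envir = env1)
file.remove(rda.file)
@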

The default format used is binary and compressed, which results in smaller files.
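
As a rough sketch of the trade-off between the two encodings, the chunk below saves the same object in the default format and with \code{ascii = TRUE}, and then compares the sizes of the two files. The data and the temporary file names are made up for this illustration, and the sizes obtained depend on the data saved.

<<rda-ascii-sketch>>=
# hypothetical example data: 1000 rows of numbers
num.data <- data.frame(x = 1:1000, y = sqrt(1:1000))
bin.file <- tempfile(fileext = ".rda")
txt.file <- tempfile(fileext = ".rda")
save(num.data, file = bin.file)               # binary and compressed
save(num.data, file = txt.file, ascii = TRUE) # encoded as ASCII text
file.size(c(bin.file, txt.file))              # sizes in bytes
file.remove(c(bin.file, txt.file))
@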

\begin{playground}
In the example above, only one object was saved, but one can simply give the bare names of additional objects as arguments separated by commas ahead of \code{file}. Just try saving more than one data frame to the same file. Then the data frames plus a few vectors. After creating each file, clear the workspace and then restore from the file the objects you saved.
\end{playground}

Sometimes it is easier to supply the names of the objects to be saved as a vector of \code{character} strings passed as an argument to parameter \code{list} (in spite of the name, the argument passed must be a \code{vector}, not a \code{list}). One use case is saving a group of objects based on their names. In this case, one can use \Rfunction{objects()} (also available as \Rfunction{ls()}) to obtain a vector of \code{character} strings with the names of objects matching a simple \code{pattern} or a complex \emph{regular expression} (see section \ref{sec:calc:regex} on page \pageref{sec:calc:regex}). The example below uses this approach in two steps, first saving in variable \code{dfs} a \code{character} \code{vector} with the names of the objects matching a pattern, and then using this saved vector as an argument to parameter \code{list} in the call to \Rfunction{save()}.

<<rda-04>>=
# regular expression: names made of "df" followed by digits, such as df1
dfs <- objects(pattern = "^df[0-9]+$")
save(list = dfs, file = "my-dfs.rda")
@

The two statements above can be combined into a single statement by nesting the function calls.

<<rda-05>>=
save(list = objects(pattern = "^df[0-9]+$"), file = "my-dfs.rda")
@
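
Patterns are interpreted as regular expressions, not as the ``wildcards'' used for file names in operating systems. The chunk below is a minimal sketch that tests a few patterns on a separate environment, \code{env2}, holding objects whose names I made up for this illustration, so that the result does not depend on what is present in the workspace.

<<rda-patterns-sketch>>=
# hypothetical objects, kept apart from the workspace in env2
env2 <- list2env(list(df1 = 1, df2 = 2, my.df = 3, vct1 = 4))
objects(envir = env2)
objects(envir = env2, pattern = "^df")    # names starting with "df"
objects(envir = env2, pattern = "df$")    # names ending in "df"
objects(envir = env2, pattern = "[0-9]$") # names ending in a digit
@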

\begin{playground}
Practice using different patterns with \Rfunction{objects()}. You do not need to save the objects to a file. Just have a look at the list of object names returned.
\end{playground}

As a coda, I show how to clean up by deleting the two files we created. Function \Rfunction{file.remove()} can be used to delete files stored in the operating system file system, usually on a hard disk drive or a solid state drive, as long as the user has enough rights. No confirmation is requested, so care is required not to delete valuable files. Function \Rfunction{unlink()} is not an exact equivalent, as it can also delete folders and supports recursion through nested folders. The name \emph{unlink} is borrowed from that of the equivalent function in \osnameNI{Unix} and \osnameNI{Linux}.

<<rda-06>>=
file.remove(c("my-dfs.rda", "df1.rda"))
@
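
The difference between the two functions can be sketched as follows. The folder and file names are made up for this illustration and are created inside the temporary folder of the session, so that no valuable files are at risk.

<<rda-unlink-sketch>>=
tmp.dir <- file.path(tempdir(), "scratch-folder") # hypothetical folder
dir.create(tmp.dir)
file.create(file.path(tmp.dir, c("a.txt", "b.txt")))
file.remove(file.path(tmp.dir, "a.txt")) # deletes a single file
unlink(tmp.dir, recursive = TRUE)        # deletes the folder and its contents
dir.exists(tmp.dir)
@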

\subsection{.rds files}\label{sec:data:rds}

The\index{file formats!RDS ``R data, single object''} RDS format is used to save individual objects instead of multiple objects (usually using file name tag \code{.rds}). RDS files are read and saved with functions \Rfunction{readRDS()} and \Rfunction{saveRDS()}, respectively. The value returned by a call to \Rfunction{readRDS()} is the object read from the file on disk. Unlike when RDA files are loaded, when RDS files are read, assigning the object read to a name is frequently the first step. This name can be any valid \Rlang name. Of course, it is also possible to use the object returned by \Rfunction{readRDS()} as an argument to a function by nesting the function calls.

<<rds-1>>=
saveRDS(df1, "df1.rds")
@

If we read the file at the \Rpgrm console, by default the read \Rlang object will be printed at the console.

<<rds-1a>>=
readRDS("df1.rds")
@

If we assign the read object to a different name, it is possible to check if the object read is identical to the one saved.

<<rds-2>>=
df2 <- readRDS("df1.rds")
identical(df1, df2)
@
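
Nesting the call to \Rfunction{readRDS()} within a call to another function can be sketched as below. The temporary file and the variable \code{n.rows} are assumptions made only for this illustration; the object read from the file is used directly and is never assigned to a name.

<<rds-nested-sketch>>=
rds.file <- tempfile(fileext = ".rds")  # temporary file, an assumption
saveRDS(data.frame(x = 1:5, y = 5:1), file = rds.file)
n.rows <- nrow(readRDS(rds.file))       # the object read is not given a name
n.rows
colnames(readRDS(rds.file))
file.remove(rds.file)
@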

As above, we clean up by deleting the file.

<<rds-03>>=
file.remove("df1.rds")
@

\subsection{\code{dput()}}

In\index{file formats!R data ``deparsed object''} general, the use of \code{.rda} and \code{.rds} files is preferred. Function \Rfunction{dput()} is sometimes used to share data as part of a code chunk at StackOverflow, mostly as a convenient way of converting a data frame or list into plain text that can be pasted into the code chunk listing to reconstruct the object. If no argument is passed to parameter \code{file}, the result of deparsing an object is printed at the \Rlang console.

<<dput-01>>=
dput(df1)
@

There exists a companion function \Rfunction{dget()} to recreate the object.
\index{data!saving data sets|)}\index{data!loading data sets|)}
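
A round trip through a text file can be sketched as below. The object and file names are made up for this illustration, and a temporary file is used. With data more complex than these integer columns, the text representation is not guaranteed to reproduce the original object exactly.

<<dput-dget-sketch>>=
orig.data <- data.frame(x = 1:5, y = 5:1)  # hypothetical example data
dput.file <- tempfile(fileext = ".R")      # temporary file, an assumption
dput(orig.data, file = dput.file)
readLines(dput.file)                       # plain text containing R code
copy.data <- dget(file = dput.file)
all.equal(orig.data, copy.data)
file.remove(dput.file)
@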

\begin{warningbox}
  Output to, and input from, text-based file formats as well as to and from various binary formats \emph{foreign} to \Rlang is described in chapter \ref{chap:R:data:io} on page \pageref{chap:R:data:io}.
\end{warningbox}

<<include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

\section{Plotting}
\index{plots!base R graphics}
In most cases, the most effective way of obtaining an overview of a data set is by plotting it using multiple approaches. The base-\Rlang generic method \Rfunction{plot()} has specialisations suitable for different kinds of objects (see section \ref{sec:script:objects:classes:methods} on page \pageref{sec:script:objects:classes:methods} for a brief introduction to objects, classes and methods). In this section, I very briefly demonstrate the use of the most common base-\Rlang graphics functions. They are well described in the book \citebooktitle{Murrell2019} \autocite{Murrell2019}. I describe in detail the use of the \emph{layered grammar of graphics} and plotting with package \ggplot in chapter \ref{chap:R:plotting} on page \pageref{chap:R:plotting}.

\subsection{Plotting data}
It is possible to pass two vectors (here columns from a data frame) directly as arguments to the \code{x} and \code{y} parameters of function \Rfunction{plot()}. (The plot is shown farther down, as the three approaches create identical plots.)

<<plot-0, include=FALSE, cache=FALSE>>=
opts_chunk$set(opts_fig_narrow_square)
@

<<plot-1a, eval=FALSE>>=
plot(x = cars$speed, y = cars$dist)
@

It is also possible to use \Rfunction{with()} or \Rfunction{attach()} as described in section \ref{sec:calc:df:with} on page \pageref{sec:calc:df:with}.

<<plot-1b, eval=FALSE>>=
with(cars, plot(x = speed, y = dist))
@

However, it is better to use a \emph{formula} to specify the variables to be plotted on the $x$ and $y$ axes, passing as an argument to parameter \code{data} a data frame containing these variables as columns. The formula \code{dist \textasciitilde\ speed} is read as \code{dist} explained by \code{speed}; i.e., \code{dist} is mapped to the $y$-axis as the dependent variable and \code{speed} to the $x$-axis as the independent variable. The names used in the formula are those of columns in the \code{data.frame}. As described in section \ref{sec:stat:mf} on page \pageref{sec:stat:mf}, the same syntax is used to describe models to be fitted to observations.

<<plot-1c>>=
plot(dist ~ speed, data = cars)
@

Within \Rlang there exist different specialisations, or ``flavours'', of method \Rfunction{plot()} that become active depending on the class of the variables passed as arguments: passing two numerical variables results in a scatter plot as seen above. In contrast, passing one factor and one numeric variable to \code{plot()} results in a box-and-whiskers plot being produced. Use \code{help("chickwts")} to learn more about this data set, also included in \Rpgrm.

<<plot-3>>=
plot(weight ~ feed, data = chickwts)
@
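
The classes that lead to the two different kinds of plots above can be checked directly. The chunk below is only a sketch: it queries the class of each explanatory variable and uses \Rfunction{getS3method()} to confirm that a specialisation of \Rfunction{plot()} for factors is available in the session.

<<plot-dispatch-sketch>>=
class(cars$speed)    # explanatory variable in the scatter plot
class(chickwts$feed) # explanatory variable in the box-and-whiskers plot
# is a plot() specialisation for class factor available?
!is.null(getS3method("plot", "factor", optional = TRUE))
@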

\subsection{Graphical output}
Graphical\index{file formats!PDF}\index{file formats!PNG} output, such as produced by \Rfunction{plot()}, is rendered by \emph{graphical output devices}.
When \Rlang is used interactively, a software device is opened automatically to send the graphical output to a physical device, usually the computer screen. The name of the \Rlang software device used may depend on the operating system (e.g., \osname{MS-Windows} or \osname{Linux}), or on the IDE (e.g., \RStudio).

In \Rlang, software graphical devices do not necessarily generate output on a physical device like a printer, as several of these devices translate the plotting commands into a file format and save it to disk. Graphical devices in \Rlang differ in the kind of output they produce: raster or bitmap files (e.g., TIFF, PNG, and JPEG formats), vector graphics files (e.g., SVG, EPS, and PDF), or output to a physical device like the screen of a computer. Additional devices are available through contributed \Rlang packages.

\RStudio makes it possible to export plots into graphic files through a menu-based interface in the \emph{Plots} viewer tab. This interface uses some of the same graphic devices that are available at the console and through scripts. For reproducibility, it is preferable to include the \Rlang commands used to export plots in the scripts used for data analysis.

Devices follow the paradigm of ON and OFF switches, opening and closing a destination for \code{print()}, \code{plot()} and related functions. Some devices that produce a file as output save their output one plot at a time to single-page graphic files, while others write the file only when the device is closed, possibly as a multi-page file.

When opening a device the user supplies additional information. For the PDF and SVG devices that produce output in a vector-graphics format, width and height of the output are specified in \emph{inches}. A default file name is used unless we pass a \code{character} string as an argument to parameter \code{file}.

<<gr-devices-01, message=FALSE>>=
pdf(file = "output/my-file.pdf", width = 6, height = 5, onefile = TRUE)
plot(dist ~ speed, data = cars)
plot(weight ~ feed, data = chickwts)
dev.off()
@
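
The switching ON and OFF of a device can be made visible with \Rfunction{dev.cur()}, which returns the currently active device. The chunk below is a minimal sketch that writes to a temporary file; the file name and the variable \code{active.dev} are assumptions made only for this illustration.

<<gr-devices-state-sketch, message=FALSE>>=
pdf.file <- tempfile(fileext = ".pdf") # temporary file, an assumption
pdf(file = pdf.file, width = 6, height = 5)
active.dev <- names(dev.cur())         # the device just opened is active
active.dev
plot(dist ~ speed, data = cars)
invisible(dev.off())                   # closing the device completes the file
file.exists(pdf.file)
file.remove(pdf.file)
@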

Raster devices produce bitmaps, and \code{width} and \code{height} are specified in most cases in \emph{pixels}.

<<gr-devices-02, message=FALSE>>=
png(filename = "output/my-file.png", width = 600, height = 500)
plot(weight ~ feed, data = chickwts)
dev.off()
@
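
The size of a bitmap can also be given in physical units, in which case a resolution is needed to convert it into pixels. The chunk below is a sketch using parameters \code{units} and \code{res} of \Rfunction{png()}, with a temporary file name assumed only for this illustration. It also assumes that the PNG device is available in the \Rlang installation used.

<<gr-devices-units-sketch, message=FALSE>>=
png.file <- tempfile(fileext = ".png") # temporary file, an assumption
# 6 in x 5 in at 150 pixels per inch: a bitmap of 900 x 750 pixels
png(filename = png.file, width = 6, height = 5, units = "in", res = 150)
plot(weight ~ feed, data = chickwts)
invisible(dev.off())
file.remove(png.file)
@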

The approach of direct output to a software device is used in base \Rlang by \Rfunction{plot()} and its companions \Rfunction{text()}, \Rfunction{lines()}, and \Rfunction{points()}. \Rfunction{plot()} outputs a graph, and the other three functions can add elements to it. The addition of plot components, as shown below, is done directly to the output device, i.e., when output is to the computer screen the partial plot is visible at each step.

<<gr-devices-03, message=FALSE>>=
png(filename = "output/my-file.png", width = 600, height = 500)
plot(dist ~ speed, data = cars)
text(x = 10, y = 110, labels = "some texts to be added")
dev.off()
@%
\pagebreak

This is not the only approach available in \Rpgrm for building complex plots. As we will see in chapter \ref{chap:R:plotting} on page \pageref{chap:R:plotting}, an alternative approach is to build a \emph{plot object} as a list of member components, that can be saved as any other \Rlang object. This object functions as a ``recipe'' that is later rendered as a whole on a graphical device by calling \code{print()} to display it.

\index{data!exploration at the R console|)}
\index{data sets!their storage|)}

\section{Further Reading}
For\index{further reading!using the R language} further reading on the aspects of \Rlang discussed in the current chapter, I suggest the book \citetitle{Matloff2011} \autocite{Matloff2011}, with emphasis on the \Rlang language and programming. The new, open-source book \citetitle{Gagolewski2023} \autocite{Gagolewski2023} provides a free alternative. This book also covers base \Rlang plotting, giving more advanced examples than \textit{Learn R: As a Language}. An in-depth description of plotting and graphic devices in \Rlang is available in the book \citetitle{Murrell2019} \autocite{Murrell2019}.

<<container-chapter-cleanup, include=FALSE>>=
rm(list = setdiff(ls(pattern="*"), to.keep))
@

<<eval=eval_diag, include=eval_diag, echo=eval_diag, cache=FALSE>>=
knitter_diag()
R_diag()
other_diag()
@