
% !Rnw root = appendix.main.Rnw
<<echo=FALSE, include=FALSE>>=
opts_chunk$set(opts_fig_wide)
opts_knit$set(concordance=TRUE)
opts_knit$set(unnamed.chunk.label = 'functions-chunk')
@

\chapter{Base \Rlang: Adding New ``Words''}\label{chap:R:functions}

\begin{VF}
Computer Science is a science of abstraction---creating the right model for a problem and devising the appropriate mechanizable techniques to solve it.

\VA{Alfred V. Aho and Jeffrey D. Ullman}{\emph{Foundations of Computer Science}, 1992}\nocite{Aho1992}
\end{VF}

%\dictum[Alfred V. Aho, Jeffrey D. Ullman, \emph{Foundations of Computer Science}, Computer Science Press, 1992]{Computer Science is a science of abstraction---creating the right model for a problem and devising the appropriate mechanizable techniques to solve it.}\vskip2ex

\section{Aims of This Chapter}

In earlier chapters we have only used base \Rlang features. In this chapter you will learn how to expand the range of features available. I start by discussing how to define and use new functions, operators, and classes: what their semantics are and how they contribute to the conciseness and reliability of computer scripts and programs. Later I focus on using existing packages to share extensions to \Rlang and touch briefly on how they work. I do not consider the important, but more advanced, question of packaging functions and classes into new \Rlang packages. Instead I discuss how packages are installed and used.

\section{Defining Functions and Operators}\label{sec:script:functions}
\index{functions!defining new}\index{operators!defining new}

\emph{Abstraction} can be defined as separating the fundamental properties from the accidental ones. Say obtaining the mean from a given vector of numbers is an actual operation. There can be many such operations on different numeric vectors, each one a specific case. When we describe an algorithm for computing the mean from any numeric vector, we formulate an abstraction of \emph{mean}. In the same way, each time we separate operations from specific data we create a new abstraction. In this sense, functions are abstractions of operations or actions; they are like ``verbs'' describing actions separately from actors.

The main role of functions is that of providing an abstraction allowing us to avoid repeating blocks of code (groups of statements) applying the same operations on different data. The reasons to avoid repetition of similar blocks of code statements are that 1) if the algorithm or implementation needs to be revised---e.g., to fix a bug or error---it is best to make edits in a single place; 2) sooner or later pieces of repeated code can become different, leading to inconsistencies and hard-to-track bugs; 3) abstraction and division of a problem into smaller chunks greatly help with keeping the code understandable to humans; 4) textual repetition makes the script file longer, and this makes debugging, commenting, etc., more tedious and error-prone; 5) with well-defined input and output, functions facilitate testing.

How does one, in practice, avoid repeating bits of code? One writes a function containing the statements that would need to be repeated, and later one \emph{calls} (``uses'') the function in their place. We have been calling \Rlang functions or operators in almost every example in this book; what we will next tackle is how to define new functions of our own.

The diagram in section \ref{sec:script:compound:statement} on page \pageref{sec:script:compound:statement} describes a compound statement. A function is a code statement, simple or compound, that is partly isolated from the enclosing environment. The \emph{function} abstraction relies on formal parameters working as placeholders for arguments within the function body. When the function is called (or ``used'') values are passed as arguments to the parameters, and used when executing the code within the function.

New functions and operators are defined using function \Rfunction{function()}, and saved like any other object in \Rlang by assignment to a variable name. In the example below, \code{x} and \code{y} are both formal parameters, or names used within the function for objects that will be supplied as \emph{arguments} when the function is called.

Function \code{fun1()} has two formal parameters, \code{x} and \code{y}.

<<fun-00a>>=
fun1 <- function(x, y){x * y}
@

When we call \code{fun1()} with \code{4} and \code{3} as arguments, the computation that takes place is \code{4 * 3} and the value returned is \code{12}. In this example, the returned value or result is printed, but it could have been assigned to a variable or used in further computations within the calling statement.

<<fun-00b>>=
fun1(x = 4, y = 3)
@

\begin{playground}
What is the computation that takes place in these function calls?

<<fun-00c, eval=eval_playground>>=
fun1(x = 10, y = 50)
fun1(x = 10, y = 50) * 3
@
\end{playground}

\begin{warningbox}
Even though the statements within the function body do have access to objects defined outside the function, in the environment where the function was defined (in scripts usually also the environment from which it is called), it is safest to pass all input through the function parameters, and return all values to the caller. This ensures that the users of the function can treat it as a black box with no side effects.
\end{warningbox}

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.7cm]
\node (call) [startstop] {\textsl{arguments $\to$ \textcolor{blue}{parameters}}};
\node (enc) [enclosure, color = blue, fill = blue!5, below of=call, yshift=-0.85cm] {\ };
\node (stat1) [process, color = blue, fill = blue!15, below of=call] {\code{<statement A>}};
\node (stat2) [process, color = blue, fill = blue!15, below of=stat1] {\code{<statement B>}};
\node (return) [startstop, below of=stat2] {\textsl{\textcolor{blue}{returned  value} $\to$ caller}};
\draw [arrow, color = blue] (call) -- (stat1);
\draw [arrow, color = blue] (stat1) -- (stat2);
\draw [arrow, color = blue] (stat2) -- (return);
\end{tikzpicture}
\end{small}
  \caption[Diagram of function with no side effects]{Diagram of function with no side effects, seen as a compound code statement receiving its input as arguments passed to its formal parameters and returning an object or value to the statement from where it was called or run. The body of the function is represented by the filled box.}\label{fig:function:diagram}
\end{figure}

In \Rlang, statements within a function usually do not directly affect any variable defined outside the function; the result of the computation is returned as a value. The diagram in Figure \ref{fig:function:diagram} describes a function that has no \emph{side effects}: it does not affect its environment, it only returns a value to the caller, a value over which the caller has full control. The statement that calls the function ``decides'' what to do with the value received from the function.

\begin{figure}
  \centering
\begin{small}
\begin{tikzpicture}[node distance=1.7cm]
\node (call) [startstop] {\textsl{arguments $\to$ \textcolor{blue}{parameters}}};
\node (enc) [enclosure, color = blue, fill = blue!5, below of=call, yshift=-0.85cm] {\ };
\node (stat1) [process, color = blue, fill = blue!15, below of=call] {\code{<statement A>}};
\node (stat2) [process, color = blue, fill = blue!15, below of=stat1] {\code{<statement B>}};
\node (sideeff) [process, color = black, fill = yellow!20, right of=stat2, xshift=3cm] {\textsl{\textcolor{blue}{side effect}}};
\node (return) [startstop, below of=stat2] {\textsl{\textcolor{blue}{returned  value} $\to$ caller}};
\draw [arrow, color = blue] (call) -- (stat1);
\draw [arrow, color = blue] (stat1) -- (stat2);
\draw [arrow, color = blue] (stat2) -- (sideeff);
\draw [arrow, color = blue] (stat2) -- (return);
\end{tikzpicture}
\end{small}
  \caption[Diagram of function with side effects]{Diagram of a function with a side effect, seen as a compound code statement receiving its input as arguments passed to its formal parameters and returning an object or value to the statement from where it was called or run. The body of the function is represented by the box filled in blue, while the side effect, an action of the code in the function that directly affects something outside it, is represented by the box filled in yellow.}\label{fig:function:side:effect:diagram}
\end{figure}

When a function has a side effect, the caller is no longer in full control (Figure \ref{fig:function:side:effect:diagram}). Side effects can be actions that do not alter any object in the calling code, like when a call to \Rfunction{print()} displays text or numbers. A side effect can also be an assignment that modifies an object ``outside the function'', such as assigning a new value to a variable in the caller's environment.

A function can return only one object, so when multiple results are produced they need to be collected into a single object. In many cases, lists are used to collect all the values to be returned into one \Rlang object. For example, model fit functions like \code{lm()}, discussed in section \ref{sec:stat:LM} on page \pageref{sec:stat:LM}, return lists with multiple heterogeneous members, plus ancillary information stored in several attributes. In the case of \Rfunction{lm()} the returned object's class is \Rclass{lm}, and its mode is \Rclass{list}.
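
As a minimal sketch of this approach, the function below collects three summaries into a single named list. The name \code{range\_stats()} and the names of the list members are hypothetical, chosen only for this illustration.

<<fun-sketch-list>>=
# hypothetical helper: several results returned as members of one list
range_stats <- function(x) {
  list(min = min(x), max = max(x), n = length(x))
}
stats.ls <- range_stats(c(1, 2, 3, -5))
stats.ls$min    # members are extracted by name
str(stats.ls)
mode(stats.ls)
@

The caller receives one object and decides which of its members to use.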

\begin{playground}
When function \Rcontrol{return()} is called within a function, the flow of execution within the function stops and the argument passed to \Rcontrol{return()} is the value returned by the function call. In contrast, if function \Rcontrol{return()} is not explicitly called, the value returned by the function call is that returned by the last statement \emph{executed} within the body of the function. Run these examples, and your own variations.

\label{chunck:print:funs}
<<fun-02, eval=eval_playground>>=
FN1 <- function(x) print("prn")
FN1("arg")
FN2 <- function(x){print("prn")
                   return(x)}
FN2("arg")
FN3 <- function(x){return(x)
                   print("prn")}
FN3("arg")
FN4 <- function(x){return()
                   print("prn")}
FN4("arg")
FN5 <- function(x){return(print(x))
                   print("prn")}
FN5("arg")
@
\end{playground}

In base \Rlang, arguments\index{functions!arguments} to functions are passed by copy. This is something important to remember. If code in a function's body modifies the value of a parameter (the placeholder for an argument), the object passed as argument, e.g., a variable in the calling code, is not affected.

<<fun-01>>=
fn2 <- function(x){x <- 99}
a <- 1
fn2(a)
a
@

\begin{warningbox}
In some other computer languages, arguments can be passed by reference, meaning that assignments to a formal parameter within the body of the function are back-referenced to the argument and modify it. It is possible to imitate such behaviour in \Rlang using some language trickery and, consequently, some functions in \Rlang occasionally use this approach.
\end{warningbox}
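
One way of obtaining behaviour similar to passing by reference, shown here only as a sketch, relies on environments, which unlike vectors and lists are not copied when passed as arguments. The names \code{counter.env} and \code{increment()} are made up for this example, and this is not necessarily the mechanism used by any particular existing function.

<<fun-sketch-env>>=
counter.env <- new.env()
counter.env$count <- 0
# hypothetical function: modifies in place the environment it receives
increment <- function(env) {
  env$count <- env$count + 1
  invisible(env)
}
increment(counter.env)
increment(counter.env)
counter.env$count
@

Because the object modified is the one passed by the caller, this is a side effect, and the caveats about side effects apply.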

Functions have their own \emph{scope}. Any new variables created by normal assignment within the body of a function are visible only within the body of the function and are destroyed when the function returns from the call. In normal use, functions in \Rlang do not affect their environment through side effects.

\begin{warningbox}
Functions can be defined and called without giving them a name. This is common when the function is simple and called only once. Anonymous functions are frequently used together with \emph{apply functions}, as a definition passed directly as an argument to parameter \code{FUN} (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}).

<<fun-10>>=
(function(x, y){x * y})(x = 4, y = 3)
@

A new terse notation for defining functions was introduced in \Rlang 4.1.0, with \Rfunction{\textbackslash()} as a synonym of \Rfunction{function()}. This is intended to make code concise, and especially useful for anonymous or lambda functions. However, I think this notation should be used sparingly, and possibly only at the \Rlang console. I have not used \Rfunction{\textbackslash()} in code examples in the book, except for the one below.

<<fun-11>>=
(\(x, y){x * y})(x = 4, y = 3)
@

\end{warningbox}

\subsection{Scope of names}
\index{names and scoping}\index{scoping rules}\index{namespaces}
Scoping in \Rlang is implemented using \emph{environments} and \emph{name spaces}. We can think of environments as having a boundary with asymmetric visibility. The code within a function runs in its own environment, in isolation from the calling environment in relation to assignments, but the values stored in objects in the calling environment can be retrieved. This protects from unintentional side effects by making it difficult to overwrite object definitions in the calling environment. It is possible to override this protection with operator \Roperator{<<-} or with function \Rfunction{assign()}. However, assignments made as side effects can make the code much more difficult to read and debug, so it is best to avoid them.

\begin{warningbox}
  Parameters and local variables are not read-only, they behave like normal variables within the body of the function. However, assignments made using the operator \Roperator{<-}, only affect a local copy that is destroyed when the function returns.
\end{warningbox}
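
The difference between the two assignment operators can be illustrated with a small sketch; the variable and function names are arbitrary. The second function is included only to show the mechanism, not as a recommended practice.

<<scope-sketch-assign>>=
running.total <- 0
# '<-' creates a local variable; the outside object is untouched
add_local <- function(x) {running.total <- running.total + x}
# '<<-' assigns to the existing variable outside the function
add_global <- function(x) {running.total <<- running.total + x}
add_local(5)
running.total
add_global(5)
running.total
@

In both functions the value of \code{running.total} is retrieved from outside the function; they differ only in where the result of the addition is stored.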

The visibility of names is determined by the \emph{scoping rules} of a language. The clearest, but not the only situation when scoping rules matter, is when objects with the same name coexist. In such a situation, one will be accessible by its unqualified name and the other hidden but possibly accessible by qualifying the name with the namespace where it is defined.

As the \Rlang language has few reserved words for which no redefinition is allowed, we should take care not to accidentally reuse names that are part of the language. For example, \code{pi} is a constant defined in \Rlang with the value of the mathematical constant $\pi$. If we use the same name for one of our variables, the original definition is hidden and can no longer be normally accessed.

<<scope-01>>=
pi
pi <- "apple pie"
pi
rm(pi)
pi
exists("pi")
@

In the example above, the two variables are not defined in the same scope. In the example below, we assign a new value to a variable we have earlier created within the same scope, and consequently the second assignment overwrites, rather than hides, the existing definition.\qRscoping{exists()}

<<scope-02>>=
my.pie <- "raspberry pie"
my.pie
my.pie <- "apple pie"
my.pie
rm(my.pie)
exists("my.pie")
@

Name spaces play an important role in avoiding name clashes when contributed packages are attached (see section \ref{sec:packages:work} on page \pageref{sec:packages:work}).
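
As a sketch of how a hidden definition can be reached by qualifying its name, we can use operator \Roperator{::} with the name of the package where the object is defined, here \pkgname{base}.

<<scope-sketch-qualify>>=
pi <- "apple pie"  # hides the constant defined in package 'base'
pi
base::pi           # the qualified name still retrieves the constant
rm(pi)             # removes our variable, unhiding the original
pi
@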

\begin{explainbox}
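Environments can be explicitly created with function \Rfunction{new.env()}, while function \Rfunction{environment()} returns the current environment or the one associated with a function. However, these functions are rarely used in scripts while they can be useful within packages.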
\end{explainbox}

\subsection{Ordinary functions}\label{sec:functions:sem}\label{sec:ordinary:functions}
\index{functions!defining new}

After the toy examples above, we will define a small but useful function: a function for calculating the standard error of the mean from a numeric vector. The standard error is given by $S_{\bar{x}} = \sqrt{S^2 / n}$, where $S^2$ is the sample variance and $n$ the number of observations. We can translate this into the definition of an \Rlang function called \code{SEM()}.

<<fun-03>>=
SEM <- function(x){sqrt(var(x) / length(x))}
@

As a test, we call \Rfunction{SEM()} first with \code{a} and then with \code{a.na} as argument.

<<fun-04>>=
a <- c(1, 2, 3, -5)
a.na <- c(a, NA)
SEM(x = a)
SEM(x = a.na)
@

Our function \code{SEM()} never returns a wrong answer because \code{NA} values in its input always result in \code{NA} being returned. The downside is that, unlike \Rlang's functions such as \code{var()}, \Rfunction{SEM()} does not support omitting \code{NA} values.

Adding \code{na.rm} as a second parameter and passing the argument it receives to the call to \Rfunction{var()} within the body of \code{SEM()} is not enough. To avoid returning wrong values, \code{NA} values should also be removed before counting the number of observations with \code{length()}. A good alternative is to define the function as follows.

<<fun-05>>=
sem <- function(x, na.rm = FALSE) {
 if (na.rm) {
   x <- na.omit(x)
 }
 sqrt(var(x)/length(x))
}
@

<<fun-06>>=
sem(x = a)
sem(x = a.na)
sem(x = a.na, na.rm = TRUE)
@

\Rlang does not provide a function for standard error, so the function above is generally useful. Its user interface is consistent with that of functionally similar existing functions. We have added a new word to the \Rlang vocabulary available to us.

In the definition of \code{sem()} we set a default argument for parameter \code{na.rm}, which is used unless the user explicitly passes an argument to this parameter.
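
The sketch below illustrates how default arguments interact with the matching of arguments to parameters by position and by name. Function \code{rescaled()} and its parameter \code{k} are hypothetical, defined only for this example.

<<fun-sketch-args>>=
# hypothetical function with a default argument for parameter 'k'
rescaled <- function(x, k = 10) {x * k}
rescaled(1:3)              # the default is used for 'k'
rescaled(1:3, 2)           # arguments matched by position
rescaled(k = 2, x = 1:3)   # arguments matched by name, in any order
@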

%In addition if names of the parameters are supplied arguments can be passed in any order. If parameter names are not supplied arguments are matched to parameters based on their position. Once one parameter name is given, all later arguments need also to be explicitly named.

%We can assign to a variable defined `outside' a function with operator \code{<<-} but the usual recommendation is to avoid its use. This type of effects of calling a function are frequently called `side-effects'.

\begin{playground}
Define your own function to calculate the mean in a similar way as \Rfunction{SEM()} was defined above. Hint: function \Rfunction{sum()} could be of help.
\end{playground}
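
To see why passing \code{na.rm} on to \Rfunction{var()} alone is not enough, consider the sketch below. Function \code{sem\_naive()} is a hypothetical variant written only for this comparison, and \code{obs} is made-up data.

<<fun-sketch-na>>=
# hypothetical variant that only passes na.rm on to var()
sem_naive <- function(x, na.rm = FALSE) {
  sqrt(var(x, na.rm = na.rm) / length(x))
}
obs <- c(1, 2, 3, -5, NA)
sem_naive(obs, na.rm = TRUE)   # length(x) still counts the NA
sqrt(var(obs[1:4]) / 4)        # computed from the four valid observations
@

The variance is computed from four observations but divided by five, so the naive variant underestimates the standard error.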

Within an expression, a function name followed by parentheses is interpreted as a call to the function, while the bare name of a function returns its definition (as with any other \Rlang object). If the name is entered as a statement at the \Rpgrm console, its value is printed.

We first print (implicitly) the definition of our function from earlier in this section.

<<fun-07>>=
sem
@

Next, we print the definition of \Rlang's standard deviation function \code{sd()}.

<<fun-08>>=
sd
@

As can be seen at the end of the printouts, these functions written in the \Rlang language have been byte-compiled so that they execute faster. We can also see that the definition of \code{sd()} resides in \code{namespace:stats} because it has been attached from package \pkgname{stats}.

Functions that are part of the \Rlang language, but that are not coded using the \Rlang language, are called primitives and their full definition cannot be accessed through their name (cf.\ \code{sem()} defined above and \code{sd()}, with \code{list()} below).

<<fun-09>>=
list
@
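
Whether a function is a primitive can also be tested with \Rfunction{is.primitive()}, as in this brief sketch.

<<fun-sketch-primitive>>=
is.primitive(list)
is.primitive(sd)
is.primitive(`+`)
@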

\subsection{Operators}\label{sec:operator:functions}
\index{operators!defining new}

Operators are functions that use a different syntax for being called. If their name is enclosed in back ticks they can be called as ordinary functions. Binary operators like \code{+} have two formal parameters, and unary operators like unary \code{-} have only one formal parameter. The parameters of many binary \Rlang operators are named \code{e1} and \code{e2}. This is just a convention, not enforced by the \Rlang language.

<<oper-01>>=
1 / 2
`/`(1 , 2)
`/`(e1 = 1 , e2 = 2)
@

\Kern{1}{An important consequence of the possibility of calling operators using ordinary syntax is that operators can be used as arguments to \emph{apply} functions in the same way as ordinary functions. When passing operator names as arguments to \emph{apply} functions, we only need to enclose them in back ticks (see section \ref{sec:data:apply} on page \pageref{sec:data:apply}).}
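
As a brief sketch of this use, operators enclosed in back ticks are passed below as arguments to \Rfunction{sapply()} and \Rfunction{mapply()}; the numeric values are arbitrary.

<<oper-sketch-apply>>=
sapply(1:3, `+`, 10)    # adds 10 to each member of the vector
mapply(`-`, 4:6, 1:3)   # member-by-member subtraction
@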

The name by itself and enclosed in back ticks allows us to access the definition of an operator.

<<oper-02>>=
`/`
@

\begin{explainbox}
\textbf{Defining a new operator.} We will define a binary operator (taking two arguments) that subtracts from the numbers in a vector the mean of another vector. First, we need a suitable name, but we have less freedom as names of user-defined operators must be enclosed in percent signs. We will use \code{\%-mean\%} and as with any \emph{special name}, we need to enclose it in quotation marks for the assignment.

<<oper-EB01>>=
"%-mean%" <- function(e1, e2) {
  e1 - mean(e2)
}
@

We can then use our new operator in an example.

<<oper-EB02>>=
10:15 %-mean% 1:20
@

To print the definition, we enclose the name of our new operator in back ticks---i.e., we \emph{back quote} the special name.

<<oper-EB03>>=
`%-mean%`
@

\end{explainbox}

\section{Objects, Classes and Methods}\label{sec:script:objects:classes:methods}\label{sec:methods}
\index{objects}\index{classes}\index{methods}\index{object-oriented programming}
\index{S3 class system}\index{classes!S3 class system}\index{methods!S3 class system}
New classes are normally defined within packages rather than in user scripts. To be really useful, implementing a new class involves defining not only the class but also a set of specialised functions or \emph{methods} that implement operations on objects belonging to the new class. Nevertheless, an understanding of how classes work is important even if a user will only very occasionally define a new method for an existing class within a script.

Classes are abstractions, but abstractions describing the shared properties of ``types'' or groups of similar objects. In this sense, classes are abstractions of ``actors'', they are like ``nouns'' in natural language. What we obtain with classes is the possibility of defining multiple versions of functions (or \emph{methods}) sharing the same name but tailored to operate on objects belonging to different classes. We have already been using methods with multiple \emph{specialisations} throughout the book, for example, \code{plot()} and \code{summary()}.

We start with a quotation from \citebooktitle{Burns1998} \autocite[][page 13]{Burns1998}.
\begin{quotation}
The idea of object-oriented programming is simple, but carries a lot of weight.
Here's the whole thing: if you told a group of people ``dress for work,'' then
you would expect each to put on clothes appropriate for that individual's job.
Likewise it is possible for S[R] objects to get dressed appropriately depending on
what class of object they are.
\end{quotation}

We say that specific methods are \emph{dispatched} based on the class of the argument passed. This, together with the loose type checks of \Rlang, allows writing code that functions as expected on different types of objects, e.g., character and numeric vectors.

\Rlang has good support for the object-oriented programming paradigm but, as a system that has evolved over the years, it currently supports multiple approaches. The approach that is still the most popular is called S3, while a more recent and powerful approach, with slower performance, is called S4. The general idea is that a name like ``plot'' can be used as a generic name and that the specific version of \Rfunction{plot()} called depends on the arguments of the call. Using computing terms we could say that the \emph{generic} of \Rfunction{plot()} dispatches the original call to different specific versions of \Rfunction{plot()} based on the class of the arguments passed. S3 generic functions dispatch, by default, based only on the argument passed to a single parameter, the first one. S4 generic functions can dispatch the call based on the arguments passed to more than one parameter, and the structure of the objects of a given class is known to the interpreter. In S3, the specialisations of a generic are recognised only by their name, and the class of an object by a character vector stored as an attribute of the object (see section \ref{sec:calc:attributes} on page \pageref{sec:calc:attributes} about attributes).

We first explore one of the methods already available in \Rlang. The definition of \code{mean} shows that it is the generic for a method.

<<object-classes-00>>=
mean
@

We can find out which specialisations of a method are available in the current search path using \Rfunction{methods()}.

<<object-classes-00a>>=
methods(mean)
@

We can also use \Rfunction{methods()} to query all methods, including operators, defined for objects of a given class.

<<object-classes-00b>>=
methods(class = "list")
@

\begin{explainbox}
S3 class information is stored as a character vector in an attribute named \code{"class"}. The most basic approach to the construction (= creation) of an object of a new S3 class, is to add the new class name to the \code{class} attribute of the object. As the implied class hierarchy is given by the order of the members of the character vector, the name of the new class must be added at the head of the vector. Even though this step can be done as shown here, in practice this step would normally take place within a \emph{constructor} function and the new class, if defined within a package, would need to be registered. We show here this bare-bones example only to demonstrate how S3 classes are implemented in \Rlang.

<<explain-object-classes-01>>=
a <- 123
class(a)
class(a) <- c("myclass", class(a))
class(a)
@

Now we create a print method specific to \code{"myclass"} objects. Internally we are using function \Rfunction{sprintf()} and for the format template to work we need to pass a \code{numeric} value as an argument---i.e., obviously \Rfunction{sprintf()} does not ``know'' how to handle objects of the class we have just created!

<<explain-object-classes-02>>=
print.myclass <- function(x, ...) {
    # '...' keeps the signature consistent with the print() generic
    sprintf("[myclass] %.0f", as.numeric(x))
}
@

Once a specialised method exists for a class, it will be used for objects of this class.

<<explain-object-classes-03>>=
print(a)
print(as.numeric(a))
@

Adding the name \code{"derivclass"} to the head of the \code{class} character vector makes object \code{b} a member of both classes, \code{"myclass"} and \code{"derivclass"}, where \code{"derivclass"} is derived from \code{"myclass"}. As \code{"derivclass"} is at position \code{1}, it is this object's \emph{most derived class}.

<<explain-object-classes-04>>=
b <- 456
class(b) <- c("derivclass", class(a))
@

As a specialised \code{print()} method is not available for \code{"derivclass"}, the method for \code{"myclass"}, the next class name along the vector, is called.

<<explain-object-classes-05>>=
print(b)
@

<<explain-object-classes-05a>>=
print(as.numeric(b))
@
\end{explainbox}
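
As mentioned in the box above, the class attribute is in practice usually set within a \emph{constructor} function. A minimal sketch of such a constructor follows; the class name \code{"temperature"} and the function name \code{new\_temperature()} are hypothetical, and a real constructor would typically do more thorough validation.

<<explain-object-sketch-constructor>>=
# hypothetical constructor: checks the input and sets the class attribute
new_temperature <- function(x) {
  stopifnot(is.numeric(x))
  structure(x, class = c("temperature", class(x)))
}
temp.obj <- new_temperature(21.5)
class(temp.obj)
inherits(temp.obj, "temperature")
@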

\begin{warningbox}
 The S3 class system is ``lightweight'' in that it adds very little additional computation load, but it is rather ``fragile'' in that most of the responsibility for consistency and correctness of the design (e.g., not messing up dispatch by redefining functions or loading a package exporting functions with the same name) rests with the programmer and is not checked by the \Rlang interpreter.
\end{warningbox}
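
The dispatch mechanism described in this section can be sketched by defining a new S3 generic together with a default method and one specialisation. The generic \code{describe()} is hypothetical and serves only as an illustration; data frame \code{cars} is from package \pkgname{datasets}.

<<explain-object-sketch-generic>>=
# hypothetical generic: UseMethod() dispatches on the class of 'x'
describe <- function(x, ...) {
  UseMethod("describe")
}
# called when no more specific method is found
describe.default <- function(x, ...) {
  paste("object of class", class(x)[1])
}
# specialisation for data frames
describe.data.frame <- function(x, ...) {
  paste("data frame with", nrow(x), "rows and", ncol(x), "columns")
}
describe(1:5)
describe(cars)
@

The specialisations are found only through their names, constructed as the name of the generic followed by a dot and the name of the class.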

%Defining a new S3 generic\index{generic method!S3 class system} is also simple. A generic method and a default method need to be created.
%
%<<explain-object-classes-04>>=
%my_print <- function (x, ...) {
%   UseMethod("my_print", x)
% }
%
%my_print.default <- function(x, ...) {
%   print(class(x))
%   print(x, ...)
%}
%@
%
%<<explain-object-classes-05>>=
%my_print(123)
%my_print("abc")
%@
%
%Up to now, \Rfunction{my\_print()}, has no specialisation. We now write one for data frames.
%
%<<explain-object-classes-06>>=
%my_print.data.frame <- function(x, rows = 1:5, ...) {
%   print(x[rows, ], ...)
%   invisible(x)
%}
%@
%
%We add the second statement so that the function invisibly returns the whole data frame, rather than the lines printed. We now do a quick test of the function.
%
%<<explain-object-classes-07>>=
%my_print(cars)
%@
%
%<<explain-object-classes-07a>>=
%my_print(cars, 8:10)
%@
%
%<<explain-object-classes-07b>>=
%b <- my_print(cars)
%str(b)
%nrow(b) == nrow(cars) # was the whole data frame returned?
%@
%
%%\begin{playground}
%%1) What would be the most concise way of defining a \code{my\_print()} specialization for \code{matrix}? Write one, and test it.
%%2) How would you modify the code of your \code{my\_print.matrix()} so that also the columns to print can be selected?
%%\end{playground}
%%
%\end{explainbox}

\section{Packages}\label{sec:script:packages}

\subsection{Sharing of \Rlang-language extensions}
\index{extensions to R}
The most elegant way of adding new features or capabilities to \Rlang is through packages. A package can contain any, several or all of \Rlang function and operator definitions, data objects, classes, and their methods, plus the corresponding documentation. Some packages available through \CRAN contain only one or two \Rlang objects while others contain hundreds of them. After loading and attaching a package, the objects that the package exports can be used as if they were part of \Rlang itself.

Packages are, without doubt, the best mechanism for sharing extensions to \Rlang. Moreover, in most situations, packages are very useful for managing code that will be reused by a single person over time. \Rlang packages have strict rules about their contents, file structure, and documentation, which makes it possible among other things for the package documentation to be merged into \Rpgrm's help system when a package is loaded. With a few exceptions, packages can be written so that they will work on any computer where \Rpgrm runs.

\begin{explainbox}
In a ``source package'', the code written in \Rlang, and possibly in other programming languages, is contained in text files that are compressed together into a single archive file. In a ``binary package'' the source code is already processed into a form suitable for faster installation. Binary package files are specific to each major version of \Rlang, operating system, and computer architecture. In addition to being slower, package installation from sources can require additional software, such as compilers. A compiler translates the text representation of a computer program written in \Clang, \Cpplang, \langname{FORTRAN}, etc., into machine code, i.e., instructions for the computer hardware. \Rlang code is compiled into instructions for a virtual machine, part of \Rlang, that does the final translation into machine code at runtime.
\end{explainbox}

For distribution, a single compressed archive file is used for each package. Packages can be shared as source- or binary-code files, sent for example through e-mail. However, the largest public repository of \Rpgrm packages is called \CRAN (\url{https://cran.r-project.org/}), an acronym for Comprehensive R Archive Network. Packages available through \CRAN are guaranteed to work, in the sense of not failing any tests built into the packages and not crashing or aborting prematurely. They are tested daily, as they may depend on other packages whose code will change when updated. The number of packages available through \CRAN at the time of printing (\Sexpr{lubridate::today()}) was \Sexpr{signif(nrow(available.packages(repos = "https://cran.rstudio.com/")), 3)}.

A key repository for bioinformatics with \Rlang is Bioconductor\index{Bioconductor} (\url{https://www.bioconductor.org/}), which contains packages that pass strict quality tests and adds a further 3\,400 packages. rOpenSci\index{rOpenSci} has established guidelines and a system of code peer review for \Rlang packages. These peer-reviewed packages are available through \CRAN or other repositories and are listed at the rOpenSci website (\url{https://ropensci.org/}). Occasionally, one may have, or want, to install packages or updates that are not yet in \CRAN, either from the R Universe (\url{https://r-universe.dev/}) repositories, or from Git repositories (e.g., from GitHub).

A good way of learning how the extensions provided by a package work is to experiment with them. When using a function we are not yet familiar with, looking at its help to check all its features expands our ``toolbox''. While documentation of exported objects is enforced, many packages include, in addition, comprehensive user guides or articles as \emph{vignettes}. It is not unusual to decide which package to use from a set of alternatives based on its documentation. In the case of packages adding extensive new functionality, they may be documented in depth in a book. Well-known examples are \citebooktitle{Pinheiro2000} \autocite{Pinheiro2000} and \citebooktitle{Wickham2016} \autocite{Wickham2016}.

\subsection{Download, installation and use}\label{sec:packages:install}

\index{packages!using}
In \Rlang speak, ``library'' is the location where packages are installed. Packages are sets of functions, and data, specific for some particular purpose, that can be loaded into an \Rlang session to make them available so that they can be used in the same way as built-in \Rlang functions and data. Function \Rfunction{library()} is used to load and attach packages that are already installed in the local \Rlang library. In contrast, function \Rfunction{install.packages()} is used to install packages.
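
The location of the library in use and its contents can be queried from within \Rpgrm. The chunk below is a small sketch (not run here, as the output depends on the local installation); the variable name \code{pkgs} is only used for illustration.

<<pkg-lib-paths, eval=FALSE>>=
.libPaths()                            # folders searched for installed packages
pkgs <- rownames(installed.packages()) # names of the installed packages
length(pkgs)                           # how many packages are installed
"stats" %in% pkgs                      # is a given package installed?
@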

\begin{warningbox}
The instructions below assume that the user has access to repositories on the internet and enough user rights to install packages. This is rarely the case in organisations using strict security protocols. In such cases, the organisation may keep a mirror of \CRAN in the intranet. The local/user's private \Rpgrm library can be kept in a folder where the user has writing and reading rights.
\end{warningbox}

\begin{faqbox}{How to install or update a package from CRAN?}
\CRAN is the default repository for \Rlang packages. If you use \RStudio or another IDE as a front end on any operating system or \pgrmname{RGUI} under \pgrmname{MS-Windows}, installation and updates can be done through a menu or GUI button. These menus use calls to \Rfunction{install.packages()} and \Rfunction{update.packages()} behind the scenes.

Alternatively, at the \Rpgrm command line, or in a script, \Rfunction{install.packages()} can be called with the name of the package as an argument. For example, to install package \pkgname{learnrbook} one can use

<<pkg-00, eval=FALSE>>=
install.packages("learnrbook")
@

\noindent
and to update already installed packages

<<pkg-00x, eval=FALSE>>=
update.packages()
@
\end{faqbox}

\begin{faqbox}{How to install or update a package from GitHub?}
Package \pkgname{remotes} makes it possible to install packages directly from \GitHub, \Bitbucket and other repositories based on \pgrmname{Git}. The code in the next chunk (not run here) can be used to install the latest, possibly still under development, version of package \pkgname{learnrbook}.

<<remotes-00y, eval=FALSE>>=
remotes::install_github("aphalo/learnrbook-pkg")
@
\end{faqbox}

\begin{explainbox}
Function \Rfunction{pkg\_install()} from \pkgname{pak} can install packages, both from \CRAN and Bioconductor repositories, and from \pgrmname{Git} repositories. The same function can be used to update specific already installed packages and dependencies.

<<pak-00z, eval=FALSE>>=
pak::pkg_install("learnrbook") # from CRAN
pak::pkg_install("aphalo/learnrbook-pkg") # from GitHub
@
\end{explainbox}

\Rpgrm packages can be installed either from sources, or from already built ``binaries''. Installing from sources, depending on the package, may require additional software to be available. This is because some \Rlang packages contain source code in other languages such as \Clang, \Cpplang or \langname{FORTRAN} that needs to be compiled into machine code during installation. Under \pgrmname{MS-Windows}, the needed shell, commands, and compilers are not available as part of the operating system. Installing them is not difficult as they are available prepackaged in an installer under the name \pgrmname{RTools} (available from \CRAN). \pgrmnameTwo{\hologo{MiKTeX}}{MiKTeX} is usually needed to build the PDF of the package's manual.

Under \pgrmname{MS-Windows}, it is easier to install packages from binary \texttt{.zip} files than from \texttt{.tar.gz} source files. For \pgrmname{macOS} (Apple Mac) the situation is similar, with binaries available both for Intel and ARM (M1, M2 series) processors. Most, but not all, Linux distributions include in the default setup the tools needed for installation of \Rlang packages. Under Linux it is rather common to install packages from sources, although package binaries have recently become more easily available.

If the tools are available, packages can be easily installed from sources from within \RStudio. However, binaries are for most packages also readily available. In \CRAN, the binary for a new version of a package becomes available with a delay of one or two days compared to the source. For packages that need compilation, the installation from sources takes more time than installation from binaries.

\begin{advplayground}
Use \code{help} to look up the help page for \Rfunction{install.packages()}, and explore how to control whether the package is installed from a source or a binary file. Also explore how to install a package from a file on a local disk instead of from a repository like \CRAN.
\end{advplayground}

Frequently the README file of a package includes instructions on how to install it from \CRAN or another online repository. Exceptionally, packages may additionally require the installation of software outside \Rpgrm before their installation and/or use. When present, these requirements are always listed in the DESCRIPTION file under \code{SystemRequirements:} and explained in more detail in the README file. In \CRAN, each package has a home web page that can be easily found if one knows the name of the package, e.g., \url{https://CRAN.R-project.org/package=learnrbook}. Nowadays, it is common for the help for a package to be also available as a web site, e.g., \url{https://docs.r4photobiology.info/learnrbook/}.
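
The fields of the DESCRIPTION file of an installed package can also be queried from within \Rpgrm with function \Rfunction{packageDescription()}. The chunk below is a sketch (not run here) that assumes package \pkgname{learnrbook} is installed; the fields requested are just examples, and fields missing from the file are returned as \code{NA}.

<<pkg-description-fields, eval=FALSE>>=
packageDescription("learnrbook",
                   fields = c("Version", "Depends", "SystemRequirements"))
@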

\begin{faqbox}{How to change the repository used to install packages?}
Function \Rfunction{setRepositories()} can be used to enable other repositories in addition to or instead of \CRAN during an \Rpgrm session. In recent versions of \Rpgrm, the default list of repositories is taken from \Rlang option \code{"repos"} if defined. Consult \code{help("setRepositories")} for the details.

Alternatively, one can use function \Rfunction{pkg\_install()} from package \pkgname{pak} as this function attempts to automatically set the correct repository based on the name of the package.
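
Option \code{"repos"} can also be inspected and set directly, and \Rfunction{install.packages()} accepts a repository through its parameter \code{repos}. The chunk below is a sketch (not run here) that assumes the ``cloud'' mirror of \CRAN is reachable from your computer; replace the URL with that of the repository or mirror you need.

<<pkg-set-repos, eval=FALSE>>=
getOption("repos") # repositories currently in use
# assumption: this CRAN mirror is accessible from your computer
options(repos = c(CRAN = "https://cloud.r-project.org/"))
# a repository can also be given for a single call
install.packages("learnrbook", repos = "https://cloud.r-project.org/")
@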
\end{faqbox}

\begin{faqbox}{How to use an installed package?}
To use the functions and other objects defined in a package, the package must first be loaded, and for the names of these objects to be visible in the user's workspace, the package needs to be attached. Function \Rfunction{library()} loads and attaches one package at a time. For example, to load and attach package \pkgname{learnrbook} we use

<<pkg-00a, eval=FALSE>>=
library("learnrbook")
@
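
In scripts shared with other users, it can be useful to test whether a package is installed before attempting to attach it. One possible approach, shown as a sketch below (not run here), is to call \Rfunction{requireNamespace()}, which returns \code{TRUE} or \code{FALSE} instead of triggering an error. The text of the message is only an example.

<<pkg-require-namespace, eval=FALSE>>=
if (requireNamespace("learnrbook", quietly = TRUE)) {
  library("learnrbook") # installed: load and attach it
} else {
  message("Package 'learnrbook' is not installed.")
}
@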

\end{faqbox}

\begin{faqbox}{How to find the currently installed version of a package?}
Function \Rfunction{packageVersion()} returns the version as an object of class \code{"package\_version"} that can not only be printed, but also meaningfully compared, e.g., to test for a minimum version requirement.

<<pkg-version-01>>=
packageVersion(pkg = "learnrbook")
@
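
A sketch of such a comparison (not run here) follows. I use package \pkgname{stats}, which is part of the \Rlang distribution, and \code{"4.0.0"} as an arbitrary minimum version chosen only for illustration.

<<pkg-version-compare, eval=FALSE>>=
packageVersion("stats") >= "4.0.0" # a logical value
if (packageVersion("stats") < "4.0.0") {
  warning("Please, update R before running this script.")
}
@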
\end{faqbox}

As packages are contributed by independent authors, they should be cited in addition to citing \Rpgrm itself when they are used to obtain results or plots included in publications. \Rlang function \Rfunction{citation()}, when called with the name of a package as its argument, provides the reference that should be cited for the package and, when called without an explicit argument, the reference to cite for the version of \Rlang in use, as shown below.

<<citation-1>>=
citation()
@
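
For a package, the call takes the form sketched in the next chunk (not run here, and assuming that package \pkgname{learnrbook} is installed). Function \Rfunction{toBibtex()} converts the value returned by \Rfunction{citation()} into a \hologo{BibTeX} entry.

<<citation-pkg-sketch, eval=FALSE>>=
citation(package = "learnrbook") # reference to cite for a package
toBibtex(citation())             # reference to R as a BibTeX entry
@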

\begin{playground}
  Look at the help page for function \code{citation()} for a discussion of why it is important that users cite \Rpgrm and packages when using them.
\end{playground}

\begin{warningbox}
Conflicts among packages can easily arise, for example, when they use the same names for objects or functions. These are reported when the packages are attached (see section \ref{sec:packages:work} on page \pageref{sec:packages:work} for a workaround). In addition, many packages use functions defined in packages in the \Rlang distribution itself or other independently developed packages by importing them. Updates to depended-upon packages can ``break'' (make non-functional) the dependent packages or parts of them. The rigorous testing by \CRAN detects such problems in most cases when package revisions are submitted, forcing package maintainers to fix problems before distribution through \CRAN is possible. However, if you use other repositories, I recommend that you make sure that revised (especially if under development) versions do work with your own code, before their use in ``production'' (important) data analyses.
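
Name conflicts can also be listed at any time during a session with function \Rfunction{conflicts()}. The statement below is a sketch (not run here, as its output depends on which packages are attached); with \code{detail = TRUE} the names are grouped by the position in the search path where they are defined.

<<pkg-list-conflicts, eval=FALSE>>=
conflicts(detail = TRUE)
@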
\end{warningbox}

\subsection{Finding suitable packages}

Due to the large number of contributed \Rlang packages, it can sometimes be difficult to find a suitable package for a task at hand. It is good to first check if the necessary capability is already built into base \Rlang. Base \Rlang plus the recommended packages (installed when \Rlang is installed) cover a lot of ground. Analysing data using almost any of the more common statistical methods does not require the use of contributed packages. Sometimes, contributed packages duplicate or extend the functionality in base \Rlang. When one considers the use of novel or specialised types of data analysis, the use of contributed packages can be unavoidable. Even in such cases, it is not unusual to have alternatives to choose from within the available contributed packages. Sometimes groups or suites of packages are designed to work well together.

The \CRAN repository has a very broad scope and includes a section called ``views''. \Rlang views are web pages providing annotated lists of packages frequently used within a given field of research, engineering, or specific applications. These views are maintained by different expert editors. The \Rlang views can be found at \url{https://cran.r-project.org/web/views/}.

The Bioconductor repository specialises in bioinformatics with \Rlang. It also has a section with ``views'' and within it, descriptions of different data analysis workflows. The workflows are especially good as they reveal which sets of packages work well together. These views can be found at \url{https://www.bioconductor.org/packages/release/BiocViews.html}.

\textsf{rOpenSci} \autocite{Ram2019} fosters a culture that values open and reproducible research using shared data and reusable software. One aspect of this is making possible peer-review of \Rlang packages. \textsf{rOpenSci} does not keep a separate package repository for the peer-reviewed packages; instead, it keeps an index at \url{https://ropensci.org/packages/}. The packages included have become more diverse, but initially the main focus was on facilitating access to open data sources.

The \CRAN repository keeps an archive of earlier versions of packages, on an individual package basis. This is also important for long-term reproducibility.
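
One way of installing an archived version, shown as a sketch below (not run here), is function \Rfunction{install\_version()} from package \pkgname{remotes}. The version number in the example is a placeholder that I made up for illustration; it needs to be replaced by a version actually listed in the archive for the package.

<<remotes-install-version, eval=FALSE>>=
# assumption: "1.0.0" is a placeholder, not necessarily a released version
remotes::install_version("learnrbook", version = "1.0.0")
@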

\subsection{How packages work}\label{sec:packages:work}

\Rlang packages define all objects within a \emph{namespace} with the same name as the package itself. Loading and attaching a package with \Rfunction{library()} makes visible only the exported objects. Attaching a package adds these objects to the search path so that they can be accessed without prepending the name of the namespace. Most packages do not export all the functions and objects defined in their code; some are kept internal, in most cases, to avoid making a commitment about their availability in future versions, which could constrain further development.
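
The names exported by a namespace can be listed with function \Rfunction{getNamespaceExports()}. The chunk below is a small sketch (not run here) using package \pkgname{stats}, which is part of the \Rlang distribution; the variable name \code{exported} is only used for illustration.

<<pkg-namespace-exports, eval=FALSE>>=
exported <- getNamespaceExports("stats")
length(exported)           # number of exported objects
"median" %in% exported     # is 'median' exported?
stats::median(c(1, 5, 9))  # call using the qualified name
@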

\begin{explainbox}
Package namespaces can be detached and also unloaded with function \Rscoping{detach()} using a slightly different notation for the argument from that which we described for data frames in section \ref{sec:calc:df:with} on page \pageref{sec:calc:df:with}. This is very seldom needed, but one case I have come across is a package that redefines a generic function or a method from a package it imports, thus preventing the normal use of a third package that depends on the original definition of the generic.
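
A minimal sketch of this notation follows (not run here, and assuming that package \pkgname{learnrbook} is installed). The name of the package is prefixed with \code{package:}, and with \code{unload = TRUE} an attempt is made to also unload the namespace.

<<pkg-detach-sketch, eval=FALSE>>=
library("learnrbook")                       # load and attach
detach("package:learnrbook", unload = TRUE) # detach and try to unload
@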
\end{explainbox}

When we reuse a name defined in a package, its definition in the package does not get overwritten, but instead, only hidden. These hidden objects remain accessible using the name \emph{qualified} by prepending the name of the package followed by two colons, e.g., \code{base::mean()}.

If two packages define objects with the same name, then which one is visible depends on the order in which the packages were attached with \Rfunction{library()}. To avoid confusion in such cases, in scripts it is best to use the qualified names for calling objects defined with the same name in two packages. Using the qualified name for an object from an already attached package is inconsequential for its interpretation by \Rpgrm, but can enhance the readability of the code.
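
As a small sketch of the two notations (not run here), the chunk below calls function \Rfunction{sd()} from package \pkgname{stats}, which is attached by default, first by its bare name and then by its qualified name. The vector \code{x} is made up for illustration. As long as no other definition of \code{sd} hides the one in \pkgname{stats}, both calls use the same function.

<<pkg-qualified-names, eval=FALSE>>=
x <- c(2, 4, 6, 8)
sd(x)        # found by searching along the search path
stats::sd(x) # qualified name: always the definition in 'stats'
@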

\begin{warningbox}
If one uses a qualified name for an object but does not attach the package with a call to \Rfunction{library()}, the package is only loaded. In other words, the names of the exported objects are not added to the search path, but the code defining them is retrieved and available using qualified names.
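
The difference can be checked as sketched below (not run here). I use package \pkgname{tools} from the \Rlang distribution as an example, assuming that it has not been attached earlier in the session; the file name is made up.

<<pkg-loaded-not-attached, eval=FALSE>>=
tools::file_ext("report.pdf") # qualified name, no call to library()
isNamespaceLoaded("tools")    # is the namespace loaded?
"package:tools" %in% search() # is the package attached?
@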
\end{warningbox}

Some functions that are part of \Rlang are collected into packages grouped by category: \pkgname{base}, \pkgname{stats}, \pkgname{datasets}, etc., and can be called when needed using qualified names. We can find out the search order by calling \Rfunction{search()}, with the search starting at the \code{".GlobalEnv"} for statements evaluated at the \Rlang command line.
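
A minimal sketch of these queries follows (not run here, as the output depends on the packages attached in the session). Function \Rfunction{find()} reports the positions in the search path where a name is defined.

<<pkg-search-path, eval=FALSE>>=
search()     # the search path, starting at ".GlobalEnv"
find("mean") # where along the search path is 'mean' defined?
@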

\begin{playground}
Namespaces isolate the names defined within them from those in other namespaces. This helps prevent name clashes, and makes it possible to access objects even when they are ``hidden'' by a different object with the same name.

<<pkg-01, eval=eval_playground>>=
class(cars)
head(cars, 3)
getAnywhere("cars")$where # defined in package
@

<<pkg-01a, eval=eval_playground>>=
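cars <- 1:10
class(cars)
head(cars, 3) # prints 'cars' defined in the global environment
getAnywhere("cars")$where # the first visible definition is in the global environment
rm(cars) # clean up
head(cars, 3) # prints 'cars' defined in the package, visible again
getAnywhere("cars")$where # only the definition in the package remains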
@

\end{playground}

\begin{warningbox}
In the playground above, I used a data frame object, but the same mechanisms apply to all \Rlang objects including functions. The situation when one of the definitions is a function and the other is not, is slightly different in that a call using parenthesis notation will distinguish between a function and an object of the same name that is not a function. In any case, relying on this distinction is very confusing and, thus, a bad idea.

<<pkg-02a>>=
mean
@

<<pkg-02b>>=
mean <- mean(1:5)
mean
mean(8:9)
@

<<pkg-02c>>=
getAnywhere("mean")$where
rm(mean)
getAnywhere("mean")$where
@

In this last example, \code{rm(mean)} removed the variable we had assigned a value to. Package namespaces protect the objects defined in the package from deletion or overwriting. This is different to defining a new object with the same name, which is allowed. The two statements below trigger errors and are not evaluated when typesetting the book.

<<pkg-03, eval=FALSE>>=
datasets::cars <- "my car is green"
rm(datasets::cars)
@

The value returned by \Rfunction{getAnywhere()} contains more information than that in its member \code{where}. Do have a look at its help page with \code{help(getAnywhere)} for the details.

\end{warningbox}

\section{Further Reading}

Several\index{further reading!object oriented programming in R} books describe in detail the different class systems available and how to use them in \Rlang. For an in-depth treatment of the subject please consult the books \citebooktitle{Wickham2019} \autocite{Wickham2019} and \citebooktitle{Chambers2016} \autocite{Chambers2016}.

\index{further reading!package development}The development of \Rlang packages is accessibly explained in the book \citebooktitle{Wickham2023} \autocite{Wickham2023}, using a practical approach and tools developed by the author and his collaborators. The book \citebooktitle{Chambers2016} \autocite{Chambers2016} has its focus on \Rlang itself, how it works, and how to develop extensions both with simple and challenging goals.

<<eval=eval_diag, include=eval_diag, echo=eval_diag, cache=FALSE>>=
knitter_diag()
R_diag()
other_diag()
@