---
title: "9_24_24_classnotes"
author: "Kate Gordon"
date: "2024-09-24"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Code written by the person who made the function f()
f <- function(num) {
hello <- "Hello, world!\n"
for (i in seq_len(num)) {
cat(hello)
}
chars <- nchar(hello) * num
chars
}
## Code written by the user of f(): the function has no default value for "num", so this call produces an error
f()
## Users are forced to specify a value for the argument "num"
f(num = 2)
f(2)
f("a")
## Create a version of f() with a default value for "num"
## (here we have the default as 1)
f <- function(num = 1) {
hello <- "Hello, world!\n"
for (i in seq_len(num)) {
cat(hello)
}
chars <- nchar(hello) * num
chars
}
## Using f() with no "num" in our global environment
stopifnot(!"num" %in% ls()) ## formally check that "num"
## doesn't exist
f() ## "num" will be created inside of the function f()
## with the default value of 1
f(num = 3) ## Use the code inside f() with num = 3
num <- 4
f() ## Even though "num" exists in the global environment,
## "num" inside f() still uses the default value of 1
## We could have called our argument "number_of_iterations"
## instead of "num"
f <- function(number_of_iterations = 1) {
hello <- "Hello, world!\n"
for (i in seq_len(number_of_iterations)) {
cat(hello)
}
chars <- nchar(hello) * number_of_iterations
chars
}
f("a")
f(0)
f(1)
f <- function(number_of_iterations = 1) {
## Check that the user provided the right type of input
stopifnot(is.numeric(number_of_iterations))
stopifnot(number_of_iterations >= 1)
hello <- "Hello, world!\n"
for (i in seq_len(number_of_iterations)) {
cat(hello)
}
chars <- nchar(hello) * number_of_iterations
chars
}
f("a")
f(0)
f(1)
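As a quick check of the validation above, the sketch below re-creates the last version of f() and wraps the failing calls in try(), which is base R and lets the remaining lines run after an error. The use of try() and the names res_chr and res_zero are additions for this example, not part of the class code.

```{r}
## Same function as above, repeated so this chunk is self-contained
f <- function(number_of_iterations = 1) {
    stopifnot(is.numeric(number_of_iterations))
    stopifnot(number_of_iterations >= 1)
    hello <- "Hello, world!\n"
    for (i in seq_len(number_of_iterations)) {
        cat(hello)
    }
    nchar(hello) * number_of_iterations
}

## try() captures the error and lets the chunk keep running
res_chr <- try(f("a"), silent = TRUE) # fails the is.numeric() check
res_zero <- try(f(0), silent = TRUE) # fails the >= 1 check
class(res_chr)
class(res_zero)

## A valid call prints the greeting twice and returns the character count
f(2)
```

Checking inputs at the top of the function means the user gets an error before any output is printed.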
# Arguments
## ------------------------------------------------------------------------------------------------------------------------
```{r}
str(rnorm)
mydata <- rnorm(100, 2, 1) ## Generate some data
```
## ------------------------------------------------------------------------------------------------------------------------
```{r}
## Positional match first argument, default for 'na.rm'
sd(mydata)
## Specify 'x' argument by name, default for 'na.rm'
sd(x = mydata)
## Specify both arguments by name
sd(x = mydata, na.rm = FALSE)
```
## ------------------------------------------------------------------------------------------------------------------------
```{r}
## Specify both arguments by name. This way you can switch positions and get the same value
sd(na.rm = FALSE, x = mydata)
```
## ------------------------------------------------------------------------------------------------------------------------
```{r}
## Once 'na.rm' has been matched by name, R matches the remaining unnamed
## argument by position to the first unmatched argument, 'x'
sd(na.rm = FALSE, mydata)
```
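To confirm that these calling styles are equivalent, here is a small sketch that compares the results with identical(). The seed value and the object names below are arbitrary choices for this example.

```{r}
set.seed(123) # arbitrary seed so the example is reproducible
mydata <- rnorm(100, 2, 1)

by_position <- sd(mydata)
by_name <- sd(x = mydata, na.rm = FALSE)
swapped <- sd(na.rm = FALSE, x = mydata)
mixed <- sd(na.rm = FALSE, mydata)

## All four calls match the arguments the same way, so the results agree
identical(by_position, by_name)
identical(by_name, swapped)
identical(swapped, mixed)
```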
## ------------------------------------------------------------------------------------------------------------------------
```{r}
args(f) #shows default values of function
```
## ------------------------------------------------------------------------------------------------------------------------
Below is the argument list for the lm() function, which fits linear models to a dataset.
```{r}
args(lm)
```
The following two calls are equivalent.
```{r}
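## Illustrative only: these calls assume 'mydata' is a data frame with
## columns 'x' and 'y' (the numeric vector 'mydata' created above will not work)
lm(data = mydata, y ~ x, model = FALSE, 1:100)
lm(y ~ x, mydata, 1:100, model = FALSE)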
```
The most readable form is the second call:
lm(y ~ x, mydata, 1:100, model = FALSE)
## ------------------------------------------------------------------------------------------------------------------------
**Functions are for humans and computers**
As you start to write your own functions, it’s important to keep in mind that functions are not just for the computer, but are also for humans. Technically, R does not care what your function is called, or what comments it contains, but these are important for human readers.
**The name of a function is important**
In an ideal world, you want the name of your function to be short but clearly describe what the function does. This is not always easy, but here are some tips.
Function names should be verbs, and arguments should be nouns. There are some exceptions:

- Nouns are OK if the function computes a very well-known noun (e.g. mean() is better than compute_mean()).
- A good sign that a noun might be a better choice is if you are using a very broad verb like “get”, “compute”, “calculate”, or “determine”.

Use your best judgement and do not be afraid to rename a function if you figure out a better name later.
```{r, eval=FALSE}
## Too short
f()

## Not a verb, or descriptive
my_awesome_function()

## Long, but clear
impute_missing()
collapse_years()
```
**snake_case vs camelCase**
If your function name is composed of multiple words, use “snake_case”, where each lowercase word is separated by an underscore.
“camelCase” is a popular alternative. It does not really matter which one you pick, the important thing is to be consistent: pick one or the other and stick with it.
R itself is not very consistent, but there is nothing you can do about that. Make sure you do not fall into the same trap by making your code as consistent as possible.
```{r}
# Never do this! Do not mix snake_case and camelCase in the same code;
# be consistent
col_mins <- function(x, y) {}
rowMaxes <- function(x, y) {}
```
**Use a common prefix**
If you have a family of functions that do similar things, make sure they have consistent names and arguments.
Use a common prefix to indicate that they are connected. That is better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.
```{r, eval=FALSE}
# Good
input_select()
input_checkbox()
input_text()

# Not so good
select_input()
checkbox_input()
text_input()
```
**Avoid overriding existing functions**
Where possible, avoid overriding existing functions and variables.
It is impossible to do in general because so many good names are already taken by other packages, but avoiding the most common names from base R will avoid confusion.
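```{r, eval=FALSE}
# Don't do this!
T <- FALSE # T is shorthand for TRUE; reassigning it can break code for people later
c <- 10 # c() is used universally for combining values
mean <- function(x) sum(x) # masks base::mean(), so later calls to mean() would return sums
```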
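If a base R name does get overwritten by accident, the original is still available. The sketch below is an illustration added here rather than part of the class code. It masks mean() on purpose, shows that base::mean() still works, and then removes the masking copy with rm(). The object names masked, original and restored are made up for this example.

```{r}
mean <- function(x) sum(x) # masks base::mean() in the current environment
masked <- mean(1:4) # now returns the sum, 10
original <- base::mean(1:4) # the namespace prefix still reaches the real mean()
rm(mean) # remove our copy; mean() refers to base::mean() again
restored <- mean(1:4)
c(masked = masked, original = original, restored = restored)
```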
**Use comments**
Comments are lines starting with #. They can explain the “why” of your code.
You generally should avoid comments that explain the “what” or the “how”. If you can’t understand what the code does from reading it, you should think about how to rewrite it to be more clear.
Do you need to add some intermediate variables with useful names?
Do you need to break out a subcomponent of a large function so you can name it?
However, your code can never capture the reasoning behind your decisions:
Why did you choose this approach instead of an alternative?
What else did you try that didn’t work?
It’s a great idea to capture that sort of thinking in a comment.
You can look up function names and their uses with apropos() or help().
**Environment**
The last component of a function is its environment.
This is not something you need to understand deeply when you first start writing functions. However, it’s important to know a little bit about environments because they are crucial to how functions work.
The environment of a function controls how R finds the value associated with a name.
For example, take this function:
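A minimal sketch of such a function, using hypothetical names (add_y and y) that are made up for this illustration. It assumes the chunk runs at the top level, as it does when the document is knit.

```{r}
## Hypothetical names: 'add_y' and 'y' are made up for this illustration
add_y <- function(x) {
    ## 'y' is not defined inside the function, so R looks for it in the
    ## environment where add_y() was defined
    x + y
}

y <- 100
first <- add_y(10)

y <- 1000
second <- add_y(10)

c(first, second)
```

The same call gives a different answer after y changes, which is why relying on values from outside the function should be done with care.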
**Tips for exploring functions**
1. Remember what package something came from
```{r}
help(package = "usethis")
```
2. Look at function names listed in help
3. Look at examples under functions in help
4. You can also keep a list of packages and their functions in your own files
**Vectorization**
For and while loops are useful and easy to understand, but in R we rarely use them.
As you learn more R, you will realize that vectorization is preferred over for-loops since it results in shorter and clearer code.
Vector arithmetic
**Rescaling a vector**
In R, arithmetic operations on vectors occur element-wise. For a quick example, suppose we have height in inches:
```{r}
inches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
```
and we want to convert to centimeters.
```{r}
inches * 2.54
```
Notice what happens when we multiply inches by 2.54:
[1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80
In the line above, each element of inches was multiplied by 2.54.
You can do the same with addition or subtraction in R, without using loops.
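The same element-wise behavior applies to subtraction. As a small illustration, taking 69 inches as an arbitrary reference height (the name diff_from_69 is made up for this example):

```{r}
inches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)

## Subtract 69 from every element, no loop needed
diff_from_69 <- inches - 69
diff_from_69
```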
**Two vectors**
If we have two vectors of the same length, and we sum them in R, they will be added entry by entry as follows:
```{r}
x <- 1:10
y <- 1:10
x + y
```
R added the vectors element by element: x[1] + y[1] = 1 + 1 = 2, x[2] + y[2] = 2 + 2 = 4, and so on.
That is, we don’t write a for loop that takes each element of x and each element of y, adds them up, then puts them all in a single vector.
The same holds for other mathematical operations, such as -, * and /.
```{r}
x <- 1:10
y <- 1:10
x * y
```
The same is true for functions like sqrt(): pass it a vector and R applies the function to every element.
```{r}
x <- 1:10
sqrt(x)
```
This shows the square root of 1, the square root of 2, the square root of 3, and so on.
In many other programming languages, you have to write a "for loop" to do this type of operation.
In words: for each element of x, compute the sum of the ith element of x and the ith element of y, and save it to the ith element of the result.
In code:
for (i in seq_along(x)) {
result[i] <- x[i] + y[i]
}
```{r}
## 1. Check that x and y have the same length
stopifnot(length(x) == length(y))

## 2. Create our result object (an integer vector of the right length)
result <- vector(mode = "integer", length = length(x))

## 3. Loop through each element of x and y, calculate the sum,
## then store it in 'result'
for (i in seq_along(x)) {
    result[i] <- x[i] + y[i]
}

## 4. Check that we got the same answer as the vectorized x + y
identical(result, x + y)
```
**Functional loops**
While "for loops" are perfectly valid, when you work in an element-wise fashion there is often no need to write them, because we can use what are called functional loops.
**Sometimes we have complicated inputs, and we want to apply the same function to each of those inputs.**
Functional loops are functions that help us apply the same function to each entry in a vector, matrix, data frame, or list. Here is a list of them in base R:

- lapply(): Loop over a list and evaluate a function on each element
- sapply(): Same as lapply() but try to simplify the result
- apply(): Apply a function over the margins of an array
- tapply(): Apply a function over subsets of a vector
- mapply(): Multivariate version of lapply() (won’t cover)

An auxiliary function split() is also useful, particularly in conjunction with lapply().
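As a minimal sketch of two functions from the list above, apply() and tapply(): the matrix m and the result names are made up for this illustration, and airquality comes from the datasets package that ships with R.

```{r}
## apply(): run a function over the rows (margin 1) or columns (margin 2)
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)
col_sums <- apply(m, 2, sum)
row_sums
col_sums

## tapply(): run a function over subsets of a vector defined by a grouping vector
wind_by_month <- tapply(airquality$Wind, airquality$Month, mean)
wind_by_month
```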
**Define a function**
The function is called my_sum().
It takes 2 arguments, "a" and "b", and inside it simply adds them.
```{r}
my_sum <- function(a, b) {
    a + b
}

## Same but with an extra check to make sure that 'a' and 'b'
## have the same lengths.
my_sum <- function(a, b) {
    ## Check that a and b are of the same length; if not, stop with an
    ## error instead of letting R recycle the shorter vector
    stopifnot(length(a) == length(b))
    a + b
}
```
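The length check matters because, without it, R silently recycles the shorter vector. A short sketch of the difference follows. It re-creates my_sum() so the chunk is self-contained; the use of tryCatch() to capture the error message and the names recycled and msg are additions for this example.

```{r}
my_sum <- function(a, b) {
    stopifnot(length(a) == length(b))
    a + b
}

## Plain '+' recycles the shorter vector: 1:2 is reused as 1, 2, 1, 2
recycled <- 1:4 + 1:2
recycled

## my_sum() refuses inputs of different lengths
msg <- tryCatch(my_sum(1:4, 1:2), error = function(e) conditionMessage(e))
msg

## Inputs of equal length work as before
my_sum(1:3, 4:6)
```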
**Use Roxygen Skeleton**
roxygen2 provides a syntax that can be used to document functions.
Document how to use your function with roxygen2.
Extra: in RStudio you can insert a template via Code (or the magic wand icon) -> Insert Roxygen Skeleton.
The skeleton lists the Title and gives space for a description and details as 3 separate paragraphs.
In the example below, "@param a" asks "what is the value of a?" and "@param b" asks "what is the value of b?". "@return" asks you to document the output that this function gives back (returns) to the user.
"@export" asks "do you want to share this function?" Typically, yes, so you don't have to do anything for that line.
Then, under "@examples", provide any code that you would like to use to show how the function works.
```{r}
#' Title
#'
#' @param a
#' @param b
#'
#' @return
#' @export
#'
#' @examples
my_sum <- function(a, b) {
## Check that a and b are of the same length
stopifnot(length(a) == length(b))
a + b
}
```
```{r}
#' Title
#'
#' Description
#'
#' Details
#'
#' @param a What is `a`?
#' @param b What is `b`?
#'
#' @return What does the function return?
#' @export ## Do we want to share this function? yes!
#'
#' @examples
#' ## How do you use this function?
my_sum <- function(a, b) {
## Check that a and b are of the same length
stopifnot(length(a) == length(b))
a + b
}
```
Full example of documenting a function
```{r}
#' Sum two vectors
#'
#' This function does the element wise sum of two vectors.
#'
#' It really is just an example function that is powered by the `+` operator
#' from [base::Arithmetic].
#'
#' @param a An `integer()` or `numeric()` vector of length `L`.
#' @param b An `integer()` or `numeric()` vector of length `L`.
#'
#' @return An `integer()` or `numeric()` vector of length `L` with
#' the element-wise sum of `a` and `b`.
#' @export
#'
#' @examples
#' ## Generate some input data
#' x <- 1:10
#' y <- 1:10
#'
#' ## Perform the element wise sum
#' my_sum(x, y)
my_sum <- function(a, b) {
## Check that a and b are of the same length
stopifnot(length(a) == length(b))
a + b
}
```
**More on understanding functional loops**
Apply your function
Here, since we have two input vectors, we need to use mapply(), which is one of the more complex functions. Note the unusual order of the arguments: the function we will mapply() over comes before the inputs to said function.
**mapply() = multivariate apply**
This is the function to use when your function has 2 or more inputs that you want to iterate over.
The syntax is:
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
It takes the argument FUN (your function), followed by an ellipsis (...) where you can specify as many inputs to your function as you need, and then a few other arguments that control how mapply() behaves.
mapply() applies the function, in this case my_sum(), to the 1st element of x and the 1st element of y, then to the 2nd element of x and the 2nd element of y, then to the 3rd elements, and so on. So you don't have to write a for loop.
However, + is already vectorized, so you don't actually need mapply() to compute this sum.
```{r}
## Check the arguments to mapply()
args(mapply)
```
Example where x and y are both 1:10:
```{r}
## Apply mapply() to our function my_sum() with the inputs 'x' and 'y'
mapply(sum776::my_sum, x, y)
```
[1] 2 4 6 8 10 12 14 16 18 20
This reads as: from the package sum776, take the function my_sum() and mapply() it over the input vectors x and y.
Here x and y are both the integers 1 to 10.
```{r}
## Or write an anonymous function that is:
## * not documented
## * not tested
## * not shared
##
## :( If you're using this your function won't be documented, tested or shared
mapply(function(a, b) {
a + b
}, x, y)
```
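Two small checks may help here. The sketch below defines my_sum() locally, rather than loading it from the sum776 package, so that it runs on its own. It confirms that mapply() gives the same answer as the vectorized x + y, and it shows mapply() with three inputs, using an anonymous function made up for this example.

```{r}
my_sum <- function(a, b) {
    stopifnot(length(a) == length(b))
    a + b
}
x <- 1:10
y <- 1:10

## mapply() calls my_sum(x[1], y[1]), my_sum(x[2], y[2]), and so on
same_as_plus <- identical(mapply(my_sum, x, y), x + y)
same_as_plus

## With three inputs, the function receives one element from each
three_way <- mapply(function(a, b, c) a + b + c, 1:3, 4:6, 7:9)
three_way
```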
**purrr alternative**
The purrr package, which is part of the tidyverse, provides an alternative framework to the apply family of functions in base R.
With purrr, both the developer and the user know what the inputs and outputs are expected to be.
purrr has a lot of map functions.
The idea is: I have all these different inputs, I map those inputs to my function, and then I get a result.
```{r}
library("purrr") ## part of tidyverse
## Check the arguments of map2_int()
args(purrr::map2_int) #map2 = 2 inputs, int = integer vector as an output
```
In this particular case, we have 2 inputs and the output is an integer vector, so the appropriate purrr function to use is map2_int().
The "2" means that we have 2 inputs, and the "_int" suffix is short for integer.
Example:
```{r}
## Apply our function my_sum() to our inputs
purrr::map2_int(x, y, sum776::my_sum)
```
[1] 2 4 6 8 10 12 14 16 18 20
In this case, we're using x and y as inputs to our function my_sum().
The order of the arguments changes compared to mapply(): here we provide the x and y inputs first, and the function at the end.
```{r}
## You can also use anonymous functions
purrr::map2_int(x, y, function(a, b) {
a + b
})
```
```{r}
## purrr even has a super short formula-like syntax
## where .x is the first input and .y is the second one
purrr::map2_int(x, y, ~ .x + .y)
```
Above are anonymous function examples. Anonymous functions are not documented, not tested, and not shared.
purrr also has a very short formula-like syntax, which uses the tilde operator (~) together with ".x" and ".y".
".x" represents the 1st input of your map function call, and ".y" represents the second one.
This has nothing to do with the inputs being called x and y: you still use ".x" and ".y" when the inputs are, for example, 1:2 and 3:4.
```{r}
## This formula syntax has nothing to do with the objects 'x' and 'y'
purrr::map2_int(1:2, 3:4, ~ .x + .y)
```
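One consequence of the typed map functions is that they fail loudly when the output is not of the promised type. A sketch, assuming purrr is installed: dividing the inputs produces decimals, so map2_int() stops with an error, while map2_dbl() returns a numeric vector. The names int_attempt and dbl_result are made up for this example.

```{r}
library("purrr")

## 1 / 3 and 2 / 4 are not whole numbers, so map2_int() refuses to return them
int_attempt <- try(purrr::map2_int(1:2, 3:4, ~ .x / .y), silent = TRUE)
class(int_attempt)

## map2_dbl() is the matching function for double (numeric) output
dbl_result <- purrr::map2_dbl(1:2, 3:4, ~ .x / .y)
dbl_result
```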
**Base R loops**
**lapply() = list apply**
The M in mapply() was for "multivariate"; the L in lapply() is for "list".
lapply() takes a list as input, applies a function to each element of that list, and returns a list.
Its first argument is X, your input, and its second argument is your function (FUN).
```{r}
lapply ## typing the name without parentheses prints the function
```
function (X, FUN, ...)
{
FUN <- match.fun(FUN)
if (!is.vector(X) || is.object(X))
X <- as.list(X)
.Internal(lapply(X, FUN))
}
<bytecode: 0x15c1335d0>
<environment: namespace:base>
The lapply() function does the following simple series of operations:

1. it loops over a list, iterating over each element in that list
2. it applies a function to each element of the list (a function that you specify)
3. and returns a list (the l in lapply() is for “list”).

This function takes three arguments: (1) a list X; (2) a function (or the name of a function) FUN; (3) other arguments via its ... argument. If X is not a list, it will be coerced to a list using as.list().
The body of the lapply() function is printed above.
f <- function(a, b) {
a^2
}
f(2)
**Here’s an example of lapply()** applying the mean() function to all elements of a list. If the original list has names, the names will be preserved in the output.
Here a is the numbers 1 through 5 and b is 10 randomly generated numbers.
```{r}
x <- list(a = 1:5, b = rnorm(10))
x
```
```{r}
lapply(x, mean)
```
Notice that here we are passing the mean() function as an argument to the lapply() function.
We see the mean of the "a" values and the mean of the "b" values as output.
Similarly, the purrr map() function takes a list as input and returns a list.
But if you use map_dbl() (dbl is short for double, which is how numbers with decimals are stored), you get a numeric vector with those means instead.
```{r}
purrr::map_dbl(x, mean)
```
What difference do you notice in terms of the output of lapply() and purrr::map_dbl()?
```{r}
purrr::map(x, mean)
```
Below is another example of using lapply().
This example is longer: we have a list called x whose 1st element, a, has 4 numbers; b has 10 numbers; c has 20 numbers; and d has 100 numbers.
It does not matter that the elements of our list have different lengths: lapply() still works.
```{r}
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)
```
You can use lapply() to evaluate a function multiple times each with a different argument.
Next is an example where I call the runif() function (to generate uniformly distributed random variables) four times, each time generating a different number of random numbers.
In this example, we want to get different lengths
```{r}
x <- 1:4
lapply(x, runif)
```
A lot of the time we think of lapply() as taking only lists as input, but it can also take a vector as input and still return a list.
When you pass a function to lapply(), lapply() takes elements of the list and passes them as the first argument of the function you are applying.
In the above example, the first argument of runif() is n, and so the elements of the sequence 1:4 all got passed to the n argument of runif().
This also works with purrr, where the function is just called map(). It can take a list or a vector as input and returns a list as output.
```{r}
purrr::map(x, runif)
```
Here is where the ... argument to lapply() comes into play. Any arguments that you place in the ... argument will get passed down to the function being applied to the elements of the list.
Below, we want to get between 1 and 4 random numbers from the uniform distribution (runif), but we want them to have a minimum value of 0 and a maximum of 10.
Here, the min = 0 and max = 10 arguments are passed down to runif() every time it gets called.
In runif(), the default for min is 0 and the default for max is 1.
You can specify values for the arguments that you want, and the same values are used for every iteration.
```{r}
x <- 1:4 ## we want between 1 and 4 random numbers
lapply(x, runif, min = 0, max = 10)
```
So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.
The same thing can be done with purrr::map(). It uses the same syntax: the input, the function, then any other named arguments that you want to pass to your function.
```{r}
purrr::map(x, runif, min = 0, max = 10)
```
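A quick way to confirm what was passed down, sketched here with an arbitrary seed: lengths() reports how many numbers each call produced, and range() shows that the values stay within the min and max that were given. The name draws is made up for this example.

```{r}
set.seed(42) # arbitrary seed for reproducibility
draws <- lapply(1:4, runif, min = 0, max = 10)

## Element i holds i random numbers, because i was passed as 'n'
lengths(draws)

## Every value lies between the min and max that were passed through '...'
range(unlist(draws))
```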
**sapply()**
The s in sapply() stands for "simplify".
The sapply() function behaves similarly to lapply(); the only real difference is in the return value. sapply() will try to simplify the result of lapply() if possible. Essentially, sapply() calls lapply() on its input and then applies the following algorithm:

- If the result is a list where every element is length 1, then a vector is returned
- If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
- If it can’t figure things out, a list is returned

This can be a source of many headaches and one of the main motivations behind the purrr package. With purrr you know exactly what type of output you are getting!
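The three simplification cases above can be sketched as follows. The inputs and object names are made up for this illustration.

```{r}
## Every result has length 1: sapply() returns a vector
squares <- sapply(1:3, function(i) i^2)
squares

## Every result has the same length (2): sapply() returns a matrix,
## with one column per input element
pairs <- sapply(1:3, function(i) c(i, i^2))
pairs

## Results have different lengths: sapply() falls back to a list
ragged <- sapply(1:3, seq_len)
ragged
```

The same call can therefore return three different kinds of object depending on the data, which is the behavior the typed purrr functions avoid.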
Here’s the result of calling lapply() on a list whose elements have different lengths.
```{r}
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)
```
If we use **sapply()** instead, we get a vector, because R knows that all of the results are of length one and so can be simplified to a vector.
```{r}
sapply(x, mean)
```
**With purrr**, if I want a list output, I use map().
If I want a double (numeric) output, we can use map_dbl().
```{r}
purrr::map(x, mean)
```
```{r}
purrr::map_dbl(x, mean)
```
**split()**
All of the apply functions can be combined with the split() function.
The split() function takes a vector or other object and splits it into groups determined by a factor or list of factors.
The arguments to split() are
```{r}
str(split)
```
where

- x is a vector (or list) or data frame
- f is a factor (or something that can be coerced to one) or a list of factors; factors are categorical variables
- drop indicates whether empty factor levels should be dropped

The combination of split() and a function like lapply() or sapply() is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying that function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as “map-reduce” in other contexts.
This is how we go from a data frame to a list: we use split(), and once we have a list we can use the apply functions on it.
Say you have a data frame with a column called country.
You can split your data into a list where each element holds the data for one country: country 1, country 2, country 3, etc.
So it's similar to the group_by() function from dplyr.
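A minimal sketch of the country idea, using a hypothetical data frame. The name sales, its columns and its values are all made up for this example.

```{r}
## Hypothetical data: 'sales' and its columns are made up for this example
sales <- data.frame(
    country = c("A", "A", "B", "C", "C"),
    amount = c(10, 20, 5, 1, 3)
)

## One data frame per country
by_country <- split(sales, sales$country)
names(by_country)

## Apply a function to each piece
country_means <- sapply(by_country, function(d) mean(d$amount))
country_means
```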
Here we simulate some data and split it according to a factor variable. Note that we use the gl() function to “generate levels” in a factor variable.
Example: we have 10 random numbers from the normal distribution, 10 random numbers from the uniform distribution, and then 10 random numbers from the normal distribution with a mean of one.
Once the values of x are split by group, we can compute the mean of each group with lapply(), or sapply() if you prefer.
That gives us 3 different means.
```{r}
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10) # generate factor levels 3 groups with 10 observations each
f
```
We used the gl() function to generate levels for 3 different groups.
Each of the groups has 10 observations, so this creates a factor called "f" where the first 10 values are 1, the next 10 values are 2, and the last 10 are 3.
So now we have both a vector input (x) and a factor input (f) of equal length.
```{r}
split(x, f)
```
So we can use the split() function: we split and we get a list.
The 1st element of the list is called $`1`, for the 1st level of the factor (f); the second element is called $`2`, for the second level of f; and so on. Inside each element are the values of "x" that belong to that group.
A common idiom is split() followed by lapply().
Since we split the data by group, we can compute the mean of each group with lapply().
So we get 3 different means, one for each of the 3 groups we created.
These are random numbers, and we didn't specify a seed.
```{r}
lapply(split(x, f), mean)
```
If you run this again, you get different values, because we asked for random numbers without setting a seed.
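To get the same numbers on every run, you can set a seed with set.seed() before generating the data. A sketch with an arbitrary seed value follows. The helper simulate_means() is made up for this example and simply wraps the steps shown above.

```{r}
## Hypothetical helper that repeats the simulate / split / mean steps
simulate_means <- function() {
    x <- c(rnorm(10), runif(10), rnorm(10, 1))
    f <- gl(3, 10)
    sapply(split(x, f), mean)
}

set.seed(20240924) # arbitrary seed
first_run <- simulate_means()
set.seed(20240924) # same seed again
second_run <- simulate_means()

## Same seed, same random numbers, same group means
identical(first_run, second_run)
```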
**Splitting a Data Frame**
This data frame is called airquality and has these variables: Ozone, Solar.R, Wind, Temp, Month, and Day.
```{r}
library("datasets")
head(airquality)
```
**Split data by months**
Month is our grouping variable, and we will split the data so that we have a separate sub-data frame for each month.
We provide the whole data frame "airquality" as the input (x), and for (f) we specify the "Month" variable.
```{r}
s <- split(airquality, airquality$Month)
str(s)
```
We get a list of length 5, because we have data for 5 unique months in the airquality data set. The 1st element is called $`5`, the second $`6`, the 3rd $`7`, the 4th $`8`, and the 5th $`9`.
We can see that each element of this list is a data frame.
Once we've done this, much like a group_by(), we can compute the column means of the variables Ozone, Solar.R, and Wind for each sub-data frame.
We use the function colMeans() for that.
```{r}
lapply(s, function(x) {
colMeans(x[, c("Ozone", "Solar.R", "Wind")])
})
```
We get 3 different means, one for each of those 3 variables. The result is a list with 5 elements named 5, 6, 7, 8, and 9, and each of those elements is a named vector that has the 3 different means.
Using sapply() might be better here, since R recognizes that all of the elements of the output list have the same length.
sapply() gives us a more readable output by putting the results into a matrix.
```{r}
sapply(s, function(x) {
colMeans(x[, c("Ozone", "Solar.R", "Wind")])
})
```
We get a matrix with the 3 different variables as the rows and the 5 different months as the columns.
We see NAs because there are missing observations in the data, so we cannot simply take the means of those variables. However, we can tell the colMeans() function to remove the NAs before computing the mean.
To do so, specify "na.rm = TRUE" inside of colMeans().
Then we get the means after removing the missing observations.
```{r}
sapply(s, function(x) {
colMeans(x[, c("Ozone", "Solar.R", "Wind")],
na.rm = TRUE
)
})
```
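As a sanity check on the matrix above, one cell can be recomputed directly from the data frame. The sketch below repeats the split and sapply() steps so that it is self-contained. The names monthly_means and may_ozone are made up for this example.

```{r}
s <- split(airquality, airquality$Month)
monthly_means <- sapply(s, function(x) {
    colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
})

## Compute the May (Month == 5) Ozone mean directly from the data frame
may_ozone <- mean(airquality$Ozone[airquality$Month == 5], na.rm = TRUE)

## Rows are variables and columns are months, so index the matrix by name
all.equal(unname(monthly_means["Ozone", "5"]), may_ozone)
```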
We can also do this with purrr as shown below.
purrr::map() will return a list with 5 elements, but it is not simplified or combined into a data frame.
```{r}
purrr::map(s, function(x) {
colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
})
```
To combine the results, we can use the currently supported function purrr::list_cbind(), though we first need to do a bit more work behind the scenes: each element of the list has to be a data frame / tibble.
```{r}
## Make sure we get data.frame / tibble outputs for each element
## of the list
purrr::map(s, function(x) {
tibble::as_tibble(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
})
```
**purrr recommends using "list_cbind()" to combine the different columns**
```{r}
## Now we can combine them with list_cbind()
purrr::map(s, function(x) {
    tibble::as_tibble(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
}) %>%
    purrr::list_cbind()
```
```{r}
## And we can then add the actual variable it came from with mutate()
purrr::map(s, function(x) {
tibble::as_tibble(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
}) %>%
purrr::list_cbind() %>%
dplyr::mutate(Variable = c("Ozone", "Solar.R", "Wind"))
```
# A tibble: 3 × 6
`5`$value `6`$value `7`$value `8`$value `9`$value Variable
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 23.6 29.4 59.1 60.0 31.4 Ozone
2 181. 190. 216. 172. 167. Solar.R
3 11.6 10.3 8.94 8.79 10.2 Wind
But none of these versions are tidy, because we don't have one row per observation.
We just have all of the data in the wide format.
So, we'll use the t() function to transpose each vector of means into a one-row matrix, and then purrr::list_rbind() to stack those rows.
```{r}
## Get data.frame / tibble outputs, but with each variable as a separate
## column. Here we used the t() (transpose) function.
purrr::map(s, function(x) {
tibble::as_tibble(t(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)))
})
```
(For comparison, `head(airquality)` shows the first rows of the original data.)
```{r}
## Now we can row bind each of these data.frames / tibbles into a
## single one
purrr::map(s, function(x) {
tibble::as_tibble(t(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)))
}) %>% purrr::list_rbind()
```
```{r}
## Then with mutate, we can add the Month back
purrr::map(s, function(x) {
tibble::as_tibble(t(colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)))
}) %>%
purrr::list_rbind() %>%
dplyr::mutate(Month = as.integer(names(s)))
```
For the above task though, we might prefer to use dplyr functions.
```{r}
## group_by() is in a way splitting our input data.frame / tibble by
## our variable of interest. Then summarize() helps us specify how we
## want to use that data, before it's all put back together into a
## tidy tibble.
airquality %>%
dplyr::group_by(Month) %>%
dplyr::summarize(
Ozone = mean(Ozone, na.rm = TRUE),
Solar.R = mean(Solar.R, na.rm = TRUE),
Wind = mean(Wind, na.rm = TRUE)
)
```
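If typing one line per variable gets repetitive, a possible shorthand is dplyr::across(), which applies the same function to several columns at once. This sketch assumes dplyr version 1.0.0 or later (where across() is available); the name monthly_means is only for illustration.

```{r}
library(dplyr)

## Same monthly means as above, but the function is written only once
monthly_means <- airquality %>%
    group_by(Month) %>%
    summarize(across(c(Ozone, Solar.R, Wind), function(v) mean(v, na.rm = TRUE)))
monthly_means
```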
**tapply**
Table apply
tapply() is used to apply a function over subsets of a vector. It can be thought of as a combination of split() and sapply() for vectors only. I’ve been told that the “t” in tapply() refers to “table”, but that is unconfirmed.
So this is kind of like a shortcut for splitting and then lapplying or sapplying; it's shorthand code for that.
```{r}
str(tapply)
```
Given a vector of numbers, one simple operation is to take group means.
```{r}
## Simulate some data
x <- c(rnorm(10), runif(10), rnorm(10, 1))
## Define some groups with a factor variable
f <- gl(3, 10)
f
```
```{r}
tapply(x, f, mean)
```
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
> tapply(x, f, mean)
1 2 3
-0.4384190 0.4017609 1.2924346
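To make the "split, then sapply" idea concrete, here is a small sketch that computes the group means both ways and compares them. The seed value is arbitrary (it is only there so the simulated data are reproducible), and the names by_tapply and by_split are just for illustration.

```{r}
set.seed(1) ## arbitrary seed, only for reproducibility
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)

## One step with tapply()
by_tapply <- tapply(x, f, mean)
## Two steps: split the vector by group, then apply mean to each piece
by_split <- sapply(split(x, f), mean)

## Both give the same group means (tapply returns a 1-d array,
## sapply a named vector, so we compare the bare values)
all.equal(as.vector(by_tapply), unname(by_split))
```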
**apply()**
The apply() function is used to evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array). However, it can be used with general arrays, for example, to take the average of an array of matrices. Using apply() is not really faster than writing a loop, but it works in one line and is highly compact.
```{r}
str(apply)
```
The arguments to apply() are:

- X is an array
- MARGIN is an integer vector indicating which margins should be “retained”.
- FUN is a function to be applied
- ... is for other arguments to be passed to FUN

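As a small illustration of how MARGIN works, here is a sketch on a tiny matrix and on a 3-dimensional array. The objects m and a are made up for this example. MARGIN = 1 keeps the rows, MARGIN = 2 keeps the columns, and MARGIN = c(1, 2) keeps both and collapses the third dimension.

```{r}
## A 2 x 3 matrix filled column by column with 1..6
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)   ## one value per row: 9 12
col_means <- apply(m, 2, mean) ## one value per column: 1.5 3.5 5.5

## An array of two 2 x 2 matrices; keep dims 1 and 2, average over the 3rd
a <- array(1:8, dim = c(2, 2, 2))
avg_matrix <- apply(a, c(1, 2), mean)
avg_matrix
```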
**Col/Row Sums and Means**
Pro-tip: For the special case of column/row sums and column/row means of matrices, we have some useful shortcuts.
They're very fast versions of applying and computing the sum or the mean across the rows or the columns.

- rowSums = apply(x, 1, sum)
- rowMeans = apply(x, 1, mean)
- colSums = apply(x, 2, sum)
- colMeans = apply(x, 2, mean)

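A quick way to convince ourselves that the shortcuts match apply() is to compare them on a small matrix. This is only a sketch; the matrix m is made up for the example.

```{r}
m <- matrix(c(2, 4, 6, 8, 10, 12), nrow = 2)

## Each comparison should report that the two results agree
same_row_sums <- all.equal(rowSums(m), apply(m, 1, sum))
same_row_means <- all.equal(rowMeans(m), apply(m, 1, mean))
same_col_sums <- all.equal(colSums(m), apply(m, 2, sum))
same_col_means <- all.equal(colMeans(m), apply(m, 2, mean))
c(same_row_sums, same_row_means, same_col_sums, same_col_means)
```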
Here’s an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation.
Let's say we have a vector of observations in x.
From each of those observations we want to subtract a specific mean value,
square the result, and then divide by a specific variance (the standard deviation squared).
We'll call the specific mean "mu" and the specific standard deviation "sigma", so the variance is sigma squared.
This function takes a mean mu, a standard deviation sigma, and some data in a vector x.
mu and sigma are length 1 and x is a numeric vector of any size.
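Written out, the quantity described above would look like the following, assuming the scaled squared deviations are added up over all n observations in x (as the name "sum of squares" suggests):

$$
\sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2}
$$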