---
title: "Mean, Median, and Mode"
output:
  bookdown::html_document2:
    includes:
      in_header: assets/03_mean_image.html
      after_body: assets/foot.html
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, fig.align="center")
library(ggplot2)
#library(fGarch)
#library(gifski)
library(emdbook)
library(plotly)
library(here)
```
:::obj
**Module learning objectives**
1. Describe 3 measures of centrality
1. Explain the mathematical notation used for calculating the mean
1. Write a function which calculates the mean for any vector
1. Write a function which calculates the median for any vector
1. Describe how the reliability of a sample mean will scale with increasing sample size
:::
# What are measures of centrality?
\
```{r, echo=FALSE}
set.seed(12)
x <- rnorm(50, 10, 2)
x2 <- rnorm(50, 18, 1.2)
x <- data.frame(x = x, type = "Island #1")
x2 <- data.frame(x = x2, type = "Island #2")
d <- rbind(x, x2)
# p <- ggplot(data = d, aes(d$x, fill = d$type)) + geom_histogram(binwidth = 1,
# color = "white") + theme_light() + scale_fill_manual(values = c("green3",
# "turquoise3")) + labs(x = "Teacup Giraffe heights", y = "Frequency",
# fill = NULL) + scale_y_continuous(expand = c(0, 0)) + theme(panel.border = element_blank(),
# panel.grid.minor = element_blank(), axis.ticks.y = element_blank(),
# legend.position = c(0.165, 0.92), legend.background = element_blank())
#
# ggsave(filename = "/Users/Desiree/Documents/New R Projects/Cars/p.png",
# width = 5, height = 3, p)
```
```{r, out.width="500px", echo= FALSE}
knitr::include_graphics("images/03_mean/mean_hist.png")
```
\
You've just collected a lot of data and graphed the heights. Although a graph is informative, it does not summarize the data concisely -- we need to describe these heights with a single number that will be meaningful and allow us to do statistics.
We can do this with a **measure of centrality**, the concept that one number in the "center" of the data set is a good summary of all the values. Below are examples of different measures of centrality.
* The **mean** is the average and the measure of centrality that you are probably most familiar with. This is a good measure to use when the data are normally distributed. We describe it in detail below.
* The **median** is the value in the middle of the data set. Half of the observations lie above the median and half below. When the data are normally distributed, the median and the mean will be very close to each other. When your data are not normally distributed (skewed to the left or right) the median is a more appropriate measure of centrality (see the animation below).
\
```{r, fig.show='animate', animation.hook='gifski', fig.width=6, fig.height=2, echo=FALSE, message=FALSE, warning=FALSE, results='hide', interval=0.5, cache=TRUE}
# skew <- seq(0.5, 1, 0.05)
# skew2 <- seq(1.1, 2, 0.1)
# skew3 <- seq(1.9, 1, -0.1)
# skew4 <- seq(0.95, 0.55, -0.05)
# skew <- c(skew, skew2, skew3, skew4)
# plot1 <- function(x) {
# d <- lapply(1:40, function(x) {
# d <- data.frame(x = rsnorm(1e+05, mean = 0, sd = 2, xi = skew[x]),
# frame = x)
# return(d)
# })
# medians <- c(seq(0.31, -0.31, -0.031), seq(-0.279, 0.279, 0.031))
# # medians <<- lapply(1:40, function(x) median(d[[x]]$x))
# p <- lapply(1:40, function(y) ggplot(data = d[[y]], aes(x)) + geom_histogram(binwidth = 0.25,
# color = "white", fill = "skyblue2") + theme_light() + theme(panel.border = element_blank(),
# panel.grid.minor = element_blank(), axis.ticks = element_blank(),
# axis.text = element_blank()) + guides(fill = FALSE) + labs(x = NULL,
# y = NULL) + scale_y_continuous(expand = c(0, 0), limits = c(0,
# 5600), breaks = c(0, 2000, 4000)) + geom_vline(xintercept = 0,
# size = 0.5, linetype = "dashed") + geom_vline(xintercept = medians[y],
# size = 0.5) + xlim(-5, 5) + annotate("text", label = "Mean", size = 3.4,
# x = -4.1, y = 5300, hjust = 0) + annotate("text", label = "Median",
# size = 3.4, x = -4.1, y = 4600, hjust = 0) + geom_segment(aes(x = -4.8,
# xend = -4.3, y = 5300, yend = 5300), linetype = "dashed") + geom_segment(aes(x = -4.8,
# xend = -4.3, y = 4600, yend = 4600)))
# print(p)
# }
#
# gif_file <- file.path(getwd(), "median.gif")
# save_gif(plot1(), gif_file = gif_file, progress = FALSE, loop = TRUE, delay = 0.5,
# width = 400, height = 133, res = 100)
#
# utils::browseURL(gif_file)
```
<center> ![](images/03_mean/median.gif) </center>
\
* The **mode** is the value (height, in our case) that occurs most frequently in the data set. It's not typically used in statistics, and we won't cover it further here.
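
To make the three measures concrete, here is a minimal sketch using a small made-up vector (`heights_skewed` is hypothetical, not the island data) with one unusually tall giraffe. Base R has no built-in function for the statistical mode, so tabulating the values is just one possible way to find it.

```{r}
# Hypothetical heights with one extreme value (right skew)
heights_skewed <- c(8, 9, 9, 10, 11, 12, 25)

mean(heights_skewed)    # pulled upward by the extreme value
median(heights_skewed)  # middle of the ordered values

# Count how often each height occurs, then keep the most frequent one(s)
counts <- table(heights_skewed)
names(counts)[counts == max(counts)]
```

In this example the mean is larger than the median because the single tall giraffe drags the average upward, which is the behavior the animation illustrates.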
\
# Taking the mean
The mean of a variable is the sum of its values, divided by the number of values.
This concept can be represented with the equation below. In our case, each "x" represents a giraffe height (i.e. a single observation), and the numerical subscript indicates its order in the sample. We'll use ${\bar{x}}$ (read "x-bar") to represent the mean of the height variable.
<div style="margin-bottom:50px"></div>
\begin{equation}
(\#eq:equation1)
\Large{\bar{x}} = \frac{x_1 + x_2 + ... + x_n}{n}
\end{equation}
<div style="margin-bottom:50px">
</div>
To make this more efficient, instead of writing "${x_1 + x_2 + ... + x_n}$", we can use the uppercase sigma symbol $\sum{}$ to represent summation of all the observations.
\
\begin{equation}
(\#eq:equation2)
\Large{\bar{x}} = \frac{\sum_{i=1}^{n}{x_i}}{n}
\end{equation}
\
This might look intimidating, but equation \@ref(eq:equation2) shows the same thing as equation \@ref(eq:equation1). Let's go through the steps again, breaking the symbols apart a bit (see annotated equation \@ref(eq:equation3) below). The sigma means 'add up'. What are we adding up? All the heights "x". The "i = " part indicates which term to begin with. For our purposes, this will always be the first observation, hence $i = 1$. The character on top of the sigma is the index of the last observation we include in our summation. In this case it's $n$, because we're adding all $n = 50$ observations in each group of giraffes. In both equations, we still divide by the total number of observations in each group: again, $n$.
\
<center>
\begin{equation}
(\#eq:equation3)
\vcenter{\img[width=400px]{images/03_mean/eq_annotated.png}}
\end{equation}
</center>
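
As a small worked example of equation \@ref(eq:equation2), take three made-up heights (not from the giraffe data): $x_1 = 9$, $x_2 = 10$ and $x_3 = 14$, so $n = 3$.

$$
\bar{x} = \frac{\sum_{i=1}^{3}{x_i}}{3} = \frac{9 + 10 + 14}{3} = \frac{33}{3} = 11
$$
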
# Notation for sample vs population
Recall our discussion about a [sample versus a population](02_bellCurve.html#bell-ttta). Different symbols are used to represent the mean for each of these. We've already discussed $\bar{x}$ for the sample mean. The analogous symbol for the population mean is ${\mu}$ (read "mu"). Additionally, when referring to the size of the population, we will use a capital ${N}$ instead of a lowercase $n$.
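
Written with the population symbols, equation \@ref(eq:equation2) keeps the same form, with $\mu$ in place of $\bar{x}$ and $N$ in place of $n$. This is the standard definition, shown here only to make the symbol swap concrete:

$$
\mu = \frac{\sum_{i=1}^{N}{x_i}}{N}
$$
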
\
# Code it up
Equation \@ref(eq:equation2) translates easily into R code. The heights recorded from island 1 have been stored in a vector called `heights_island1`. Below we show the first few observations from this vector, using the `head()` function.
<div style="margin-bottom:15px">
</div>
```{r, echo=FALSE}
set.seed(12)
heights_island1 <- rnorm(50, 10, 2)
```
```{r, echo=TRUE}
head(heights_island1)
```
\
Use the interactive window below to calculate the mean "by hand".
<!---LEARNR EX 1-->
<iframe class="interactive" id="myIframe1" src="https://tinystats.shinyapps.io/03-mean-ex1/" scrolling="no" frameborder="no"></iframe>
<!------------->
# Create your own function
Now it's your turn to write your own function. Call it `my_mean` and have it calculate the mean of any given vector. You're going to use the rules for writing a function in R that you've used previously. As a reminder, you'll use `function()` and embed your code (that you completed in the window above) within curly brackets `{}`. The advantage of making a "homemade" function is that you can string together all the steps from the previous exercise into a single command.
<!---LEARNR EX 2-->
<iframe class="interactive" id="myIframe2" src="https://tinystats.shinyapps.io/03-mean-ex2/" scrolling="no" frameborder="no"></iframe>
<!------------->
You can also complete the exercise above in RStudio on your local computer. This way you will be able to save your `my_mean()` function and script for future use.
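
If you want to compare notes after trying the exercise yourself, here is one possible version: a minimal sketch that follows equation \@ref(eq:equation2) directly. The test vector is made up for illustration, and your own solution may be structured differently.

```{r}
my_mean <- function(x) {
  total <- sum(x)   # numerator: add up all observations
  n <- length(x)    # denominator: the number of observations
  total / n
}

# Three hypothetical heights
my_mean(c(9, 10, 14))
```

A quick way to check a homemade function like this is to compare its result with R's built-in `mean()` on the same vector.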
\
# Take a tea break!
```{r, out.width="600px", echo= FALSE}
knitr::include_graphics("images/03_mean/Teacup.png")
```
# Taking the median
To calculate the median go through the following steps:
* Assess whether there is an odd or even number of observations
* Order all observations from smallest to largest
* If an odd number, then the median is the middle value at position: (n + 1) / 2
* If an even number, then:
    + Find the value at the position: n / 2
    + Find the value at the position: (n / 2) + 1
    + The median will be the mean of the values at these two positions.
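
The steps above can be traced by hand on two small made-up vectors (the names `odd_heights` and `even_heights` are hypothetical), one with an odd and one with an even number of observations. This is only a sketch of the positions involved, not yet a reusable function.

```{r}
# Odd number of observations: take the single middle value
odd_heights <- c(12, 7, 10, 9, 11)
sorted_odd <- sort(odd_heights)          # order from smallest to largest
n_odd <- length(sorted_odd)
sorted_odd[(n_odd + 1) / 2]              # value at position (n + 1) / 2

# Even number of observations: average the two middle values
even_heights <- c(12, 7, 10, 9)
sorted_even <- sort(even_heights)
n_even <- length(sorted_even)
lower <- sorted_even[n_even / 2]         # value at position n / 2
upper <- sorted_even[(n_even / 2) + 1]   # value at position (n / 2) + 1
(lower + upper) / 2
```
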
Before you write your own median function, two concepts need to be introduced: 1) the modulus operator `%%` and 2) `if...else` statements.
The **modulus operation** gives the remainder after division of one number by another. For example, in R `11 %% 5` returns `1`, which is the remainder of `11` divided by `5`. If the modulus operation returns `0`, then there is no remainder. Applying `x %% 2` is a useful way to determine whether a number `x` is even or odd: `x` is even if the result is exactly equal to 0, and odd otherwise. See the example code below.
<div style="margin-bottom:15px">
<style>
.col2 {
columns: 2 200px; /* number of columns and width in pixels*/
-webkit-columns: 2 200px; /* chrome, safari */
-moz-columns: 2 200px; /* firefox */
}
</style>
</div>
<div class="col2">
```{r, tut=FALSE, prompt=TRUE}
10 %% 2
10 %% 2 == 0
11 %% 2
11 %% 2 == 0
```
</div>
\
An `if...else` statement is useful when you want to specify distinct outcomes for objects dependent on whether they meet your set criteria. See below.
<!---LEARNR EX 3-->
<iframe class="interactive" id="myIframe3" src="https://tinystats.shinyapps.io/03-mean-ex3/" scrolling="no" frameborder="no"></iframe>
<!------------->
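
As a minimal sketch of the syntax, the chunk below combines `%%` with `if...else` to label a number as even or odd. The variable names `n_giraffes` and `parity` are made up for illustration.

```{r}
n_giraffes <- 7

if (n_giraffes %% 2 == 0) {
  # Runs only when the remainder after dividing by 2 is 0
  parity <- "even"
} else {
  # Runs in every other case
  parity <- "odd"
}

parity
```
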
Now that you have a sense for how the `%%` operator can be used to test whether a number is EVEN or ODD, and how `if...else` statements work, use both of these concepts in the window below to write your own function that calculates the median of any vector.
<!---LEARNR EX 4-->
<iframe class="interactive" id="myIframe4" src="https://tinystats.shinyapps.io/03-mean-ex4b/" scrolling="no" frameborder="no"></iframe>
<!------------->
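
Once you have tried the exercise, you can compare your answer with the sketch below. It is one possible version that follows the steps listed at the start of this section, and the test vectors are made up. Note that `sort()` drops `NA` values by default, so this version silently ignores missing observations.

```{r}
my_median <- function(x) {
  sorted <- sort(x)        # order from smallest to largest
  n <- length(sorted)
  if (n %% 2 == 1) {
    # Odd number of observations: the single middle value
    sorted[(n + 1) / 2]
  } else {
    # Even number of observations: mean of the two middle values
    mean(sorted[c(n / 2, (n / 2) + 1)])
  }
}

my_median(c(12, 7, 10, 9, 11))
my_median(c(12, 7, 10, 9))
```

As with the mean, comparing the result against R's built-in `median()` is a simple sanity check.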
# Things to think about
Remember that the sample mean is an estimate of the entire population's mean (which would often be impossibly large to measure). How reliably does the mean of a sample represent the population mean? *Warning*: if a small sample has been used, the sample mean may not be reliable at all! Estimates from small samples are subject to the whims of randomness. On the other hand, the larger the sample, the closer the sample size approaches the population size, and the more reliable the sample estimate becomes.
Pressing 'Play' on the plot below will illustrate this concept.
<a name="mean_animation">
```{r, tut=FALSE, echo=FALSE, message= FALSE, cache=TRUE}
m <- list(l = 50, r = 50, b = 10, t = 10, pad = 4)

# Build cumulative animation frames: frame k holds every row up to level k
accumulate_by <- function(dat, var) {
  var <- lazyeval::f_eval(var, dat)
  lvls <- plotly:::getLevels(var)
  dats <-
    lapply(seq_along(lvls), function(x) {
      cbind(dat[var %in% lvls[seq(1, x)], ], frame = lvls[[x]])
    })
  dplyr::bind_rows(dats)
}

# Draw standard-normal samples whose sizes are log-spaced between 10 and 10000
d <- do.call(rbind, lapply(lseq(10, 10000, 300), function(x) {
  d <- data.frame(x = rnorm(x), frame = x/300, N = x)
  return(d)
}))

# One mean per sample, then accumulate so the line grows frame by frame
dd <-
  aggregate(data = d, x ~ frame + N, FUN = mean) %>%
  accumulate_by(~N)

p <-
  dd %>%
  plot_ly(x = ~log10(N), y = ~x, frame = ~frame...4, type = "scatter", mode = "lines",
          line = list(simplify = FALSE, color = "orangered"), width = 550, height = 350) %>%
  animation_opts(frame = 10, transition = 0, redraw = FALSE) %>%
  config(displayModeBar = FALSE) %>%
  layout(xaxis = list(title = "Sample Size (log10)", zeroline = FALSE),
         yaxis = list(range = c(-0.7, 0.7), title = "Mean", zeroline = FALSE),
         autosize = FALSE, margin = m) %>%
  animation_slider(hide = TRUE) %>%
  animation_button(x = 1, xanchor = "right", y = 0, yanchor = "bottom")

htmltools::save_html(p, here("images/03_mean/Law_of_large_numbers.html"))
```
</a>
<center><iframe style="margin: 0px;" src="images/03_mean/Law_of_large_numbers.html" width="570" height="400" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe></center>
<script type="text/x-mathjax-config">
MathJax.Ajax.config.path["img"] = "https://cdn.rawgit.com/pkra/mathjax-img/1.0.0/";
MathJax.Hub.Config({
extensions: ["tex2jax.js","[img]/img.js"],
jax: ["input/TeX","output/HTML-CSS"],
tex2jax: {inlineMath: [["$","$"],["\\(","\\)"]]},
});
</script>
* The animation above shows the values of means calculated from increasingly large samples: small samples on the left and larger samples to the right (on the x-axis).
* Each point on the zig-zag line is the mean calculated from a random sample. The true mean of the population is 0.
* The y-axis shows what the mean is for a sample of that particular size. Though the y-values vary here, remember that if the sample were a good estimate of the population, the y-values should be very close to 0.
* You can see that when the samples are small, the sample mean isn't necessarily a good representation of the mean of the population it was sampled from -- and that is not a good thing.
For further reading see the [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers).
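
A rough simulation sketch of the same idea, assuming a standard normal population (true mean 0) as in the animation. The seed, the 200 repetitions per sample size, and the variable names are arbitrary choices for illustration.

```{r}
set.seed(1)  # arbitrary seed so the sketch is reproducible

sample_sizes <- c(10, 100, 1000, 10000)

# For each sample size, draw 200 samples and measure how much
# the 200 sample means vary around the true mean of 0
spread_of_means <- sapply(sample_sizes, function(n) {
  means <- replicate(200, mean(rnorm(n, mean = 0, sd = 1)))
  sd(means)
})

data.frame(n = sample_sizes, sd_of_sample_means = spread_of_means)
```

The spread of the sample means should shrink as `n` grows, which is another way of seeing that larger samples give more reliable estimates.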
<script>
iFrameResize({}, ".interactive");
</script>