---
title: "Palmer Penguins - Bootstrapping"
date: "`r Sys.Date()`"
author: Petr Hrobař
output:
  github_document:
    pandoc_args: --webtex
    highlight: pygments
editor_options:
  chunk_output_type: console
always_allow_html: true
---
# Introduction
In this example we will use data from the [tidytuesday project](https://github.com/rfordatascience/tidytuesday). We will perform basic sample bootstrapping as well as logistic regression for classification. The dataset focuses on penguins and lends itself nicely to analysis. A detailed description of each variable can be found [here](https://allisonhorst.github.io/palmerpenguins/).
```{r set knitr options, echo = FALSE}
# set some knitr options
knitr::opts_chunk$set(echo = TRUE,
                      message = FALSE,
                      warning = FALSE,
                      fig.width = 8, fig.height = 5)
```
First, let's do some basic data loading and cleaning:
```{r}
# packages
library(tidyverse)
library(reshape2)
library(knitr)
library(kableExtra)

# Setting seed
set.seed(1)
theme_set(theme_light())

# Data loading and minor cleaning
penguins.csv <-
  readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins_raw.csv') %>%
  janitor::clean_names() %>%
  mutate(species =
           case_when(species == "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
                     species == "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo",
                     species == "Chinstrap penguin (Pygoscelis antarctica)" ~ "Chinstrap"),
         sex = as.factor(sex),
         species = as.factor(species)) %>%
  separate(date_egg, c("year", "month", "day"), "-") %>%
  select(-day) %>%
  filter(!is.na(sex),
         !is.na(species))

levels(penguins.csv$sex) = c("Female", "Male")
```
Later we will inspect how penguins spread out over time; visible clustering tendencies within each species can already be noticed, which can be quite useful for future analyses. The final cleaned dataset looks as follows:
```{r}
penguins.csv %>%
  select(-study_name, -sample_number, -region, -island, -stage, -year, -month) %>%
  head() %>%
  knitr::kable()
```
Before the full analysis, we can inspect how each species tends to cluster (over time).
```{r}
# plot penguin measurements over time - are the species clustering?
penguins.csv %>%
  ggplot(aes(culmen_length_mm, culmen_depth_mm, color = species, shape = sex)) +
  geom_point() +
  coord_flip() +
  theme(legend.position = "bottom") +
  labs(title = "Are different species of penguins clustering over time?") +
  facet_wrap(~year)
```
We have three species of penguins at our disposal, i.e. $\textbf{Adelie, Gentoo, Chinstrap}$. We also observe each individual's sex, so we can take this factor into account as well. First, let's inspect the distributions of each observed metric, i.e. $\textbf{Body mass, Culmen depth, Culmen length, Flipper length}$.
```{r}
# Let's inspect distribution of four measures for each species (ignoring sex)
penguins.csv %>%
  select(species, sex,
         culmen_length_mm,
         culmen_depth_mm,
         flipper_length_mm,
         body_mass_g) %>%
  gather(key, value, -sex, -species) %>%
  ggplot(aes(value)) +
  geom_density(aes(fill = species), alpha = 0.4) +
  facet_wrap(~key, scales = "free")
```
# Bootstrapping

The bootstrap is a powerful technique for obtaining (almost) any statistic using random resampling. The main idea is that if the sample we are working with is reasonably representative, we can bootstrap - sample with replacement - in order to obtain a more precise characterisation of the underlying data-generating distribution.
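As a quick illustration of the idea (a minimal sketch, pooling all penguins and ignoring species and sex for a moment; the object names here are illustrative only), resampling the observed culmen lengths with replacement many times gives an empirical distribution of the sample mean:

```{r}
# Minimal bootstrap sketch: resample the pooled culmen lengths with replacement
# and look at the spread of the resulting means.
x_all <- penguins.csv$culmen_length_mm
boot_means_all <- replicate(1000, mean(sample(x_all, replace = TRUE)))

# The sd of the bootstrapped means approximates the standard error of the mean
c(bootstrap_se = sd(boot_means_all),
  analytic_se = sd(x_all) / sqrt(length(x_all)))
```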
In this study we show how bootstrap methods can be used to obtain a more robust estimate of the underlying distribution; more precisely, how to obtain a smaller standard error and perform statistical inference. Since we have information about both the species and the sex of each penguin, we nest the data to account for variation not only among species but also between sexes within a species.
```{r}
grouped_boot <-
  penguins.csv %>%
  select(species, sex,
         culmen_length_mm) %>%
  group_by(species, sex) %>%
  nest()

grouped_boot
```
The nested dataset can be seen above: we are in fact working not with one big dataset but with 6 smaller datasets (one per species-sex combination) at the same time, as the quick peek below shows.
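For instance (a minimal sketch, not needed for the analysis itself), the data for a single group can be pulled out of the nested tibble and summarised:

```{r}
# Peek inside one nested group: culmen lengths of female Adelie penguins
grouped_boot %>%
  filter(species == "Adelie", sex == "Female") %>%
  pull(data) %>%
  pluck(1) %>%
  summary()
```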
Now let's write the bootstrap function:
```{r}
############################################################################## #
###########           Function for Performing BOOTSTRAP             ###########
############################################################################## #

# Function arguments: df:       dataframe to run bootstrap on
#                     n:        number of bootstrap replications to perform (default is 500)
#                     variable: metric to perform bootstrap on (default is culmen_length_mm)
boot_means <- function(df, n = 500, variable = "culmen_length_mm") {

  # create an empty dataframe for bootstrapping results
  boot_means = tibble()

  # Iterate over random bootstrap samples
  for (i in 1:n) {
    # Draw one resample (with replacement) and compute its mean and sd
    resample_i <-
      sample_n(df, size = nrow(df), replace = TRUE) %>%
      pull(variable)

    mean_i <- cbind(mean = mean(resample_i),
                    sd = sd(resample_i))

    # Save the results to the dataframe
    boot_means = rbind(boot_means, mean_i)
  }

  return(boot_means)
}
```
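A quick sanity check of the function (a minimal sketch using only a handful of replications, just to show the shape of the output; the subgroup chosen here is arbitrary):

```{r}
# Sanity check: a few bootstrap replications on the female Gentoo subgroup
penguins.csv %>%
  filter(species == "Gentoo", sex == "Female") %>%
  boot_means(n = 10) %>%
  head()
```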
Once the function is finished, we can illustrate the power of the bootstrap.
We might be interested in the distribution of $\textbf{Culmen length}$ in the population of Adelie penguins. However, we only have a random sample of male Adelie penguins. In order to generalize and understand the real distribution, we can perform a bootstrap. To demonstrate the power of bootstrapping, we show before and after plots for comparison.
```{r}
bootstrap_sample <-
  penguins.csv %>%
  filter(species == "Adelie",
         sex == "Male") %>%
  select(culmen_length_mm) %>%
  mutate(Type = "Before Bootstrap") %>%
  bind_rows(penguins.csv %>%
              filter(species == "Adelie",
                     sex == "Male") %>%
              boot_means(., n = 1000) %>%
              mutate(Type = "After Bootstrap"))

gg_boot_hists <-
  bootstrap_sample %>%
  gather(key, value, -Type) %>%
  mutate(key =
           case_when(key == "culmen_length_mm" ~ "Sample distribution",
                     key == "mean" ~ "Bootstrapped mean",
                     key == "sd" ~ "Bootstrapped sd"))

gg_boot_hists$key = factor(gg_boot_hists$key, levels = c("Sample distribution", "Bootstrapped mean", "Bootstrapped sd"))

gg_boot_hists %>%
  ggplot(aes(value)) +
  geom_histogram(aes(fill = Type), bins = 20, color = "white") +
  facet_wrap(~key, scales = "free") +
  labs(title = "Distribution of culmen length for the Adelie species - Males only",
       subtitle = "Before and after bootstrapping samples")
```
As can be seen, the bootstrap distribution appears roughly normally distributed. Now that we have seen the power of bootstrapping, let's apply it to every single metric: $\textbf{Body mass, Culmen depth, Culmen length, Flipper length}$, for every subgroup (sex within species).
Using the bootstrapped distributions we can see how each metric is (presumably) distributed in the population. Since all values seem to be normally distributed, the parameters of a normal distribution can be used to perform statistical inference.
Using $\textbf{Maximum likelihood estimation}$, we can estimate the parameters of the distribution of culmen lengths (for a particular species and sex).
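For the normal distribution the ML estimates have a closed form, $\hat\mu = \bar{x}$ and $\hat\sigma = \sqrt{\tfrac{1}{n}\sum_i (x_i - \bar{x})^2}$. As a quick check (a minimal sketch on one arbitrarily chosen subgroup, separate from the pipeline below), the manual computation should agree with `MASS::fitdistr`:

```{r}
# Closed-form normal MLE vs. MASS::fitdistr on one subgroup (female Chinstrap)
x_chk <- penguins.csv %>%
  filter(species == "Chinstrap", sex == "Female") %>%
  pull(culmen_length_mm)

c(mu_hat = mean(x_chk),
  sigma_hat = sqrt(mean((x_chk - mean(x_chk))^2)))   # note: 1/n, not 1/(n-1)

MASS::fitdistr(x_chk, "normal")$estimate
```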
```{r}
# Maximum likelihood estimation of parameters of the normal dist.
boot <-
  grouped_boot %>%
  mutate(dist_mle = map(data, ~MASS::fitdistr(pull(.), "normal")),
         tidy_mle = map(dist_mle, ~broom::tidy(.))) %>%
  ungroup() %>%
  mutate(
    boot_characteristic = map(data, ~boot_means(df = ., n = 2000))) %>%
  unnest(boot_characteristic) %>%
  select(-data)

levels(boot$sex) = c("Female", "Male")
boot$sex = factor(boot$sex, levels = c("Male", "Female"))

boot %>%
  unnest(tidy_mle) %>%
  filter(term != "sd") %>%
  ggplot(aes(mean)) +
  geom_density(aes(fill = species), alpha = 0.5) +
  facet_grid(sex ~ species) +
  geom_vline(aes(xintercept = estimate), lty = 2, color = "red") +
  labs(title = "Are distributions of culmen length different for species and sex?",
       subtitle = "Bootstrapped sample's means")

boot %>%
  unnest(tidy_mle) %>%
  filter(term == "sd") %>%
  ggplot(aes(sd)) +
  geom_density(aes(fill = species), alpha = 0.5) +
  facet_grid(sex ~ species) +
  geom_vline(aes(xintercept = estimate), lty = 2, color = "red") +
  labs(title = "Are distributions of culmen length different for species and sex?",
       subtitle = "Bootstrapped sample's sd's")
```
```{r}
dist_arg <-
  penguins.csv %>%
  filter(sex == "Male",
         species == "Adelie") %>%
  pull(culmen_length_mm) %>%
  MASS::fitdistr(., "normal") %>%
  broom::tidy()

# Estimated parameters
dist_arg %>%
  knitr::kable()

boot %>%
  filter(sex == "Male",
         species == "Adelie") %>%
  select(-sd, -dist_mle, -tidy_mle) %>%
  mutate(`Normal Distribution` = rnorm(2000, mean = dist_arg$estimate[1], sd = dist_arg$estimate[2])) %>%
  rename(`Bootstrap mean` = mean) %>%
  gather(key, value, -species, -sex) %>%
  mutate(key = ifelse(key == "culmen_length_mm", "Empirical Distribution", key)) %>%
  ggplot(aes(value)) +
  geom_density(aes(fill = key), alpha = 0.45) +
  labs(title = "Estimated distribution of Culmen Length (mm) for males of the Adelie species",
       subtitle = "Normal distribution is assumed",
       fill = "distribution") +
  scale_x_continuous(breaks = c(seq(32, 46, by = 2)))
```
It can be confirmed that the normal distribution indeed seems to be a good approximation. Now that we have the parameters of the distribution at our disposal, basic statistical inference can be performed. For example, if we are interested in the probability that a randomly picked male Adelie penguin has a culmen length smaller than or equal to some value, say 40, a basic calculation can be used:
$$P(X \leq 40) = P\left(Z \leq \frac{40 - \hat\mu}{\hat\sigma}\right) = P(Z \leq -0.176) = 0.43,$$
where $Z = \frac{X - \hat\mu}{\hat\sigma}$ follows a standard normal distribution. See below:
```{r}
z <- rnorm(3000, mean = dist_arg$estimate[1], sd = dist_arg$estimate[2])
dens <- density(z)

data <- tibble(x = dens$x, y = dens$y) %>%
  mutate(variable = case_when(
    (x <= 40) ~ "On",
    TRUE ~ NA_character_))

ggplot(data, aes(x, y)) + geom_line() +
  geom_area(data = filter(data, variable == 'On'), fill = '#00AFBB', alpha = 0.2) +
  geom_vline(aes(xintercept = 40), lty = 2, color = "red")

zz1 <- pnorm(40, mean = dist_arg$estimate[1], sd = dist_arg$estimate[2])
zz1
```
Alternatively, we can calculate the probability that a randomly picked male Adelie penguin has a culmen length between 40.5 and 45. Once again, quantiles of the standard normal distribution can be used:
$$P(40.5 \leq X \leq 45)$$
```{r}
z <- rnorm(4000, mean = dist_arg$estimate[1], sd = dist_arg$estimate[2])
dens <- density(z)

data <- tibble(x = dens$x, y = dens$y) %>%
  mutate(variable = case_when(
    (x >= 40.5 & x <= 45) ~ "On",
    TRUE ~ NA_character_))

ggplot(data, aes(x, y)) + geom_line() +
  geom_area(data = filter(data, variable == 'On'), fill = '#00AFBB', alpha = 0.2) +
  geom_vline(aes(xintercept = 40.5), lty = 2, color = "red") +
  geom_vline(aes(xintercept = 45), lty = 2, color = "red")

zz2 <- pnorm(45, mean = dist_arg$estimate[1], sd = dist_arg$estimate[2]) -
  pnorm(40.5, mean = dist_arg$estimate[1], sd = dist_arg$estimate[2])
zz2
```
The shaded area displayed above is equal to `r round(zz2, 4)`. Using MLE to obtain the parameters of the normal distribution is one possible approach. Alternatively, since we are working with a bootstrapped sample, we can estimate the distribution's mean with the $\textbf{sample mean}$, i.e. $\hat\mu = \bar{x}$, and its spread with the $\textbf{standard error of the mean}$, i.e. $\hat\sigma = \frac{s_x}{\sqrt{n}}$.
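For a single subgroup this is just the familiar normal-approximation interval $\bar{x} \pm z_{0.975} \cdot \frac{s_x}{\sqrt{n}}$; a minimal sketch for male Adelie penguins (the full per-group computation follows below):

```{r}
# Normal-approximation 95% CI for the mean culmen length of male Adelie penguins
x_adelie_m <- penguins.csv %>%
  filter(species == "Adelie", sex == "Male") %>%
  pull(culmen_length_mm)

mean(x_adelie_m) +
  c(lower = -1, upper = 1) * qnorm(0.975) * sd(x_adelie_m) / sqrt(length(x_adelie_m))
```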
```{r}
conf_intervals_original_data <-
  grouped_boot %>%
  unnest(data) %>%
  group_by(species, sex) %>%
  summarise(mean = mean(culmen_length_mm),
            sd = sd(culmen_length_mm),
            count = n(),
            sd_est = sd / sqrt(count)) %>%
  mutate(conf.low = mean - (qnorm(0.975) * sd_est),
         conf.high = mean + (qnorm(0.975) * sd_est)) %>%
  mutate(Method = "Sample Estimation")

conf_intervals_MLE <-
  boot %>%
  ungroup() %>%
  unnest(tidy_mle) %>%
  select(species, sex, term, estimate, std.error) %>%
  distinct() %>%
  mutate(conf.low = estimate - (qnorm(0.975) * std.error),
         conf.high = estimate + (qnorm(0.975) * std.error))

conf_intervals_MLE <-
  conf_intervals_MLE %>%
  pivot_wider(values_from = estimate:conf.high,
              names_from = term) %>%
  select(species, sex, mean = estimate_mean, sd = estimate_sd,
         conf.low = conf.low_mean, conf.high = conf.high_mean) %>%
  mutate(Method = "ML - Estimation")

bind_rows(conf_intervals_original_data, conf_intervals_MLE) %>%
  ggplot(aes(mean, species, color = species)) +
  geom_point() +
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high)) +
  facet_grid(Method ~ sex, scales = "free_y")
```