Skip to content

Latest commit

 

History

History
1023 lines (651 loc) · 19.4 KB

Palmer-Penguins.md

File metadata and controls

1023 lines (651 loc) · 19.4 KB

Palmer Penguins - Bootstraping

Petr Hrobař 2020-08-06

Introduction

In this example we will use data from tidytuesday project. We will perform basic sample bootstraping as well as logistic regression for classification. Dataset is focusing on penguins and provides nice dataset for analyses. Detailed descripton of each varaibles can be found here.

First, let’s do some basic data loading and cleaning:

# packages
library(tidyverse)
library(reshape2)
library(knitr)
library(kableExtra)

# Setting seed
set.seed(1)

theme_set(theme_light())


# Data loading and minor cleaning
penguins.csv <- 
  readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins_raw.csv') %>% 
  janitor::clean_names() %>% 
  mutate(species = 
           case_when(species == "Adelie Penguin (Pygoscelis adeliae)" ~ "Adelie",
                     species == "Gentoo penguin (Pygoscelis papua)" ~ "Gentoo", 
                     species == "Chinstrap penguin (Pygoscelis antarctica)" ~ "Chinstrap"),
         sex = as.factor(sex),
         species = as.factor(species)) %>% 
  separate(date_egg, c("year", "month", "day"), "-") %>% 
  select(-day) %>% 
  filter(!is.na(sex),
         !is.na(species))


levels(penguins.csv$sex) = c("Female", "Male")

Now let’s inspect how much are penguins relocating over time. We can also notice visible custering tendencies within each species. This can be pretty usefull for future analyses. Final cleaned dataset looks as follows:

penguins.csv %>% 
  select(-study_name, -sample_number, -region, -island, -stage, -year, -month) %>% 
  head() %>% 
  knitr::kable()

species

individual_id

clutch_completion

culmen_length_mm

culmen_depth_mm

flipper_length_mm

body_mass_g

sex

delta_15_n_o_oo

delta_13_c_o_oo

comments

Adelie

N1A1

Yes

39.1

18.7

181

3750

Male

NA

NA

Not enough blood for isotopes.

Adelie

N1A2

Yes

39.5

17.4

186

3800

Female

8.94956

-24.69454

NA

Adelie

N2A1

Yes

40.3

18.0

195

3250

Female

8.36821

-25.33302

NA

Adelie

N3A1

Yes

36.7

19.3

193

3450

Female

8.76651

-25.32426

NA

Adelie

N3A2

Yes

39.3

20.6

190

3650

Male

8.66496

-25.29805

NA

Adelie

N4A1

No

38.9

17.8

181

3625

Female

9.18718

-25.21799

Nest never observed with full clutch.

Before full analysis, we can inspect how each species tend to cluster (over time).

# plot penguins spread over time - are they clustering (based on location)
penguins.csv %>% 
  ggplot(aes(culmen_length_mm, culmen_depth_mm, color = species, shape = sex)) +
  geom_point() + 
  coord_flip() + 
  theme(legend.position="bottom") + 
  labs(title = "Are different species of penguins clustering over time?") + 
  facet_wrap(~year)

We have three species of penguins at our disposal, i.e. \textbf{Adelie, Gentoo, Chinstrap}. Also we are observing each individuals sex. Therefore, we can take this factor into account as well. Firstly, let’s inspect distributions of each obseeved metrics, i.e. \textbf{Body mass, Culmen depth, Culmen length, Flipper length}.

# Let's inspect distribution of four measures for each species (ignoring sex)
penguins.csv %>% 
  select(species, sex, 
         culmen_length_mm, 
         culmen_depth_mm, 
         flipper_length_mm, 
         body_mass_g) %>% 
  gather(key, value, -sex, -species) %>% 
  ggplot(aes(value)) + 
  geom_density(aes(fill = species), alpha = 0.4) + 
  facet_wrap(~key, scales = "free")

Bootstraping

Bootstrap is a powefull technique of obtaining (almost) any statistic using random sampling methods. Main idea is that if our sample we are working with is somewhat representative, we can bootstrap - sample with replacement in order to obtain more precise characteristic of underlying data-generating process distribution.

In this study, we will show to bootstrap methods can be used to obtain more robust estimation of underlying distribution. More precisely,how to obtain smaller standart error and perform statistical inference. Since we have information about species as well as sex of particular penguin, we shall nest data to account for variations not only among species but also among sex within species.

grouped_boot <- 
  penguins.csv %>% 
  select(species, sex, 
         culmen_length_mm) %>% 
  group_by(species, sex) %>% 
  nest()

grouped_boot
## # A tibble: 6 x 3
## # Groups:   species, sex [6]
##   species   sex    data             
##   <fct>     <fct>  <list>           
## 1 Adelie    Male   <tibble [73 × 1]>
## 2 Adelie    Female <tibble [73 × 1]>
## 3 Gentoo    Female <tibble [58 × 1]>
## 4 Gentoo    Male   <tibble [61 × 1]>
## 5 Chinstrap Female <tibble [34 × 1]>
## 6 Chinstrap Male   <tibble [34 × 1]>

An example of nested dataset can be observed. We are in fact working with not one big dataset but 6 smaller dataset at the same time.

Now let’s writte bootstrap function:

############################################################################## #
###########              Function for Performing BOOTSTRAP           ###########
############################################################################## #

# Function arguments:   df: dataframe to run bootstrap on,
#                       n: number of bootstrap replications to perform (default is 500)
#                       variable: metric to perform bootstrap on (default is culmen_length_mm)

boot_means <- function(df = df, n = 500, variable = "culmen_length_mm") {
  
  # create an empty dataframe for bootsraping results
  boot_means = tibble()
  
  # Iterate over random bootstrap samples
  for (i in 1:n) {
    message(i)
    mean_i <- 
      cbind(mean = 
      sample_n(df, size = nrow(df), replace = T) %>% 
      pull(variable) %>% mean(), 
           sd = sample_n(df, size = nrow(df), replace = T) %>% 
      pull(variable) %>% sd())
    
    # Save the results to the dataframe
    boot_means = rbind(boot_means, mean_i)
    #colnames(boot_means) = variable
  }
  return(boot_means)
}

Once the function is finished, we can illustrate power of the bootstrap. We might be interested in distribution of \textbf{Culmen length} in Adelia species penguins population. However, we only have random sample of Adelia males penguins population. In order to generalize and understand real distribution we can perform bootstrap. To demonstrate power of bootstraping, we show before and after plots for comparison.

bootstrap_sample <- 
  penguins.csv %>% 
  filter(species == "Adelie",
         sex == "Male") %>% 
  select(culmen_length_mm) %>% 
  mutate(Type = "Before Bootstrap")%>% 
  bind_rows(penguins.csv %>% 
  filter(species == "Adelie",
         sex == "Male") %>% 
  boot_means(., n = 1000) %>% 
    mutate(Type = "After Bootstrap"))

gg_boot_hists <- 
  bootstrap_sample %>% 
  gather(key, value, -Type) %>% 
  mutate(key = 
           case_when(key == "culmen_length_mm" ~ "Sample distribution",
                     key == "mean" ~ "Bootstraped mean",
                     key == "sd" ~ "Bootstraped sd"))

gg_boot_hists$key = factor(gg_boot_hists$key, levels = c("Sample distribution", "Bootstraped mean", "Bootstraped sd"))

gg_boot_hists %>% 
  ggplot(aes(value)) + 
  geom_histogram(aes(fill = Type), bins = 20, color = "white") + 
  facet_wrap(~key, scales = "free") + 
  labs(title = "Distribution of culmen length for Adelia species - Males only", 
       subtitle = "Before and after bootstraping samples")

As can be seen, bootstrap distribution seems to be somewhat normally distributed. Now that we see power of bootstraping, let’s apply it to every single metrics: \textbf{Body mass, Culmen depth, Culmen length, Flipper length}, for each single subgroup (sex within the species).

Using bootstraped distributions we can distinguish how are each metrics (presumably) distributed in the population. Since all values seem to be normaly distributed, parametrs of normal distribution can be used to perform statistical inference.

Using \textbf{Maximum likelihood estimation}, we can estimate parametrs of a distribution of culmen lenghts (for perticular specie and sex).

# Maximum likelihood estimation of parametrs of normal dist. 

boot <- 
  grouped_boot %>% 
  mutate(dist_mle = map(data, ~MASS::fitdistr(pull(.), "normal")), 
         tidy_mle = map(dist_mle, ~broom::tidy(.))) %>% 
  ungroup() %>% 
  mutate(
    boot_charactersitic = map(data, ~boot_means(df = ., n = 2000))) %>% 
  unnest(boot_charactersitic) %>% 
  select(-data)


levels(boot$sex) = c("Female", "Male")
boot$sex = factor(boot$sex, levels = c("Male", "Female"))

boot %>%   
  unnest(tidy_mle) %>% 
  #filter(sex == "MALE") %>% 
  filter(term != "sd") %>% 
  ggplot(aes(mean)) + 
  geom_density(aes(fill = species), alpha = 0.5) + 
  facet_grid(~sex ~ species) + 
  geom_vline(aes(xintercept = estimate), lty = 2, color = "red") +
  labs(title = "Are distributions of culmen length different for species and sex?", subtitle = "Bootstrapped sample's means")

boot %>%   
  unnest(tidy_mle) %>% 
  #filter(sex == "MALE") %>% 
  filter(term == "sd") %>% 
  ggplot(aes(sd)) + 
  geom_density(aes(fill = species), alpha = 0.5) + 
  facet_grid(~sex ~ species) + 
   geom_vline(aes(xintercept = estimate), lty = 2, color = "red") +
  labs(title = "Are distributions of culmen length different for species and sex?", subtitle = "Bootstrapped sample's sd's")

dist_arg <- 
  penguins.csv %>% 
  filter(sex == "Male", 
         species == "Adelie") %>% 
  pull(culmen_length_mm) %>% 
  MASS::fitdistr(., "normal") %>% 
  broom::tidy()

# Estimated parametrs
dist_arg %>% 
  knitr::kable()

term

estimate

std.error

mean

40.39041

0.2646862

sd

2.26148

0.1871614

boot %>% 
  filter(sex == "Male", 
         species == "Adelie") %>% 
  select(-sd, -dist_mle, -tidy_mle) %>% 
  mutate(`Normal Distribution` = rnorm(2000, mean = dist_arg$estimate[1], sd = dist_arg$estimate[2])) %>% 
  rename(`Bootstrap mean` = mean) %>% 
  gather(key, value, -species, -sex) %>% 
  mutate(key = ifelse(key == "culmen_length_mm", "Empirical Distribution", key)) %>% 
  ggplot(aes(value)) + 
  geom_density(aes(fill = key), alpha = 0.45) +
labs(title = "Estimated distribution of Culmen Lenght (mm) for males of Adelie species",
       subtitle = "Normal distribution is assumed",
         fill = "distribution") + 
  scale_x_continuous(breaks = c(seq(32, 46, by = 2)))

It can be confirmed that normal distribution indeed seems to be a good estimation. Now that we have paramets of distribution at our disposal, basic statistical inference can be performed. For Example if we are interested in probability that randomly picked male penguin from Adelie specie has Culmen Lenght smaller or equal to some value, let’s say 40, basic test can be used.

P(X \leq 40) = P(Z \leq -0.176) = 0.43
, where Z is Z-score of a standart normal distribution. See below:

z <-  rnorm(3000,  mean = dist_arg$estimate[1], sd = dist_arg$estimate[2])

dens <- density(z)

data <- tibble(x = dens$x, y = dens$y) %>% 
    mutate(variable = case_when(
      (x <= 40) ~ "On",
      TRUE ~ NA_character_))

ggplot(data, aes(x, y)) + geom_line() +
  geom_area(data = filter(data, variable == 'On'), fill = '#00AFBB', alpha = 0.2) + 
  geom_vline(aes(xintercept = 40), lty = 2, color = "red")

zz1 <- pnorm(40, mean = dist_arg$estimate[1], sd = dist_arg$estimate[2])
zz1
## [1] 0.4314691

Alternatively, we cancalculate what is a probability that randomly picked male penguin from Adelie specie has Culmen Lenght between 40.5 and 45. Once again, quantiles of standart normal distribution can be used

P(40.5 \leq X \leq 45)
:

z <-  rnorm(4000,  mean = dist_arg$estimate[1], sd = dist_arg$estimate[2])

dens <- density(z)

data <- tibble(x = dens$x, y = dens$y) %>% 
    mutate(variable = case_when(
      (x >= 40.5 & x <= 45) ~ "On",
      TRUE ~ NA_character_))

ggplot(data, aes(x, y)) + geom_line() +
  geom_area(data = filter(data, variable == 'On'), fill = '#00AFBB', alpha = 0.2) + 
  geom_vline(aes(xintercept = 40.5), lty = 2, color = "red") + 
  geom_vline(aes(xintercept = 45), lty = 2, color = "red")

zz2 <- pnorm(45, mean = dist_arg$estimate[1], sd = dist_arg$estimate[2]) - 
  pnorm(40.5, mean = dist_arg$estimate[1], sd = dist_arg$estimate[2])

zz2
## [1] 0.4599156

Where area displayed above is equal to 0.4599. Using MLE estimation to obtain parameters of normal distribution is one approach that can be used. Alternatively, since we are working with bootstraped sample we can estimate distribution’s mean and standart deviation as a \textbf{sample mean}, e.i. \hat\mu = \overline x and \textbf{sample standart deviation} as \hat\sigma = \frac{S'x}{\sqrt{n}}.

conf_intervals_original_data <- 
  grouped_boot %>% 
  unnest() %>% 
  group_by(species, sex) %>% 
  summarise(mean = mean(culmen_length_mm),
            sd = sd(culmen_length_mm),
            count = n(),
            sd_est = sd/sqrt(count)) %>% 
  mutate(conf.low = mean - (qnorm(0.975) * sd_est),
         conf.high = mean + (qnorm(0.975) * sd_est)) %>% 
   mutate(Method = "Sample Estimation")

  
conf_intervals_MLE <- 
  boot %>% 
  ungroup() %>% 
  unnest(tidy_mle) %>% 
  select(species, sex, term, estimate, std.error) %>% 
  distinct() %>% 
  mutate(conf.low = estimate - (qnorm(0.975) * std.error),
         conf.high = estimate + (qnorm(0.975) * std.error))

conf_intervals_MLE <- 
  conf_intervals_MLE %>% 
  pivot_wider(values_from = estimate:conf.high, 
              names_from = term) %>% 
  select(species, sex, mean = estimate_mean, sd = estimate_sd,
         conf.low = conf.low_mean, conf.high = conf.high_mean) %>% 
  mutate(Method = "ML - Estimation")


bind_rows(conf_intervals_original_data, conf_intervals_MLE) %>% 
  ggplot(aes(mean, species, color = species)) + 
  geom_point() + 
  geom_errorbar(aes(xmin = conf.low, xmax = conf.high)) +
  facet_grid(~Method~sex, scales = "free_y")