week10_tutorial.qmd

---
title: "ETC1010/5510 Tutorial 10"
subtitle: "Introduction to Data Analysis"
author: "Patrick Li"
date: "Sep 30, 2024"
format: 
  html:
    toc: true
    embed-resources: true
---


```{r setup, include = FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  eval = FALSE,
  message = FALSE, 
  warning = FALSE,
  error = FALSE, 
  out.width = "70%",
  fig.width = 8, 
  fig.height = 6,
  fig.retina = 3)
set.seed(6)
filter <- dplyr::filter
```


## `r emo::ji("target")` Workshop Objectives
This workshops will be a bit different to what we've done before. Effective clustering does not work well with blind application of techniques. In this workshop, you will try to gain an understanding of

- what types of questions you can answer with cluster analysis;
- how to interpret the output of cluster analysis; 
- how to choose between clustering methods. 


## `r emo::ji("wrench")` Instructions


Install the necessary packages.

```{r}
library(tidyverse)
library(ggdendro)
```

 
## 🥡 Exercise: Understanding Cluster Analysis

You're unsettled by the current political and economic situation in the United States, so you obtain the following dataset and wish to see if their is some relationship that explains the occurrence of violent crime in the United States. 

```{r}
load("data/USArrests_simp.Rdata")
arrest <- df_new
rm(df_new)
```

The dataset contains statistics on arrests per 100k residents for assault and murder in each of the 50 states. 

You want to understand the relationship between the murder rate, assault rate, and the urban population across the United States. 

### EDA

First, complete the following basic EDA to try and get some understanding of the data. 

1. Using graphical means, analyse the relationship between Assaults and Murders, controlling for the effect of population.  Do this in `ggplot` using a scatterplot and by choosing the size of the points via `UrbanPop`.
  - Describe the relationship between assaults and murders in words. 
  - What do you notice about the relationship between population size and murders/assaults? 

2. Another way to analyze the relationship between assaults and murders, conditional on population size, is to break up the population size into categories based on the quantiles of the variable `UrbanPop`, and then analyze the distribution across those categories. This can be done using  

```{r}
arrest <- arrest %>%
  mutate(category = cut(UrbanPop, 
                        breaks = quantile(UrbanPop, c(0, 1/3 , 2/3 , 1)), 
                        labels = c("low", "middle", "high"), 
                        include.lowest = TRUE))
```
  
  - Report the mean number of assaults and murders by category. What do you notice? 
  - Plot and describe the relationships between murders/assaults by category using boxplots. 
  - Contrast your answers from the two previous questions. 
  - Visualise the relationship between murders/assaults, conditional on the category for `UrbanPop`. What do you notice about the low population states?

```{r}
# Controlling for UrbanPop
p <- arrest %>% 
  ggplot() + 
  geom_point(aes(x = ___, y = ___, col = ___, label = ___))

plotly::ggplotly(p)
```

3. Are the relationships you've found in part 2 the same for states that are in the southern United States, versus elsewhere? To answer this, use the following definition of "southern states", and answer the following questions. 

```{r}
south <- c("Alabama", "Arkansas", "Delaware", "Florida", "Georgia", "Kentucky", 
           "Louisiana", "Maryland", "Mississippi", "North Carolina", "Oklahoma", 
           "South Carolina", "Tennessee", "Texas", "Virginia", "West Virginia")
```
  
  - Compare the means of assaults/murders for southern states against the totals from earlier. What do you notice?
  - Visualise the relationship between murders/assaults, conditional on the category for `UrbanPop`, and use different shapes for southern states to the remaining states; i.e., use `shape = xxx` inside `geom_point`.

### Clustering

Clearly, there is a relationship between the size of the population, its geographic location, and the number of murders and assaults. However, as our graphical analysis demonstrated, there is not a single relationship. So, it may be best to try and cluster these observations. 

We will consider two different types of clustering: hierarchical and k-means.

### Hierarchical Clustering. 

1. Use hierarchical clustering and complete linkage to cluster the data. 
  - Keep only numerical variables. The dummy variable `South` can also be kept.
  - We do not need to standardize the variables because the solution will be better without scaling.
  - Compute the distance matrix using Euclidean distance.
  - Create the dendogram. 
  - Compare the solutions across different clusters. Which solution seems most stable? 
  
```{r}
hclust_result <- arrest %>%
  select(-States, -category) %>%
  `rownames<-`(arrest$States) %>%
  dist(method = ___) %>%
  ___(___)

ggdendrogram(___)
```
  
2. Cut the tree by 4 clusters. Use the output to answer the following questions
  - Plot the relationship between Assault and Murder again, but choose the color of the plot corresponding to the Cluster in which the observation lies. What pattern do you see?
  
```{r}
p <- arrest %>%
  ggplot() +
  geom_point(aes(x = ___, 
                 y = ___, 
                 col = factor(cutree(___, ___)),
                 size = category,
                 label = States),
             alpha = 0.6) +
  labs(col = "cluster")

plotly::ggplotly(p)
```

  - Now, overlay the previous plot with the geographic information (South) by choosing the shape of the plot to depend on the variable South. What do you notice about the southern states? 
  
```{r}
p <- arrest %>%
  ggplot() +
  geom_point(aes(x = ___, 
                 y = ___, 
                 col = factor(cutree(___, ___)),
                 size = category,
                 shape = ___,
                 label = States),
             alpha = 0.6) +
  labs(col = "cluster")

plotly::ggplotly(p)
```

### K-means Clustering. 

1. Obtain the K-means clusters with $k=3$ and $k=4$ clusters. 

```{r}
kmeans_result_3 <- arrest %>%
  select(-States, -category) %>%
  `rownames<-`(arrest$States) %>%
  kmeans(centers = ___)

kmeans_result_4 <- arrest %>%
  select(-States, -category) %>%
  `rownames<-`(arrest$States) %>%
  kmeans(centers = ___)
```


2. Plot the different cluster membership totals in each cluster by category using `geom_bar()`. 

```{r}
p1 <- arrest %>%
  mutate(cluster = ___) %>%
  ggplot() +
  geom_bar(aes(x = ___, fill = ___)) +
  ggtitle("Number of observations in each cluster by category",
          subtitle = "(3-cluster solution)")

p2 <- arrest %>%
  mutate(cluster = ___) %>%
  ggplot() +
  geom_bar(aes(x = ___, fill = ___)) +
  ggtitle("Number of observations in each cluster by category",
          subtitle = "(4-cluster solution)")

patchwork::wrap_plots(p1, p2, ncol = 2)
```


3. For $k=4$ clusters, plot the relationship between Assault and Murder again, but choose the color of the plot corresponding to the Cluster in which the observation lies. 

```{r}
p <- arrest %>%
  mutate(cluster = ___) %>%
  ggplot() +
  geom_point(aes(x = Murder, y = Assault, col = ___, size = category, label = States),
             alpha = 0.6) +
  labs(col = "cluster")

plotly::ggplotly(p)
```

  - Now, overlay the geographic indicator. How do the results differ from the other clustering algorithm?
  
```{r}
p <- arrest %>%
  mutate(cluster = ___) %>%
  ggplot() +
  geom_point(aes(x = Murder, y = Assault, col = ___, size = category, shape = ___, label = States),
             alpha = 0.6) +
  labs(col = "cluster")

plotly::ggplotly(p)
```

  
4. Which algorithm gives more meaningful interpretations to the data? Why?