---
title: "ETC1010/5510 Tutorial 10"
subtitle: "Introduction to Data Analysis"
author: "Patrick Li"
date: "Sep 30, 2024"
format:
  html:
    toc: true
    embed-resources: true
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  eval = FALSE,
  message = FALSE,
  warning = FALSE,
  error = FALSE,
  out.width = "70%",
  fig.width = 8,
  fig.height = 6,
  fig.retina = 3)
set.seed(6)
filter <- dplyr::filter
```
## `r emo::ji("target")` Workshop Objectives
This workshop will be a bit different from what we've done before. Clustering rarely works well when techniques are applied blindly. In this workshop, you will try to gain an understanding of
- what types of questions you can answer with cluster analysis;
- how to interpret the output of cluster analysis;
- how to choose between clustering methods.
## `r emo::ji("wrench")` Instructions
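Install the necessary packages if you have not already, then load them. The exercises also call `plotly` and `patchwork` through `::`, so those need to be installed as well.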
```{r}
library(tidyverse)
library(ggdendro)
```
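If you are unsure which packages are already installed, a small check like the one below can list the missing ones before you start. This is a sketch rather than part of the original instructions: the package vector is assembled from the `library()` and `::` calls in the code chunks of this tutorial, and `missing_pkgs` is a name chosen here for illustration.

```r
# Packages referenced by the code chunks in this tutorial
# (assumption: assembled by hand from the library() and :: calls)
pkgs <- c("tidyverse", "ggdendro", "plotly", "patchwork")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install from CRAN:
  # install.packages(missing_pkgs)
} else {
  message("All packages are available.")
}
```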
## 🥡 Exercise: Understanding Cluster Analysis
You're unsettled by the current political and economic situation in the United States, so you obtain the following dataset and wish to see if there is some relationship that explains the occurrence of violent crime in the United States.
```{r}
load("data/USArrests_simp.Rdata")
arrest <- df_new
rm(df_new)
```
The dataset contains statistics on arrests per 100k residents for assault and murder in each of the 50 states.
You want to understand the relationship between the murder rate, assault rate, and the urban population across the United States.
### EDA
First, complete the following basic EDA to try and get some understanding of the data.
1. Using graphical means, analyse the relationship between assaults and murders, controlling for the effect of population. Do this in `ggplot` with a scatterplot, mapping the size of the points to `UrbanPop`.
- Describe the relationship between assaults and murders in words.
- What do you notice about the relationship between population size and murders/assaults?
2. Another way to analyse the relationship between assaults and murders, conditional on population size, is to break up the population size into categories based on the quantiles of the variable `UrbanPop`, and then analyse the distribution across those categories. This can be done using:
```{r}
arrest <- arrest %>%
  mutate(category = cut(UrbanPop,
                        breaks = quantile(UrbanPop, c(0, 1/3, 2/3, 1)),
                        labels = c("low", "middle", "high"),
                        include.lowest = TRUE))
```
- Report the mean number of assaults and murders by category. What do you notice?
- Plot and describe the relationships between murders/assaults by category using boxplots.
- Contrast your answers from the two previous questions.
- Visualise the relationship between murders/assaults, conditional on the category for `UrbanPop`. What do you notice about the low-population states?
```{r}
# Controlling for UrbanPop
p <- arrest %>%
  ggplot() +
  geom_point(aes(x = ___, y = ___, col = ___, label = ___))
plotly::ggplotly(p)
```
3. Are the relationships you've found in part 2 the same for states that are in the southern United States, versus elsewhere? To answer this, use the following definition of "southern states", and answer the following questions.
```{r}
south <- c("Alabama", "Arkansas", "Delaware", "Florida", "Georgia", "Kentucky",
           "Louisiana", "Maryland", "Mississippi", "North Carolina", "Oklahoma",
           "South Carolina", "Tennessee", "Texas", "Virginia", "West Virginia")
```
- Compare the means of assaults/murders for southern states against the means you computed earlier. What do you notice?
- Visualise the relationship between murders/assaults, conditional on the category for `UrbanPop`, and use different shapes for southern states to the remaining states; i.e., use `shape = xxx` inside `geom_point`.
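The quantile-based binning used in part 2 can be checked on a small vector before applying it to the real data. The sketch below uses made-up values standing in for `UrbanPop` (the names `x`, `breaks`, and `bins` are illustrative) and shows why `include.lowest = TRUE` is needed: `cut()` builds intervals that are open on the left, so the minimum value would otherwise fall outside the first bin.

```r
# Made-up values standing in for UrbanPop (for illustration only)
x <- c(32, 45, 48, 50, 54, 60, 66, 70, 80, 91)

# Tercile break points: minimum, 1/3 quantile, 2/3 quantile, maximum
breaks <- quantile(x, c(0, 1/3, 2/3, 1))

# With include.lowest = TRUE the minimum is kept in the "low" bin
bins <- cut(x, breaks = breaks,
            labels = c("low", "middle", "high"),
            include.lowest = TRUE)
table(bins)

# Without include.lowest, the smallest observation becomes NA
bins_open <- cut(x, breaks = breaks, labels = c("low", "middle", "high"))
sum(is.na(bins_open))
```

Counting the `NA` values this way is a quick sanity check to run on `arrest$category` as well.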
### Clustering
Clearly, there is a relationship between the size of the population, its geographic location, and the number of murders and assaults. However, as our graphical analysis demonstrated, no single relationship holds across all states. So, it may be useful to cluster these observations.
We will consider two different types of clustering: hierarchical and k-means.
### Hierarchical Clustering
1. Use hierarchical clustering and complete linkage to cluster the data.
- Keep only numerical variables. The dummy variable `South` can also be kept.
- We do not need to standardize the variables because the solution will be better without scaling.
- Compute the distance matrix using Euclidean distance.
- Create the dendrogram.
- Compare the solutions for different numbers of clusters. Which solution seems most stable?
```{r}
hclust_result <- arrest %>%
  select(-States, -category) %>%
  `rownames<-`(arrest$States) %>%
  dist(method = ___) %>%
  ___(___)
ggdendrogram(___)
```
2. Cut the tree into 4 clusters. Use the output to answer the following questions.
- Plot the relationship between Assault and Murder again, but color each point by the cluster in which the observation lies. What pattern do you see?
```{r}
p <- arrest %>%
  ggplot() +
  geom_point(aes(x = ___,
                 y = ___,
                 col = factor(cutree(___, ___)),
                 size = category,
                 label = States),
             alpha = 0.6) +
  labs(col = "cluster")
plotly::ggplotly(p)
```
- Now, overlay the previous plot with the geographic information (`South`) by mapping the shape of the points to the variable `South`. What do you notice about the southern states?
```{r}
p <- arrest %>%
  ggplot() +
  geom_point(aes(x = ___,
                 y = ___,
                 col = factor(cutree(___, ___)),
                 size = category,
                 shape = ___,
                 label = States),
             alpha = 0.6) +
  labs(col = "cluster")
plotly::ggplotly(p)
```
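Step 1 above skips standardization. One way to see how much that choice matters is to build the tree with and without scaling and cross-tabulate the cluster memberships. The sketch below uses R's built-in `USArrests` data as a stand-in, on the assumption that it resembles the tutorial's `USArrests_simp.Rdata` file, which may have different columns. The object names `hc_raw`, `hc_scaled`, and `comparison` are illustrative.

```r
# Built-in dataset (datasets::USArrests), used as a stand-in for the tutorial data
x <- USArrests[, c("Murder", "Assault", "UrbanPop")]

# Complete-linkage trees on Euclidean distances, unscaled and standardized
hc_raw    <- hclust(dist(x, method = "euclidean"), method = "complete")
hc_scaled <- hclust(dist(scale(x), method = "euclidean"), method = "complete")

# Rows: 4-cluster membership from raw data; columns: from standardized data.
# Cluster labels are arbitrary, so look at how states are regrouped,
# not at whether the label numbers match.
comparison <- table(raw = cutree(hc_raw, k = 4),
                    scaled = cutree(hc_scaled, k = 4))
comparison

# Standard deviations show which variable contributes most to unscaled distances
sapply(x, sd)
```

If most states stay grouped together in the table, the choice has little effect on the solution; if they are spread across many cells, the clusters depend strongly on the measurement scale.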
### K-means Clustering
1. Obtain the k-means solutions with $k=3$ and $k=4$ clusters.
```{r}
kmeans_result_3 <- arrest %>%
  select(-States, -category) %>%
  `rownames<-`(arrest$States) %>%
  kmeans(centers = ___)

kmeans_result_4 <- arrest %>%
  select(-States, -category) %>%
  `rownames<-`(arrest$States) %>%
  kmeans(centers = ___)
```
2. Plot the number of observations in each cluster by category using `geom_bar()`.
```{r}
p1 <- arrest %>%
  mutate(cluster = ___) %>%
  ggplot() +
  geom_bar(aes(x = ___, fill = ___)) +
  ggtitle("Number of observations in each cluster by category",
          subtitle = "(3-cluster solution)")

p2 <- arrest %>%
  mutate(cluster = ___) %>%
  ggplot() +
  geom_bar(aes(x = ___, fill = ___)) +
  ggtitle("Number of observations in each cluster by category",
          subtitle = "(4-cluster solution)")
patchwork::wrap_plots(p1, p2, ncol = 2)
```
3. For $k=4$ clusters, plot the relationship between Assault and Murder again, but color each point by the cluster in which the observation lies.
```{r}
p <- arrest %>%
  mutate(cluster = ___) %>%
  ggplot() +
  geom_point(aes(x = Murder,
                 y = Assault,
                 col = ___,
                 size = category,
                 label = States),
             alpha = 0.6) +
  labs(col = "cluster")
plotly::ggplotly(p)
```
- Now, overlay the geographic indicator. How do the results differ from those of hierarchical clustering?
```{r}
p <- arrest %>%
  mutate(cluster = ___) %>%
  ggplot() +
  geom_point(aes(x = Murder,
                 y = Assault,
                 col = ___,
                 size = category,
                 shape = ___,
                 label = States),
             alpha = 0.6) +
  labs(col = "cluster")
plotly::ggplotly(p)
```
4. Which algorithm gives a more meaningful interpretation of the data? Why?
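When comparing the two algorithms, it can help to cross-tabulate their cluster memberships directly. The sketch below is one possible approach, not the intended solution: it again uses the built-in `USArrests` data as a stand-in for the tutorial dataset, and the names `km`, `hc`, and `agreement` are illustrative. Because k-means starts from randomly chosen centers, the example fixes the seed and uses `nstart` to rerun the algorithm several times and keep the best fit.

```r
# Built-in dataset (datasets::USArrests), used as a stand-in for the tutorial data
x <- USArrests[, c("Murder", "Assault", "UrbanPop")]

# k-means depends on random starting centers: fix the seed for reproducibility
# and let nstart try 25 random starts, keeping the lowest within-cluster SS
set.seed(6)
km <- kmeans(x, centers = 4, nstart = 25)

# Complete-linkage hierarchical clustering on the same variables
hc <- hclust(dist(x, method = "euclidean"), method = "complete")

# Rows: hierarchical clusters; columns: k-means clusters.
# Labels are arbitrary, so a near one-to-one pattern indicates agreement
# even when the cluster numbers differ.
agreement <- table(hierarchical = cutree(hc, k = 4), kmeans = km$cluster)
agreement

# Total within-cluster sum of squares of the k-means solution
km$tot.withinss
```

States that switch groups between the two solutions are good candidates to inspect in the scatterplots above.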