Petr Hrobař 2020-07-30
In this example we shall apply Principal component analysis (PCA) and Clustering methods for classifying paintings by Bob Ross. Data used for purposes of this analysis are comming from tidyverse community project. See more infromations here.
We are interested in dataset named . Dataset as well as its description can be found here.
First, let's do some basic data cleaning:
# packages
library(tidyverse)
library(reshape2)
library(knitr)
library(kableExtra)
theme_set(theme_light())
# Loading the data
bob_ross <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-06/bob-ross.csv")
# Data Cleaning
df_cleaned <-
bob_ross %>%
janitor::clean_names() %>%
gather(element, value, -title, -episode) %>%
mutate(title = str_to_title(str_remove_all(title, '"')),
element = str_to_title(str_replace(element, "_", " "))) %>%
extract(episode, c("season", "episode"), "S(.*)E(.*)",
convert = T) %>%
arrange(season, episode) %>%
filter(value == 1)
df_cleaned %>%
head()
## # A tibble: 6 x 5
## season episode title element value
## <int> <int> <chr> <chr> <dbl>
## 1 1 1 A Walk In The Woods Bushes 1
## 2 1 1 A Walk In The Woods Deciduous 1
## 3 1 1 A Walk In The Woods Grass 1
## 4 1 1 A Walk In The Woods River 1
## 5 1 1 A Walk In The Woods Tree 1
## 6 1 1 A Walk In The Woods Trees 1
Data are cleaned and gathered, i.e. each row is the element of painting e.g. "Beach", "Tree" etc, where we observe wheater one particular painting has perticulars elements.
Principal component analysis is a Dimensionality reduction technique that has been around for almost a century now. It turns out that PCA is so popular it has been independently reinvendet multiple times. Here we provide basic math overview of technique and various approahces to computation itself.
Let's say we have data matrix that is matrix. Where is the number of observation and is the number of observed features (variables). Firstly, we standardize the data so that columns means are equal to zero.
Steps above can be perform in using following code:
D <- df_cleaned %>%
acast(title ~ element)
D[c(1:10) ,c(25:35)]
## Flowers Fog Framed Grass Guest Half Circle_frame
## A Mild Winter's Day 0 0 0 0 0 0
## A Pretty Autumn Day 0 0 0 1 0 0
## A Walk In The Woods 0 0 0 1 0 0
## Absolutely Autumn 0 0 0 1 0 0
## After The Rain 0 0 0 0 0 0
## Anatomy Of A Wave 0 0 0 0 1 0
## Angler's Haven 0 0 0 0 0 0
## Arctic Beauty 0 0 0 1 0 0
## Arctic Winter Day 0 0 0 0 0 0
## Arizona Splendor 0 0 0 1 0 0
## Half Oval_frame Hills Lake Lighthouse Mill
## A Mild Winter's Day 0 0 0 0 0
## A Pretty Autumn Day 0 0 0 0 0
## A Walk In The Woods 0 0 0 0 0
## Absolutely Autumn 0 0 0 0 0
## After The Rain 0 0 0 0 0
## Anatomy Of A Wave 0 0 0 0 0
## Angler's Haven 0 0 1 0 0
## Arctic Beauty 0 0 0 0 0
## Arctic Winter Day 0 0 1 0 0
## Arizona Splendor 0 0 0 0 0
Now we have created a data matrix , where rows of the matrix are particular paintings and columns are indicading binary presence of the element in the painting. Now, we want to standardize the matrix and hence create a centered matrix , where column's means are equal to zero.
Formally, data centering can be written as:
.
Additionally, if each variables of data (columns of ) are measured on different units, we should not only subtract column means from each observation, but also devide each observation by associated column's standart deviation. This conversion proces is called the and it is commontly used to transfort normal distribution to standart normal distribution. However, since all of our columns in data matrix are binary indicators of element presence, no variance transformations are required.
# Centering the data - substracting columns means from each column
C <- t(t(D) - colMeans(D))
# Column means are now equal to zero.
colMeans(C[ ,c(2, 4, 5, 10)])
## Aurora Borealis Beach Boat Cactus
## -1.830613e-19 -3.242800e-18 -1.830613e-19 -5.840526e-19
As mentioned above, PCA has multiple approaches to the solution. We will focus on two solution that are at disposal.
-
- Eigenvector and values decomposition of Variance-Covariance Matrix
-
- Singular value decompositon of Centered data Matrix
First approach of solving PCA requires to have the of (Centered) dataset. Using Centered data matrix we can calculate the variance-covariance matrix as:
After this step, we can calculate PCA as a eigenvector and eigenvalue decompositon of variance-covariance matrix.
# Variance-Covariance Matrix
COV <- (t(C) %*% C) * 1/(nrow(C) - 1)
# eigenvectors and eigenvalues of covariance matrix
PCA_eigen <- eigen(COV)
# Weights of elements on first four PC space.
PCA_eigen$vectors %>%
as.data.frame() %>%
mutate(elements = colnames(D)) %>%
select(elements, PC1 = V1, PC2 = V2, PC3 = V3, PC4 = V4) %>%
tibble() %>%
head()
## # A tibble: 6 x 5
## elements PC1 PC2 PC3 PC4
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Apple Frame -0.00179 0.00366 0.00213 0.00527
## 2 Aurora Borealis 0.00428 0.00617 0.0151 -0.00180
## 3 Barn -0.0357 0.0446 0.0561 0.0173
## 4 Beach -0.129 -0.223 0.0521 0.0114
## 5 Boat -0.00728 -0.00791 0.00363 0.0101
## 6 Bridge -0.0125 0.0262 0.00991 -0.00214
This approach was performed just to demenstrate basic algebra behind the PCA.
Second approach of solving PCA uses . It turns out that computing variance-covariance matrix and then performing eigenvectors and eigenvalues caluclation is not necessarily the most computationally efficient way of performing PCA.
Now let's use SVD to obtain PCA results. Firstly, we take our center matrix and perform SVD.
# SVD on center matrix C
PCA_svd <- svd(C)
# Weights of elements on first four PC space.
PCA_svd %>%
broom::tidy(matrix = "v") %>%
filter(PC <= 4) %>%
mutate(elements = colnames(D)[column]) %>%
pivot_wider(names_from = PC,
names_prefix = "PC") %>%
select(-column) %>%
head()
## # A tibble: 6 x 5
## elements PC1 PC2 PC3 PC4
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Apple Frame -0.00179 0.00366 -0.00213 0.00527
## 2 Aurora Borealis 0.00428 0.00617 -0.0151 -0.00180
## 3 Barn -0.0357 0.0446 -0.0561 0.0173
## 4 Beach -0.129 -0.223 -0.0521 0.0114
## 5 Boat -0.00728 -0.00791 -0.00363 0.0101
## 6 Bridge -0.0125 0.0262 -0.00991 -0.00214
So we can see that both results are equal. Quickly, we can use build-in function to confirm that the results are exactly whe same:
# Build in function (Hastie)
PCA_func <- prcomp(D, center = T)
# Weights of elements on first four PC space.
PCA_func$rotation %>%
as.data.frame() %>%
mutate(elements = colnames(D)) %>%
select(elements, PC1, PC2, PC3, PC4) %>%
tibble() %>%
head()
## # A tibble: 6 x 5
## elements PC1 PC2 PC3 PC4
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Apple Frame -0.00179 0.00366 -0.00213 0.00527
## 2 Aurora Borealis 0.00428 0.00617 -0.0151 -0.00180
## 3 Barn -0.0357 0.0446 -0.0561 0.0173
## 4 Beach -0.129 -0.223 -0.0521 0.0114
## 5 Boat -0.00728 -0.00791 -0.00363 0.0101
## 6 Bridge -0.0125 0.0262 -0.00991 -0.00214
Since all the results are the same, we can from now on work only with one PCA object. Let's work with object we named PCA_svd. We are interested in what percentage of variability is explained by each component.
PCA_svd %>%
broom::tidy(matrix = "d") %>%
select(-std.dev) %>%
gather(key, value, -PC) %>%
ggplot(aes(PC, value, color = key)) +
geom_line(size = 1) +
geom_point(size = 2.5) +
scale_y_continuous(labels = scales::percent_format()) +
scale_x_continuous(breaks = c(seq(0, 66, by = 5))) +
labs(x = "PC",
y = "Percentage of variability explained",
title = "Percentage of variability explained by each PC",
color = "Type of measure")
As can be seen, using just 2 PCs we are able to explain about 27% of variablitity of elements presence. Using 3 PCs number of variability explained will increase to 37%.
Let's examine whether some form of clustering is present.
PCA_svd %>%
broom::tidy(matrix = "v") %>%
filter(PC <= 4) %>%
mutate(elements = colnames(D)[column]) %>%
pivot_wider(names_from = PC,
names_prefix = "PC") %>%
select(-column) %>%
ggplot(aes(PC1, PC2, label = elements)) +
geom_point(color = "dodgerblue", size = 2) +
geom_text(color = "gray21") +
geom_hline(aes(yintercept = 0), lty = 2, color = "red") +
geom_vline(aes(xintercept = 0), lty = 2, color = "red") +
labs(title = "Element's Weights on PC1 and PC2 space")
Above we can see minor clustering of elements of the paintings that tend to appear together. For example, . Still a lot of variability is not explained using simple 2-dimensions. Now we can inspect what paintings tend to be more similiar:
set.seed(2020)
Paintings_clusters <-
PCA_svd %>%
broom::tidy(matrix = "u") %>%
filter(PC <= 4) %>%
mutate(painting = rownames(D)[row]) %>%
pivot_wider(names_from = PC,
names_prefix = "PC") %>%
select(painting, PC1, PC2) %>%
filter(grepl("Winter's|Beach|Waves|Waterfall|sun|Sun", painting)) %>%
mutate(color_element =
case_when(
grepl("Winter's", painting) ~ "Winter Theme",
grepl("Beach", painting) ~ "Beach Theme",
grepl("Waves", painting) ~ "Beach Theme")) %>%
filter(!is.na(color_element))
Paintings_clusters %>%
ggplot(aes(PC1, PC2, label = painting, color = color_element)) +
geom_point() +
geom_text(check_overlap = TRUE, size = 3) +
geom_hline(aes(yintercept = 0), lty = 2, color = "red") +
geom_vline(aes(xintercept = 0), lty = 2, color = "red") +
labs(title = "Painting's clustering using first two Principal Components",
subtitle = "We are inspecting themes of elements, e.g. Winter, beach, Sun etc.")
We can see that we can somewhat cluster paintings using PCA of painting's elements. Lets inspect some paintings here.
Clusts <-
Paintings_clusters %>%
select(painting, color_element) %>%
mutate(color_element =
case_when(color_element == "Winter Theme" ~ 1,
TRUE ~ 2)) %>%
pivot_wider(names_from = color_element, values_from = painting) %>%
apply(2, unlist)
Clusts
## 1 2
## [1,] "A Mild Winter's Day" "Balmy Beach"
## [2,] "Perfect Winter's Day" "Rowboat On The Beach"
## [3,] "Winter's Elegance" "Secluded Beach"
## [4,] "Winter's Grace" "Sunset Over The Waves"
## [5,] "Winter's Paradise" "Waves Of Wonder"
## [6,] "Winter's Peace" "Windy Waves"
Now we can inspect how much similiar are each paintings within the clusters.
all painting used for purposes of this application of PCA are property of ®Bob Ross.