\documentclass[SKL-MASTER.tex]{subfiles}
\section*{Using KMeans for outlier detection}
In this section, we'll look at both the debate around and the mechanics of using KMeans for outlier detection. It can be useful for isolating some types of errors, but it should be used with care.
\subsection*{Removal of Outliers}
\begin{itemize}
\item In this section, we'll use KMeans to perform outlier detection on a cluster of points.
\item It's important to note that there are many schools of thought when it comes to outliers and outlier detection. On one hand, by removing outliers we're potentially removing points that were generated by the data-generating process.
\item On the other hand, outliers can be due to a measurement error or some other outside factor.
\item This is the most credence we'll give to the debate; the rest of this recipe is about finding outliers, and we'll work under the assumption that our choice to remove them is justified.
\item The act of outlier detection is then a matter of finding the centroid of the cluster and identifying potential outliers by their distances from that centroid, as sketched below.
\end{itemize}
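Before walking through the steps, here is a minimal sketch of the whole procedure; the helper name \texttt{farthest\_point\_indices} is our own invention, not a scikit-learn function:
\begin{framed}
\begin{verbatim}
# A minimal sketch of the recipe: fit a single-cluster KMeans and
# return the indices of the n points farthest from its centroid.
import numpy as np
from sklearn.cluster import KMeans

def farthest_point_indices(X, n=5):
    km = KMeans(n_clusters=1).fit(X)
    # transform() gives each point's distance to the centroid
    distances = km.transform(X).ravel()
    return np.argsort(distances)[-n:]
\end{verbatim}
\end{framed}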
\subsection*{Generating a Data Set}
First, we'll generate a single blob of 100 points, and then we'll identify the 5 points that are
furthest from the centroid. These are the potential outliers:
\begin{framed}
\begin{verbatim}
>>> from sklearn.datasets import make_blobs
>>> X, labels = make_blobs(100, centers=1)
>>> import numpy as np
>>> import matplotlib.pyplot as plt
\end{verbatim}
\end{framed}
It's important that we fit KMeans with a single cluster center here. This idea is similar to the one-class SVM used for outlier detection, as sketched after the following code block:
\begin{framed}
\begin{verbatim}
>>> from sklearn.cluster import KMeans
>>> kmeans = KMeans(n_clusters=1)
>>> kmeans.fit(X)
\end{verbatim}
\end{framed}
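To make the one-class SVM analogy concrete, here is a hedged sketch using scikit-learn's \texttt{OneClassSVM}; the \texttt{nu} value (roughly 5\% of the 100 points) and the variable names are our own choices, and this block is not part of the recipe's workflow:
\begin{framed}
\begin{verbatim}
# For comparison only: OneClassSVM also flags a fraction of points as
# outliers. nu=0.05 (about 5 of 100 points) is an illustrative choice.
from sklearn.svm import OneClassSVM

ocsvm = OneClassSVM(nu=0.05, gamma='auto')
svm_labels = ocsvm.fit_predict(X)        # -1 marks predicted outliers
svm_outlier_idx = np.where(svm_labels == -1)[0]
\end{verbatim}
\end{framed}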
Now, let's look at the plot. For those playing along at home, try to guess which points will be
identified as one of the five outliers:
\begin{framed}
\begin{verbatim}
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.set_title("Blob")
>>> ax.scatter(X[:, 0], X[:, 1], label='Points')
>>> ax.scatter(kmeans.cluster_centers_[:, 0],
kmeans.cluster_centers_[:, 1],
label='Centroid',
color='r')
>>> ax.legend()
\end{verbatim}
\end{framed}
The resulting plot shows the single blob of points with the centroid marked in red.
Now, let's identify the five points farthest from the centroid:
\begin{framed}
\begin{verbatim}
>>> distances = kmeans.transform(X)
# argsort returns the indexes that sort the array in ascending order,
# so we reverse it via [::-1] and take the top five with [:5]
>>> sorted_idx = np.argsort(distances.ravel())[::-1][:5]
\end{verbatim}
\end{framed}
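As a quick sanity check (not in the original text), \texttt{kmeans.transform} returns each point's Euclidean distance to the cluster center, so it should agree with a direct norm computation:
\begin{framed}
\begin{verbatim}
>>> manual = np.linalg.norm(X - kmeans.cluster_centers_[0], axis=1)
>>> np.allclose(manual, distances.ravel())
True
\end{verbatim}
\end{framed}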
Now, let's see which points are the farthest away:
\begin{framed}
\begin{verbatim}
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.set_title("Single Cluster")
>>> ax.scatter(X[:, 0], X[:, 1], label='Points')
>>> ax.scatter(kmeans.cluster_centers_[:, 0],
kmeans.cluster_centers_[:, 1],
label='Centroid', color='r')
>>> ax.scatter(X[sorted_idx][:, 0], X[sorted_idx][:, 1],
label='Extreme Value', edgecolors='g',
facecolors='none', s=100)
>>> ax.legend(loc='best')
\end{verbatim}
\end{framed}
The resulting plot marks the five extreme values with open green circles.
It's easy to remove these points if we like:
\begin{framed}
\begin{verbatim}
>>> new_X = np.delete(X, sorted_idx, axis=0)
\end{verbatim}
\end{framed}
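Equivalently, a boolean mask keeps every row except the flagged ones; this is merely an alternative to \texttt{np.delete}, not part of the original recipe:
\begin{framed}
\begin{verbatim}
>>> mask = np.ones(len(X), dtype=bool)   # keep everything by default
>>> mask[sorted_idx] = False             # drop the five extreme points
>>> np.all(X[mask] == new_X)
True
\end{verbatim}
\end{framed}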
Also, the centroid changes slightly with the removal of these points:
\begin{framed}
\begin{verbatim}
>>> new_kmeans = KMeans(n_clusters=1)
>>> new_kmeans.fit(new_X)
\end{verbatim}
\end{framed}
Let's visualize the difference between the old and new centroids:
\begin{framed}
\begin{verbatim}
>>> f, ax = plt.subplots(figsize=(7, 5))
>>> ax.set_title("Extreme Values Removed")
>>> ax.scatter(new_X[:, 0], new_X[:, 1],
label='Pruned Points')
>>> ax.scatter(kmeans.cluster_centers_[:, 0],
kmeans.cluster_centers_[:, 1],
label='Old Centroid',
color='r', s=80, alpha=.5)
>>> ax.scatter(new_kmeans.cluster_centers_[:, 0],
new_kmeans.cluster_centers_[:, 1],
label='New Centroid',
color='m', s=80, alpha=.5)
>>> ax.legend(loc='best')
\end{verbatim}
\end{framed}
The resulting plot shows the old and new centroids lying nearly on top of each other.
Clearly, the centroid hasn't moved much, which is to be expected when removing only the five most extreme values. This process can be repeated until we're satisfied that the data is representative of the process.
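A sketch of repeating that pruning step is given below; the number of rounds and the five-points-per-round rule are placeholder choices of ours, since the recipe does not prescribe a stopping criterion:
\begin{framed}
\begin{verbatim}
# Repeated pruning with a fixed number of rounds; the stopping rule
# here is a placeholder, not something prescribed by the recipe.
pruned = X.copy()
for _ in range(3):                      # e.g. three pruning rounds
    km = KMeans(n_clusters=1).fit(pruned)
    d = km.transform(pruned).ravel()
    worst = np.argsort(d)[-5:]          # five most extreme points
    pruned = np.delete(pruned, worst, axis=0)
\end{verbatim}
\end{framed}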
\subsection*{Gaussian Distribution}
As we've already seen, there is a fundamental connection between the Gaussian distribution and KMeans clustering. Let's create an empirical Gaussian centered at the centroid and look at the probability of each point; in theory, the five points we removed should be the least likely ones. This shows that we have, in fact, removed the values with the lowest likelihood. (Note that scipy's \texttt{multivariate\_normal} defaults to an identity covariance, under which likelihood falls off with Euclidean distance from the mean; a sketch using the sample covariance matrix follows the code block.) This connection between distances and likelihoods is very important, and it will come up quite often in your machine learning training.
Use the following command to create an empirical Gaussian:
\begin{framed}
\begin{verbatim}
>>> from scipy import stats
>>> emp_dist = stats.multivariate_normal(
kmeans.cluster_centers_.ravel())
>>> lowest_prob_idx = np.argsort(emp_dist.pdf(X))[:5]
>>> np.all(X[sorted_idx] == X[lowest_prob_idx])
True
\end{verbatim}
\end{framed}
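For completeness, here is a sketch of the sample-covariance variant the text alludes to; the variable names are our own, and with a non-spherical covariance the least likely points are ranked by Mahalanobis distance, so they need not coincide exactly with the points farthest in Euclidean distance:
\begin{framed}
\begin{verbatim}
# Variant using the sample covariance instead of the default identity
# covariance; variable names here are illustrative.
cov_dist = stats.multivariate_normal(
    mean=kmeans.cluster_centers_.ravel(),
    cov=np.cov(X.T))
lowest_prob_cov_idx = np.argsort(cov_dist.pdf(X))[:5]
\end{verbatim}
\end{framed}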
\end{document}