Spectral clustering is a widely used algorithm for learning from large, unlabeled data sources.
We will analyze the spectral clustering paper of Ng et al.\ \cite{ng2002spectral} from
NIPS 2001, which proposes a concise algorithm and provides theoretical
backing for its choices. The paper reduces the problem to simple k-means
clustering in the space spanned by the top eigenvectors of a normalized affinity
(equivalently, Laplacian) matrix of the data, provides theoretical bounds relating
the clusters in this space to clusters in the data space, and presents initial
experimental results using this algorithm.
\subsection{The Clustering Problem}
Spectral clustering aims to solve the problem of grouping large amounts of high-dimensional
data into a small number of clusters. In the absence of labels,
clustering can be extremely useful for data analysis, visualization, and
unsupervised learning. We desire a clustering algorithm that takes as input the
following:
\begin{align*}
\textbf{Ideal input: } \label{clustering:idealinput} \\
X &: \begin{pmatrix}
x_1 \\
x_2 \\
\vdots \\
x_n
\end{pmatrix} \in \R^{n \times d}, \; x_i \in \R^d \\
s(x_i, x_j) &: \text{A real-valued similarity function.}
\end{align*}
and output an assignment of points to clusters as follows:
\begin{align*}
\textbf{Ideal output: } \\
C &: \{ C_{i_1}, C_{i_2}, \cdots, C_{i_n} \}; \quad i_j \in \{1,...,k\} \\
k &: \text{Number of clusters}
\end{align*}
Unfortunately, the problem as described above is slightly ill-posed:
we have not specified what kind of cluster assignments we desire.
An algorithm that simply assigns a unique cluster index to each data point is
a valid solution, as is one that assigns every data point to the same cluster.
Intuitively, if we have no further information about the data points, we want
clusters to be \textit{tight} and \textit{separated}. This intuition can be more
formally described via \textit{internal} cluster evaluation metrics, such as the
Davies-Bouldin index \cite{davies1979cluster}.
\subsubsection{Relation to graphs}
This problem has a natural graph analog. Create a graph $G$ whose
nodes are the points $x_i$, with edge weights equal to the
similarities between them, as defined by $s(x_i, x_j)$. In this formulation, we
desire clusters corresponding to components that arise from \textit{sparse}
cuts.
\subsection{Ng, Jordan, Weiss (NJW) Algorithm}
The Ng, Jordan, Weiss algorithm presents a spectral approach to clustering. It
takes as input $X$ as defined in the ideal input, and the following
\begin{align*}
k &: \text{The number of clusters expected} \\
\sigma &: \text{A scaling parameter for the similarity function.}
\end{align*}
The similarity metric is pre-defined as follows:
\begin{align*}
s(x_i, x_j) = \exp\left(-\frac{\norm{x_i - x_j}^2}{2\sigma^2}\right)
\end{align*}
Note that this algorithm requires advance knowledge of the number of clusters,
an issue that we will attempt to alleviate later.
The algorithm is as follows (a code sketch is given after the steps):
\begin{enumerate}
\item Construct the matrix $A$ such that $A_{ij} = \begin{cases}s(x_i, x_j) &\;\text{ if } i \neq j \\
0 &\;\text{ otherwise}\end{cases}$
\item Let $D$ be $diag(A \vec{\mathbf{1}})$. Let $\cA = D^{-1/2} A D^{-1/2}$. (Normalize $A$.)
\item Find the eigenvectors corresponding to the $k$ largest eigenvalues of $\cA$, $v_1,...,v_k$, and form the
matrix $V = [v_1, ..., v_k] \in \R^{n \times k}$.
\item Form the matrix $Y$ by normalizing the rows of $V$ to have unit length:
$Y_{ij} = V_{ij} / \left(\sum_j V_{ij}^2\right)^{1/2}$.
\item Treat $Y$ as a data matrix, so that $Y = [y_1, ..., y_n]^T$, and
cluster these rows of $Y$ into $k$ clusters (e.g., via k-means) to obtain the cluster assignments for the $y_i$:
$C'_{i_1}, C'_{i_2}, ..., C'_{i_n}$.
\item Assign each $x_j$ to the cluster of $y_j$; that is, assign $x_i, x_j$ to the same cluster
iff $y_i, y_j$ were assigned to the same cluster.
\end{enumerate}
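The following is a minimal sketch of these steps in Python, using NumPy, SciPy,
and scikit-learn's \texttt{KMeans}; the function name and signature are our own,
not from \cite{ng2002spectral}, and the dense $n \times n$ affinity limits this
sketch to modestly sized datasets.
\begin{verbatim}
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def njw_spectral_clustering(X, k, sigma):
    # Step 1: Gaussian affinity matrix with a zero diagonal.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Step 2: symmetric normalization cA = D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    cA = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Step 3: eigenvectors of the k largest eigenvalues (eigh sorts ascending).
    _, vecs = eigh(cA)
    V = vecs[:, -k:]
    # Step 4: normalize each row of V to unit length.
    Y = V / np.linalg.norm(V, axis=1, keepdims=True)
    # Steps 5-6: k-means on the rows of Y; labels carry over to the x_i.
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
\end{verbatim}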
\subsubsection{Intuition}
\cite{ng2002spectral} provides a detailed intuition for this algorithm in their paper,
along with illustrative graphics. Here, we roughly follow their proof in an
abbreviated manner.
First, note that if we view our data as a graph $G$, $\cA$ is the normalized
adjacency matrix, and $\cA = I - \cL$, where $\cL$ is the normalized Laplacian of $G$.
Thus, the eigenvectors of $\cA$ are the same as those of the Laplacian, and each
eigenvalue of $\cA$ equals one minus the corresponding eigenvalue of $\cL$.
Now, as in \cite{ng2002spectral}, consider the ideal input data, where a
datapoint has non-zero similarity only to other datapoints in its cluster, and
there are exactly $k$ clusters. If we construct the corresponding graph $G$, it
must have exactly $k$ disjoint components. For a connected
component $i$ with vertices $S_i$, the top eigenvector $v_i$ has eigenvalue $1$
and is supported only on $S_i$ (it is proportional to $D^{1/2}\vec{\mathbf{1}}$
restricted to $S_i$), with zero entries elsewhere. Thus, if there
are $k$ disjoint clusters, $\cA$ has exactly $k$ mutually orthogonal
eigenvectors with eigenvalue $1$, one per component.
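As a quick numerical sanity check of this intuition (our own illustration, not
from \cite{ng2002spectral}), we can build a small block-structured affinity for
two disjoint components and verify that the top two eigenvalues of the
normalized matrix equal one, and that the component indicator vectors lie in the
span of the top two eigenvectors:
\begin{verbatim}
import numpy as np

# Two disjoint "clusters": nonzero similarity only within each block.
A = np.zeros((6, 6))
A[:3, :3] = 0.9
A[3:, 3:] = 0.8
np.fill_diagonal(A, 0.0)

d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
cA = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

vals, vecs = np.linalg.eigh(cA)
print(np.round(vals[-2:], 6))   # both equal to 1

# Degrees are equal within each block here, so the (normalized) indicator
# vectors themselves are eigenvectors; check they lie in the top-2 eigenspace.
ind = np.zeros((6, 2))
ind[:3, 0] = 1.0
ind[3:, 1] = 1.0
ind /= np.linalg.norm(ind, axis=0)
top2 = vecs[:, -2:]
print(np.allclose(top2 @ (top2.T @ ind), ind))   # True
\end{verbatim}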
\subsubsection{Assumptions}
Of course, if our adjacency matrix $\cA$ were ideal, clustering would not be
particularly interesting. A number of issues prevent us from obtaining this
ideal. Perhaps most obviously, our similarity metric is never exactly $0$.
However, other issues arise as well: our data may have noise, whether inherent
or due to measurement, leading to erroneous similarities. How does this method
perform in the presence of such less-than-ideal data, then?
\cite{ng2002spectral} assumes that there is, in fact, a true set of $k$ clusters
which gives rise to $X$, and further that there is an ideal $\hat{\cA}$ of which
$\cA$ is a perturbation (that is, $\cA = \hat{\cA} + E$, where $E$ is a
perturbation matrix). Then, we know that there must be $k$ orthogonal
eigenvectors of $\hat{\cA}$.
Ng et al.\ provide a number of assumptions about $\cA$, which lead to a bound
on the distance between the eigenvectors of $\cA$ and the eigenvectors of
$\hat{\cA}$. We show and discuss these assumptions, and provide an initial
attempt at proving a theorem that relates the perturbation of $\cA$ to the
distance between the eigenvectors of $\cA$ and $\hat{\cA}$. We first discuss
these assumptions and provide some motivation for them.
Let $\lambda_i$ refer to the $i$th eigenvalue of $\cA$. Let $\hat{\lambda}_i$
be the $i$th eigenvalue of $\hat{\cA}$. Finally, let $\hat{\lambda}^{(j)}_i$
refer to the $i$th eigenvalue of the $j$th component of $\hat{\cA}$. Notice that
the set of eigenvalues $\{\hat{\lambda}_i\}$ is the union over $j$ of the sets $\{\hat{\lambda}^{(j)}_i\}$.
Before we state Assumption 1, we provide some motivation. We would like
the eigenspace of $\hat{\cA}$ to be robust to perturbations of $\hat{\cA}$. As
it turns out, results from matrix perturbation theory relate the eigengap
to exactly this robustness. The Davis-Kahan theorem, discussed later, shows that the
larger the \textit{eigengap} between $\lambda_i$ and $\lambda_{i+1}$ (that is,
$\abs{\lambda_{i+1} - \lambda_i}$), the more robust the eigenspace of the top
$i$ eigenvectors is to perturbations of the matrix. With this intuition, we
make the following assumption:
\textbf{Assumption 1} There exists $\delta > 0$ such that
$\hat{\lambda}_{k+1} \leq 1 - \delta$.
We know that $\hat{\lambda}_k = \hat{\lambda}^{(j)}_1$ for some $j$, as in the
ideal matrix, the $k$ components each have largest eigenvalue exactly
$1$. Then, $\hat{\lambda}_{k+1} = \max_j \hat{\lambda}^{(j)}_2$, and this
assumption is equivalent to the following:
\[ \exists \delta > 0, \quad \hat{\lambda}_k - \hat{\lambda}_{k+1} = 1 - \max_j \hat{\lambda}^{(j)}_2 \geq \delta \]
To see that this is a reasonable assumption for arbitrary data, notice that if
$\delta \approx 0$, then for some component $j$, $\hat{\lambda}^{(j)}_2 \approx 1$,
which means that this component can itself be nearly separated into two components;
that is, one expected cluster is actually two independent clusters in disguise,
in which case our $k$ must have been poorly chosen.
The next few assumptions are more straightforward, and rely on the intra- and
inter-cluster similarities in our data. Let $\hat{d}_{j} = D_{jj} = \sum_i A_{ij}$,
and let $\hat{d}_{j}^{(i)} = \sum_{l \in C_i} A_{jl}$ denote the analogous degree
computed only over the points of cluster $C_i$. Let the notation $j \in
C_i$ denote the event that $x_j$ is from cluster $C_i$, and let the
notation $\{1,...,n_i\}$ refer to the indices of points in cluster $C_i$.
\textbf{Assumption 2}
There exists an $\epsilon_1 > 0$ such that, for every
$i_1, i_2 \in \{1,...,k\}$, $i_1 \neq i_2$, we have that
\[ \sum_{j \in C_{i_1}} \sum_{l \in C_{i_2}} \frac{\cA_{jl}^2}{\hat{d}_j \hat{d}_l} \leq \epsilon_1 \]
This assumption requires the degree-normalized similarity between any two
distinct clusters to be small; that is, inter-cluster similarities must be small
relative to the degrees, which are dominated by intra-cluster similarity.
\textbf{Assumption 3}
There exists an $\epsilon_2 > 0$ such that, for every $i \in \{1,...,k\}$ and
$j \in C_i$,
\[ \frac{\sum_{l \not\in C_i} \cA_{jl}}{\hat{d}_j} \leq
\epsilon_2 \left(\sum_{l, m \in C_i} \frac{\cA_{lm}^2}{\hat{d}_l\hat{d}_m}\right)^2 \]
This assumption bounds how similar a point can be to points in other clusters,
relative to the overall within-cluster similarity of its own cluster.
\textbf{Assumption 4}
There is a constant $C > 0$ such that, for every $i \in \{1,...,k\}$ and
$j \in \{1,...,n_i\}$,
\[ \hat{d}_j^{(i)} \geq \frac{\sum_{l=1}^{n_i} \hat{d}_l^{(i)}}{Cn_i} \]
Finally, this assumption states that no point may be too dissimilar from the
other points in its cluster, relative to the cluster's average within-cluster
degree.
\subsubsection{NJW Theorem}
Using the above assumptions, \cite{ng2002spectral} provides the following
theorem:
\textlcsc{Theorem 2.1, NJW Theorem}: Let Assumptions 1, 2, 3, and 4 hold. Set
$\epsilon = \sqrt{k(k-1)\epsilon_1 + k \epsilon_2^2}$. If
$\delta > (2 + \sqrt{2}) \epsilon$, then there exist $k$ orthogonal vectors
$r_1, ..., r_k$ so that the rows of $Y$ satisfy
\[ \frac{1}{n} \sum_{i=1}^k \sum_{j=1}^{n_i} \norm{y_j^{(i)} - r_i}_2^2 \leq
4C (4 + 2 \sqrt{k})^2 \frac{\epsilon^2}{(\delta - \sqrt{2}\epsilon)^2} \]
Surprisingly, we are unable to find any published proofs of this theorem, nor
any attempts at proving it. The full derivation would rely heavily on matrix
perturbation theory, which is outside the scope of this paper. However, we
provide an initial sketch and important results in perturbation theory that
would lead to the proof of this theorem.
First, we discuss the Davis-Kahan theorem as it applies to $\cA$ and $\hat{\cA}$,
with help from \cite{von2007tutorial}.
\textlcsc{Theorem 2.2, Davis-Kahan Theorem} (modified for our purposes): Let
$\hat{\cA}, E$ be symmetric matrices, and let $\norm{\cdot}$ be the two-norm for
matrices. Consider $\cA = \hat{\cA} + E$, a perturbed version of $\hat{\cA}$. Let
$\delta = \abs{\lambda_k - \lambda_{k+1}}$ (where $\lambda_k$ is the $k$th
eigenvalue of $\cA$), let $V_1$ be the subspace spanned by the eigenvectors
corresponding to the $k$ largest eigenvalues of $\cA$, and let $\hat{V}_1$ be the
corresponding subspace for $\hat{\cA}$. Then,
\[ d(V_1, \hat{V}_1) = \norm{\sin \Theta(V_1, \hat{V}_1)} \leq \frac{\norm{E}}{\delta} \]
Note that the distance between two subspaces here is defined via the principal
angles between them. We appeal to \cite{ee381vlec6} to relate this to
a more direct distance metric, and see that
\[ d(V_1, \hat{V}_1) \leq \min_{R} \norm{V_1 - \hat{V}_1R}_2 \leq \sqrt{2} d(V_1, \hat{V}_1) \]
where $R$ is an orthogonal matrix. Intuitively, the above bounds the distance
of $V_1$ from some rotation of $\hat{V}_1$, leading to the following bound
\[ \norm{V_1 - \hat{V}_1R}_2 \leq \sqrt{2} \frac{\norm{E}}{\delta} \]
This provides a matrix perturbation approach towards proving the NJW theorem,
bounding the distance between the eigenvectors of $\cA$ and some rotation of
the eigenvectors of $\hat{\cA}$.
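To make the bound concrete, the following sketch (our own illustration, with an
arbitrary two-block example and noise scale) perturbs an ideal block-structured
$\hat{\cA}$ by a small symmetric $E$ and compares the resulting $\sin\Theta$
subspace distance against $\norm{E}/\delta$:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 2

# Ideal two-block affinity, symmetrically normalized.
A_hat = np.zeros((n, n))
A_hat[:n // 2, :n // 2] = 1.0
A_hat[n // 2:, n // 2:] = 1.0
np.fill_diagonal(A_hat, 0.0)
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
cA_hat = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Small symmetric perturbation E.
E = rng.normal(scale=1e-3, size=(n, n))
E = (E + E.T) / 2.0
cA = cA_hat + E

def top_k_space(M, k):
    return np.linalg.eigh(M)[1][:, -k:]

V1, V1_hat = top_k_space(cA, k), top_k_space(cA_hat, k)

# sin-Theta distance via the principal angles between the two subspaces.
cosines = np.linalg.svd(V1_hat.T @ V1, compute_uv=False)
sin_theta = np.sqrt(np.clip(1.0 - cosines ** 2, 0.0, None)).max()

evals = np.sort(np.linalg.eigvalsh(cA))[::-1]     # descending
delta = abs(evals[k - 1] - evals[k])              # eigengap
print(sin_theta, np.linalg.norm(E, 2) / delta)    # distance vs. bound
\end{verbatim}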
\subsection{Memory constraints}
Having shown that the algorithm does, in fact, provide some bounds on the
clustering in a setting with perturbed clusters, we turn to issues with its
implementation. First, let us consider the memory constraints of constructing
$\cA$. Notice that $\cA$ is a similarity matrix over $n$ data points, requiring,
if implemented na\"{i}vely, $n^2$ doubles. With just 10 million data points,
this amounts to roughly 800~TB of memory ($10^{14}$ entries at 8 bytes each). A
number of papers since have focused on reducing this memory footprint, either
through approximation or sparsification of $\cA$. Unfortunately, the usual
techniques for these tasks require first materializing $\cA$ directly; see, for
example, the sparsification algorithm in \cite{spielman2011graph}. We discuss two
particular approaches: the Nystr\"{o}m method for approximating $\cA$, and the
t-nearest neighbors method for ``sparsification'' of $\cA$.
\subsubsection{Nystr\"{o}m Method}
The Nystr\"{o}m method for approximating $\cA$ begins by sampling $l$ rows
of $\cA$ to construct $C \in \R^{n \times l}$, and forms $W \in \R^{l \times l}$
as the submatrix of $\cA$ indexed by the $l$ samples. $\cA$ can then be
approximated by $\tilde{\cA} = C W_r^+ C^T$, where $W_r^+$ is the pseudoinverse
of the best rank $r$ approximation of $W$. Given this approximation,
\cite{choromanska2013fast} proves a theorem similar to that of NJW under
certain conditions, bounding the perturbation of the eigenvectors of
$\tilde{\cA}$ from $\hat{\cA}$. For the statement and proof of this theorem, we
refer the reader to \cite{choromanska2013fast}.
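A minimal sketch of the Nystr\"{o}m construction itself (our own; uniform
sampling of the $l$ columns and an SVD-based rank-$r$ truncation are assumptions,
not prescriptions from \cite{choromanska2013fast}):
\begin{verbatim}
import numpy as np

def nystrom_approx(C, idx, r):
    # C   : (n, l) array, the l sampled columns of cA
    # idx : indices of the sampled points (so W = C[idx, :] is l x l)
    # r   : target rank for the pseudoinverse of W
    W = C[idx, :]
    # Pseudoinverse of the best rank-r approximation of W.
    U, s, Vt = np.linalg.svd(W)
    s_inv = np.where(np.arange(len(s)) < r, 1.0 / np.maximum(s, 1e-12), 0.0)
    W_r_pinv = (Vt.T * s_inv) @ U.T
    return C @ W_r_pinv @ C.T        # tilde(cA) = C W_r^+ C^T
\end{verbatim}
In practice only $C$ (an $n \times l$ matrix) needs to be built, and approximate
eigenvectors of $\tilde{\cA}$ can be recovered from $C$ and $W_r^+$ without ever
forming the $n \times n$ product explicitly.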
\subsubsection{t-nearest neighbors}
\cite{song2008parallel} proposes a simple algorithm for sparsifying the similarity matrix
$\cA$: for each datapoint, maintain only the similarity to the top $t$ nearest
neighbors; for the rest, clamp the similarity to $0$. Clearly, this significantly
reduces the number of non-zero entries in $\cA$. Further,
\cite{song2008parallel} shows empirical improvements in clustering efficacy
over the Nystr\"{o}m Method; unfortunately, this paper does not provide any
theoretical bounds on the clustering. In fact, this technique may be
particularly tricky to analyze theoretically: it is easy to imagine that the
appropriate $t$ depends heavily on the structure and size of the dataset at hand,
and that choosing $t \ll n$ may easily fragment the data into many small clusters.
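A sketch of this clamping rule (our own; the final symmetrization is an
implementation choice, since keeping each point's top $t$ entries does not by
itself yield a symmetric matrix, and in practice the neighbors would come from a
spatial index rather than from a dense $A$):
\begin{verbatim}
import numpy as np
from scipy.sparse import csr_matrix

def t_nearest_sparsify(A, t):
    # Keep, for each row of the dense affinity A, only its t largest entries.
    n = A.shape[0]
    rows, cols, vals = [], [], []
    for i in range(n):
        keep = np.argpartition(A[i], -t)[-t:]   # indices of the t largest similarities
        rows.extend([i] * t)
        cols.extend(keep.tolist())
        vals.extend(A[i, keep].tolist())
    S = csr_matrix((vals, (rows, cols)), shape=(n, n))
    # Symmetrize: keep an edge if either endpoint keeps it.
    return S.maximum(S.T)
\end{verbatim}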
\subsection{Selecting $\sigma, k$}
The algorithm described earlier took as input two crucial parameters: $\sigma$
and $k$. These parameters can significantly alter the output of the
algorithm; for example, the very proof of the NJW algorithm's effectiveness
relies on the fact that $k$ is selected accurately (as otherwise, the eigengap
may be vanishingly small!).
Zelnik-Manor and Perona \cite{zelnik2004self} provide an empirically (but not
theoretically) validated selection for $\sigma_i = d(x_i, x_{i_K})$ where
$x_{i_K}$ is the $K$th closest neighbor to $x_i$. This allows a scaling
parameter that is \textit{local} to each data point; although this slightly
increases computation cost, it allows data in dense regions to have a
similarity that falls off quickly, while sparse data is brought closer
together. The affinity matrix is modified so that, for $j \neq i$, \[ A_{ij} =
\exp{\left(\frac{-d^2(x_i,x_j)}{\sigma_i \sigma_j}\right)} \]
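A sketch of this locally scaled affinity (our own; $K$ is left as a parameter,
and the dense distance matrix is used purely for clarity):
\begin{verbatim}
import numpy as np
from scipy.spatial.distance import pdist, squareform

def local_scaling_affinity(X, K):
    # Per-point scale sigma_i = distance to the K-th nearest neighbor.
    D = squareform(pdist(X))                 # pairwise Euclidean distances
    sigma = np.sort(D, axis=1)[:, K]         # column 0 is the point itself
    A = np.exp(-D ** 2 / (sigma[:, None] * sigma[None, :]))
    np.fill_diagonal(A, 0.0)
    return A
\end{verbatim}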
The question of determining the number of clusters can be approached in a number
of ways. Since we originally were looking for a large eigengap, we can simply
choose the $k$ that maximizes $\abs{\lambda_k - \lambda_{k+1}}$ (a one-line
heuristic, sketched below). However,
\cite{zelnik2004self} shows empirically that the value of this eigengap is not
particularly useful; intuitively, the eigengap depends on the structure of each
cluster component, and each component must itself be very tightly connected for
the eigengap at $k$ to be large. As we drift towards large $k$, our assumptions
on the structure of these components become weaker, and the eigengap is not
easily bounded.
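For reference, the eigengap heuristic itself amounts to the following sketch
(our own):
\begin{verbatim}
import numpy as np

def k_by_eigengap(cA, k_max):
    # Choose k maximizing the gap |lambda_k - lambda_{k+1}|.
    evals = np.sort(np.linalg.eigvalsh(cA))[::-1]      # descending
    gaps = np.abs(evals[:k_max] - evals[1:k_max + 1])  # gaps[i] follows the (i+1)-th eigenvalue
    return int(np.argmax(gaps)) + 1
\end{verbatim}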
Instead, \cite{zelnik2004self} uses the key insight that for a well-chosen $k$,
there exists a set of $k$ nearly orthogonal eigenvectors that correspond to
the indicator-like eigenvectors of the individual components. If $k$ is too large,
these eigenvectors will start exhibiting dependencies. This naturally leads to a
criterion for selecting $k$ based on whether or not the eigenvectors of $\cA$ can
be rotated to align with the elementary basis. In particular, Zelnik-Manor and
Perona propose selecting the $k$ that minimizes a cost based on the rotation of
the eigenvectors. They define the cost of a given set of eigenvectors as follows
(a sketch of the computation follows the definition):
\begin{align*}
V&: \text{Matrix of eigenvectors} \in \R^{n \times k} \\
R&: \text{Rotation matrix} \in \R^{k \times k} \\
M&: M_{i} = \max_j (VR)_{ij}; \quad M \in \R^{n} \\
J &= \min_R \sum_{i=1}^n \sum_{j=1}^k \frac{(VR)_{ij}^2}{M_i^2} \quad \text{(cost)}
\end{align*}
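A simplified sketch of evaluating this criterion (our own; \cite{zelnik2004self}
minimize $J$ with an incremental Givens-rotation gradient scheme, whereas here
the rotation is parameterized by a skew-symmetric matrix and handed to a generic
optimizer, and $M_i$ is taken as a maximum of absolute values):
\begin{verbatim}
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

def rotation_cost(V):
    # min over rotations R of sum_ij (VR)_ij^2 / M_i^2, for candidate k >= 2.
    n, k = V.shape

    def cost(theta):
        S = np.zeros((k, k))
        S[np.triu_indices(k, 1)] = theta
        R = expm(S - S.T)                    # exp of skew-symmetric -> rotation
        Z = V @ R
        M = np.abs(Z).max(axis=1, keepdims=True)
        return np.sum(Z ** 2 / M ** 2)

    theta0 = np.zeros(k * (k - 1) // 2)
    return minimize(cost, theta0, method="Nelder-Mead").fun

# One would compute the top-k eigenvectors for each candidate k and select the
# k whose rotation cost is (near-)minimal.
\end{verbatim}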
\subsection{Relevant work}
Many improvements to the spectral clustering approach have been proposed in
recent years. \cite{song2008parallel} discuss the parallelism of this algorithm,
among other runtime considerations. \cite{lee2014multiway} proves the conjecture
that there exist $k$ eigenvalues of the Laplacian of the graph close to 0 if and
only if the graph can be separated into $k$ components with sparse cuts; this
backs the approach by NJW which assumes that the first $k$ eigenvalues of $\cA$
are close to $1$ if there are $k$ clusters. Finally, the reader may refer to
\cite{stewart1990matrix} for many results in matrix perturbation theory, which
may be helpful in proving the NJW theorem.