
Clustering

1. Clustering

  • Clustering is an unsupervised learning task that is equivalent to the idea of separating things into groups. A clustering algorithm groups datapoints into clusters based on some notion of similarity.

  • Cluster assignments are discrete summaries of the datapoints; clustering can thus be seen as an extreme form of compression that characterizes each datapoint by the cluster it belongs to.

  • In general, there is no single definition of what the "right" clustering is.

  • Intuitively, clustering aims to group similar items together and to place dissimilar items in different groups. But these two objectives can contradict each other: similarity is not a transitive relation, whereas belonging to the same cluster is an equivalence relation. For example, a long chain of pairwise-similar points may force very dissimilar endpoints into the same cluster.

1.1. Types of Clustering

  • Model-free clustering (discriminative).

  • Model-based clustering (generative).

1.2. Desired Fundamental Properties

Suppose a clustering method is a map \(\mathcal{F}: (\mathcal{D}=\{\mathbf{x}_i\}_{i=1}^n, \rho) \mapsto \{C_1, \ldots, C_K\}\) which takes a dataset \(\mathcal{D}\) and a dissimilarity function \(\rho\) as input and returns a partition of \(\mathcal{D}\). Three basic properties are desired:

  • Scale invariance. For any \(\alpha>0\), \(\mathcal{F}(\mathcal{D}, \alpha\rho)=\mathcal{F}(\mathcal{D}, \rho)\).

  • Richness. For any partition \(C=\{C_1, \ldots, C_K\}\) of \(\mathcal{D}\), there exists a dissimilarity function \(\rho_C\) s.t. \(\mathcal{F}(\mathcal{D}, \rho_C)=C\).

  • Consistency. Suppose \(\rho\) and \(\rho'\) are two dissimilarities s.t. the following hold for all \(\mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}\): (1) if \(\mathbf{x}_i\) and \(\mathbf{x}_j\) belong to the same cluster in \(\mathcal{F}(\mathcal{D}, \rho)\), then \(\rho'(\mathbf{x}_i, \mathbf{x}_j) \leq \rho(\mathbf{x}_i, \mathbf{x}_j)\); (2) if \(\mathbf{x}_i\) and \(\mathbf{x}_j\) belong to different clusters in \(\mathcal{F}(\mathcal{D}, \rho)\), then \(\rho'(\mathbf{x}_i, \mathbf{x}_j) \geq \rho(\mathbf{x}_i, \mathbf{x}_j)\). Then \(\mathcal{F}(\mathcal{D}, \rho')=\mathcal{F}(\mathcal{D}, \rho)\).

Kleinberg (2003) proved that no clustering method can simultaneously satisfy all three properties (although it is easy to satisfy any two of them). Therefore, there is no axiomatic characterization of clustering, and different methodological choices will lead to different clustering behavior.

2. \(K\)-Means

  • The \(K\)-means algorithm is a simple partition-based, model-free clustering method.

  • The \(K\)-means algorithm uses a preassigned number of clusters \(K\), represents each cluster \(C_k\) by a centroid (or prototype) \(\boldsymbol{\mu}_k\), and measures the quality of each cluster by its within-cluster deviance from the centroid \[W(C_k, \boldsymbol{\mu}_k)=\sum_{i \in C_k} \|\mathbf{x}_i-\boldsymbol{\mu}_k\|_2^2.\]

  • The overall quality of the clustering is given by the total within-cluster deviance \[W=\sum_{k=1}^K W(C_k, \boldsymbol{\mu}_k)=\sum_{k=1}^K \sum_{i \in C_k} \|\mathbf{x}_i-\boldsymbol{\mu}_k\|_2^2=\sum_{i=1}^n \|\mathbf{x}_i-\boldsymbol{\mu}_{c_i}\|_2^2\] where \(c_i=k\) iff \(i \in C_k\).
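
As a sanity check on the objective, here is a minimal sketch that computes \(W\) directly from the last expression above. It assumes NumPy arrays and 0-indexed cluster labels; the function name is illustrative and not from the original notes.

    import numpy as np

    def total_within_cluster_deviance(X, assignments, centroids):
        # X: (n, d) datapoints; assignments: (n,) labels in {0, ..., K-1};
        # centroids: (K, d) cluster centroids.
        diffs = X - centroids[assignments]   # x_i - mu_{c_i} for every i
        return float(np.sum(diffs ** 2))     # W = sum_i ||x_i - mu_{c_i}||_2^2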

2.1. Optimization

  • \(W\) is the objective function used to select both the cluster centroids and the assignment of points to clusters.

  • The joint optimization is intractable, but we can use a coordinate descent approach. Given a partition \(\{C_k\}\), we can find the optimal centroids by differentiating \(W\) w.r.t. \(\boldsymbol{\mu}_k\): \[\frac{\partial W}{\partial\boldsymbol{\mu}_k}=-2\sum_{i \in C_k}(\mathbf{x}_i-\boldsymbol{\mu}_k)=0 \Rightarrow \boldsymbol{\mu}_k=\frac{1}{|C_k|}\sum_{i \in C_k} \mathbf{x}_i.\] Given the centroids, we can find the optimal partition by assigning each datapoint to the closest centroid: \[c_i=\mathop{\arg\min}_k \|\mathbf{x}_i-\boldsymbol{\mu}_k\|_2^2.\] Thus we can employ an iterative alternating optimization.

2.2. \(K\)-Means Algorithm

Algorithm 1 \(K\)-Means Algorithm.
    initialize \(K\) cluster centroids \(\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K\);
    repeat
        for \(i=1, \ldots, n\) do
            assign \(\mathbf{x}_i\) to the cluster with the nearest centroid, i.e., \(c_i:=\mathop{\arg\min}_k \|\mathbf{x}_i-\boldsymbol{\mu}_k\|_2^2\);
        end for
        set \(C_k:=\{i: c_i=k\}\) for each \(k\);
        set \(\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K\) to the averages of the new clusters, i.e., \(\boldsymbol{\mu}_k:=|C_k|^{-1} \sum_{i \in C_k} \mathbf{x}_i\);
    until the cluster assignments no longer change;
    return the partition \(\{C_1, \ldots, C_K\}\) and means \(\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K\);
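
Below is a minimal NumPy sketch of Algorithm 1. The random-datapoint initialization and the handling of empty clusters are illustrative choices rather than part of the algorithm as stated; Section 2.5 gives a better initialization.

    import numpy as np

    def k_means(X, K, max_iters=100, seed=None):
        # Minimal K-means: alternate assignment and centroid updates until stable.
        rng = np.random.default_rng(seed)
        # naive initialization: K distinct datapoints chosen at random
        centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        assignments = np.full(len(X), -1)
        for _ in range(max_iters):
            # assignment step: each point goes to the nearest centroid
            dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
            new_assignments = np.argmin(dists, axis=1)
            if np.array_equal(new_assignments, assignments):  # assignments stable
                break
            assignments = new_assignments
            # update step: each centroid becomes the mean of its current cluster
            for k in range(K):
                members = X[assignments == k]
                if len(members) > 0:          # keep the old centroid if a cluster empties
                    centroids[k] = members.mean(axis=0)
        return assignments, centroids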

2.3. Property

  • The \(K\)-means algorithm stops in a finite number of iterations.

  • The \(K\)-means algorithm need not converge to the global optimum.

2.4. Practicality

Performance can be very sensitive to certain choices:

  • Distance measure. Euclidean distance can be greatly affected by the measurement units and by strong correlations between features; we can use the Mahalanobis distance instead (see the sketch after this list): \[\|\mathbf{x}-\mathbf{y}\|_M=\sqrt{(\mathbf{x}-\mathbf{y})^\top M^{-1}(\mathbf{x}-\mathbf{y})}\] where \(M\) is a positive definite matrix (e.g., the sample covariance of the data).

  • Initialization. We can perform a number of runs from different random initializations and pick the result with the minimum \(W\). In addition, we can use \(K\)-means\(++\) (Section 2.5).

  • \(K\). Since \(W\) always improves as the number of clusters \(K\) grows, choosing \(K\) requires an additional regularization criterion, e.g., DP-means (Section 2.6).
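
The following sketch illustrates the Mahalanobis distance above. It assumes NumPy and takes the sample covariance of the data as \(M\), a common but not mandated choice.

    import numpy as np

    def mahalanobis(x, y, M):
        # ||x - y||_M = sqrt((x - y)^T M^{-1} (x - y)); solve() avoids forming M^{-1}
        d = x - y
        return float(np.sqrt(d @ np.linalg.solve(M, d)))

    # e.g., using the sample covariance of the data as M rescales the features and
    # removes the effect of strong correlations before measuring distance
    X = np.random.default_rng(0).normal(size=(100, 3))
    M = np.cov(X, rowvar=False)
    print(mahalanobis(X[0], X[1], M))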

2.5. \(K\)-Means\(++\)

\(K\)-means\(++\) is a simple and effective approach for initializing centroids.

Algorithm 2 \(K\)-Means\(++\) Algorithm.
    choose a datapoint \(\mathbf{x}_{i1}\) at random from \(\mathcal{D}\) and set \(\boldsymbol{\mu}_1 \leftarrow \mathbf{x}_{i1}\);
    choose a second datapoint w.p. \(P(i)=\rho_i^2/\sum_{j=1}^n \rho_j^2\) where \(\rho_i^2=\|\mathbf{x}_i-\boldsymbol{\mu}_1\|_2^2\) and set \(\boldsymbol{\mu}_2\) to its value;
    for \(k=3, \ldots, K\) do
        choose the \(k\)th datapoint w.p. \(P(i)\) where \(\rho_i^2=\min\{\|\mathbf{x}_i-\boldsymbol{\mu}_1\|_2^2, \ldots, \|\mathbf{x}_i-\boldsymbol{\mu}_{k-1}\|_2^2\}\);
        set \(\boldsymbol{\mu}_k\) to its value;
    end for

Arthur and Vassilvitskii (2007) proved that \[\mathbb{E}[W(\{C_k^{++}\})] \leq 8(\ln K+2)W^*\] where \(W^*\) is the within-cluster deviance of the globally optimal clustering and the expectation is taken over the random sampling used in the initialization.
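
A minimal NumPy sketch of the \(K\)-means\(++\) seeding in Algorithm 2 (the function name and 0-indexing are illustrative assumptions):

    import numpy as np

    def k_means_pp_init(X, K, seed=None):
        # K-means++ seeding: each new centroid is a datapoint drawn with probability
        # proportional to its squared distance to the nearest centroid chosen so far.
        rng = np.random.default_rng(seed)
        n = len(X)
        centroids = [X[rng.integers(n)]]                 # first centroid: uniform at random
        for _ in range(1, K):
            C = np.array(centroids)
            # rho_i^2 = min_k ||x_i - mu_k||_2^2 over the centroids chosen so far
            rho2 = np.min(np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2), axis=1)
            probs = rho2 / rho2.sum()                    # P(i) = rho_i^2 / sum_j rho_j^2
            centroids.append(X[rng.choice(n, p=probs)])
        return np.array(centroids)

The resulting centroids can then be passed to the standard \(K\)-means iterations in place of a purely random initialization.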

2.6. Dirichlet-Process-Means

The Dirichlet-process-means, a.k.a. DP-means algorithm, is a variant that allows \(K\) to be set adaptively by iteratively adding clusters. Like \(K\)-means, DP-means terminates after a finite number of iterations.

Algorithm 3 DP-Means Algorithm.
    choose a cluster cost hyperparameter \(\lambda\) (with larger \(\lambda\) producing fewer clusters);
    initialize \(K=1\) and \(\boldsymbol{\mu}_1=n^{-1}\sum_{i=1}^n \mathbf{x}_i\);
    repeat
        for \(i=1, \ldots, n\) do
            if \(\min_{k=1, \ldots, K} \|\mathbf{x}_i-\boldsymbol{\mu}_k\|_2^2>\lambda\) then
                set \(c_i \leftarrow K+1\), \(\boldsymbol{\mu}_{K+1} \leftarrow \mathbf{x}_i\), and \(K \leftarrow K+1\);
            else
                set \(c_i \leftarrow \mathop{\arg\min}_{k=1, \ldots, K} \|\mathbf{x}_i-\boldsymbol{\mu}_k\|_2^2\);
                update the centroids of the cluster that item \(i\) previously belonged to and of the cluster it now joins;
            end if
        end for
    until the cluster assignments no longer change for any item;
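
A minimal NumPy sketch of DP-means follows. For simplicity it recomputes all centroids after each pass over the data rather than updating only the two affected clusters as in Algorithm 3; the function name and 0-indexing are illustrative assumptions.

    import numpy as np

    def dp_means(X, lam, max_iters=100):
        # Minimal DP-means: start from one cluster at the global mean and open a new
        # cluster whenever a point is farther than lam from every existing centroid.
        n = len(X)
        centroids = [X.mean(axis=0)]
        assignments = np.full(n, -1)
        for _ in range(max_iters):
            changed = False
            for i in range(n):
                d2 = np.array([np.sum((X[i] - mu) ** 2) for mu in centroids])
                if d2.min() > lam:
                    centroids.append(X[i].astype(float))   # open a new cluster at x_i
                    k = len(centroids) - 1
                else:
                    k = int(np.argmin(d2))
                if k != assignments[i]:
                    assignments[i] = k
                    changed = True
            # recompute every centroid as the mean of its current members
            # (a simplification of the per-item centroid update in Algorithm 3)
            for k in range(len(centroids)):
                members = X[assignments == k]
                if len(members) > 0:
                    centroids[k] = members.mean(axis=0)
            if not changed:
                break
        return assignments, np.array(centroids)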