1. Notation
Assume the data consists of \(n\) datapoints of \(p\) variables. Let \(\mathbf{X}=(x_{ij}) \in \mathbb{R}^{n \times p}\) with \(x_{ij}\) being the \(j\)th variable for the \(i\)th datapoint.
Denote the \(i\)th datapoint by \(\mathbf{x}_i \in \mathbb{R}^p\) and assume \(\mathbf{x}_1, \ldots, \mathbf{x}_n\) are i.i.d. samples of a random vector \(X\) over \(\mathbb{R}^p\). The \(j\)th dimension of \(X\) will be denoted \(X^{(j)}\).
2. Data Representation
A key class of unsupervised methods converts (input) data into alternative forms by learning a mapping \(\mathbf{x} \mapsto \mathbf{z}\), where \(\mathbf{x} \in \mathbb{R}^p\) and \(\mathbf{z} \in \mathbb{R}^k\).
Sometimes we map to a higher-dimensional space (\(k>p\)) to introduce additional features (e.g., kernel methods).
More commonly, we perform dimensionality reduction by mapping to a lower dimensional space (\(k<p\), typically \(k \ll p\)).
3. Principal Component Analysis (PCA)
PCA is a linear dimensionality reduction technique: it finds a new basis for the data, \[\mathbf{z}_i=V^\top\mathbf{x}_i.\]
We want to choose \(V\) so that it captures as much as possible of the variability in \(X\).
PCA uses variance as a proxy: it finds an orthogonal basis that maximizes variance of \(Z=V^\top X\).
PCA is often a useful data preprocessing step (particularly with high-dimensional data) or for exploratory data analysis.
3.1. Definition
PCA is a technique to find an orthogonal basis \(\{\mathbf{v}_1, \ldots, \mathbf{v}_p\}\) for the data space s.t.
the first principal component (PC) \(\mathbf{v}_1\) is the direction of greatest variance of data: \[\mathbf{v}_1=\mathop{\arg\max}_{\mathbf{v}: \mathbf{v}^\top\mathbf{v}=1} \text{Var}[\mathbf{v}^\top X];\]
the \(j\)th PC \(\mathbf{v}_j\) is the direction orthogonal to \(\mathbf{v}_1, \ldots, \mathbf{v}_{j-1}\) of greatest variance, for \(j=2, \ldots, p\): \[\mathbf{v}_j=\mathop{\arg\max}_{\mathbf{v}:\, \mathbf{v}^\top\mathbf{v}=1,\ \mathbf{v}^\top\mathbf{v}_i=0 \text{ for all } i<j} \text{Var}[\mathbf{v}^\top X].\]
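This definition can be checked numerically. The following is a minimal sketch, not from the notes: it draws a small synthetic dataset, takes the leading eigenvector of the sample covariance matrix as \(\mathbf{v}_1\), and verifies that no random unit vector achieves a larger projected sample variance. All names and data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])   # synthetic data, n=200, p=3
Xc = X - X.mean(axis=0)                                    # centre the data
S = Xc.T @ Xc / (Xc.shape[0] - 1)                          # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)                       # eigenvalues in ascending order
v1 = eigvecs[:, -1]                                        # eigenvector of the largest eigenvalue

def proj_var(v):
    """Sample variance of the projections v^T x_i."""
    return (Xc @ v).var(ddof=1)

# No random unit direction should beat v1 (up to floating-point error).
dirs = rng.normal(size=(1000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
assert all(proj_var(v1) >= proj_var(v) - 1e-9 for v in dirs)
```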
3.2. Derivation
Recall that \(\displaystyle \mathbf{v}_1=\mathop{\arg\max}_{\mathbf{v}} \text{Var}[\mathbf{v}^\top X]\) s.t. \(\mathbf{v}^\top\mathbf{v}=1\). Define the Lagrangian \[\begin{aligned} \mathcal{L}(\mathbf{v}, \lambda)&=\text{Var}[\mathbf{v}^\top X]-\lambda(\mathbf{v}^\top\mathbf{v}-1) \\&=\mathbf{v}^\top\text{Var}[X]\mathbf{v}-\lambda(\mathbf{v}^\top\mathbf{v}-1) \\&\approx \mathbf{v}^\top S\mathbf{v}-\lambda(\mathbf{v}^\top\mathbf{v}-1) \end{aligned}\] where \(\displaystyle S=\frac{1}{n-1}\sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i^\top\) is the sample covariance matrix (assuming the data have been centred). Setting \[\frac{\partial}{\partial \mathbf{v}}\mathcal{L}(\mathbf{v}, \lambda)=2S\mathbf{v}-2\lambda\mathbf{v}=0\] at \(\mathbf{v}=\mathbf{v}_1\) gives \(S\mathbf{v}_1=\lambda\mathbf{v}_1\), i.e., \(\mathbf{v}_1\) is an eigenvector of \(S\).
Since \(\mathbf{v}_1^\top S\mathbf{v}_1=\lambda\mathbf{v}_1^\top\mathbf{v}_1=\lambda\), to maximize the variance we take the eigenvector corresponding to the largest eigenvalue. Similarly, \(\mathbf{v}_j\) is the eigenvector of \(S\) with the \(j\)th largest eigenvalue.
The PCA components can be found by the eigendecomposition \[S=\frac{1}{n-1} \mathbf{X}^\top\mathbf{X}=V\Lambda V^\top\] where \(\Lambda\) is a diagonal matrix with eigenvalues (variances along each PC) \(\lambda_1 \geq \cdots \geq \lambda_p \geq 0\), and \(V\) (loadings matrix) is a \(p \times p\) orthogonal matrix whose columns are the \(p\) eigenvectors of \(S\), i.e., the PCs \(\mathbf{v}_1, \ldots, \mathbf{v}_p\).
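As a concrete illustration, here is a short sketch (function and variable names are illustrative) of PCA via the eigendecomposition of the sample covariance matrix, returning the eigenvalues \(\lambda_1 \geq \cdots \geq \lambda_p\) and the loadings matrix \(V\):

```python
import numpy as np

def pca_eig(X):
    """PCA via the eigendecomposition S = V Λ V^T of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)                  # centre each variable
    S = Xc.T @ Xc / (Xc.shape[0] - 1)        # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder so that λ_1 >= ... >= λ_p
    Lam = eigvals[order]                     # variances along each PC
    V = eigvecs[:, order]                    # loadings matrix, columns are the PCs
    return Lam, V
```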
3.3. Optimal Linear Reconstruction
The \(k\)-dimensional representation of data item \(\mathbf{x}_i\) is the vector of projections of \(\mathbf{x}_i\) onto the first \(k\) PCs: \[\mathbf{z}_i=V_{1:k}^\top\mathbf{x}_i=[\mathbf{v}_1^\top \mathbf{x}_i, \ldots, \mathbf{v}_k^\top \mathbf{x}_i]^\top \in \mathbb{R}^k\] where \(V_{1:k}=\begin{bmatrix}\mathbf{v}_1 & \cdots & \mathbf{v}_k\end{bmatrix}\).
Transformed data matrix, a.k.a. scores matrix: \[\mathbf{Z}=\mathbf{X}V_{1:k} \in \mathbb{R}^{n \times k}.\]
Reconstruction of \(\mathbf{x}_i\): \[\widehat{\mathbf{x}}_i=V_{1:k}V_{1:k}^\top \mathbf{x}_i.\]
PCA gives the optimal linear reconstruction of the original data based on a \(k\)-dimensional compression, i.e., it minimizes the squared reconstruction error \(\sum_{i=1}^n \|\mathbf{x}_i-\widehat{\mathbf{x}}_i\|^2\) among all linear projections onto a \(k\)-dimensional subspace.
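A minimal sketch of computing the scores matrix \(\mathbf{Z}\) and the reconstructions \(\widehat{\mathbf{x}}_i\); the function name is an illustrative assumption, and the mean is added back after the centred reconstruction:

```python
import numpy as np

def pca_compress(X, k):
    """Project onto the first k PCs and reconstruct in the original space."""
    mu = X.mean(axis=0)
    Xc = X - mu                                 # centre the data
    S = Xc.T @ Xc / (Xc.shape[0] - 1)           # sample covariance matrix
    _, eigvecs = np.linalg.eigh(S)
    V_k = eigvecs[:, ::-1][:, :k]               # first k PCs (largest eigenvalues first)
    Z = Xc @ V_k                                # n x k scores matrix
    X_hat = Z @ V_k.T + mu                      # reconstruction V_{1:k} V_{1:k}^T x_i (plus mean)
    return Z, X_hat
```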
3.4. Property
\(S\) is a real symmetric matrix, so its eigenvectors (the PCs) can be chosen to be orthogonal.
Projection to the \(j\)th PC, \(Z^{(j)}=\mathbf{v}_j^\top X\), has sample variance \(\lambda_j\), since \[\widehat{\text{Var}}[Z^{(j)}]=\mathbf{v}_j^\top S\mathbf{v}_j=\lambda_j\mathbf{v}_j^\top \mathbf{v}_j=\lambda_j.\]
Projections to PCs are uncorrelated, since for \(i \neq j\), \[\widehat{\text{Cov}}(Z^{(i)}, Z^{(j)})=\mathbf{v}_i^\top S\mathbf{v}_j=\lambda_j\mathbf{v}_i^\top\mathbf{v}_j=0.\]
The total sample variance is given by \[\text{Tr}(S)=\sum_{i=1}^p S_{ii}=\lambda_1+\cdots+\lambda_p\] so the proportion of total variance explained by the \(j\)th PC is \(\displaystyle \frac{\lambda_j}{\lambda_1+\cdots+\lambda_p}\).
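A short sketch of the variance-explained computation (the helper name is an illustrative assumption); it also checks that the total sample variance \(\text{Tr}(S)\) equals the sum of the eigenvalues:

```python
import numpy as np

def explained_variance_ratio(X):
    """Proportion of total sample variance explained by each PC."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (Xc.shape[0] - 1)
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]   # λ_1 >= ... >= λ_p
    assert np.isclose(np.trace(S), lam.sum())    # total variance = Tr(S) = Σ λ_j
    return lam / lam.sum()
```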
4. Singular Value Decomposition (SVD)
Recall that any real-valued \(n \times p\) matrix \(\mathbf{X}\) can be written as \(\mathbf{X}=UDV^\top\), where \(U\) is an \(n \times n\) orthogonal matrix (i.e., \(UU^\top=U^\top U=I_n\)), \(D\) is an \(n \times p\) matrix with decreasing non-negative elements on the diagonal (the singular values) and zero off-diagonal elements, and \(V\) is a \(p \times p\) orthogonal matrix (i.e., \(VV^\top=V^\top V=I_p\)).
SVD always exists even for non-square matrices.
Fast and numerically stable algorithms for SVD are available in most languages.
Let \(\mathbf{X}=UDV^\top\) be the SVD of the \(n \times p\) data matrix \(\mathbf{X}\). We have \[(n-1)S=\mathbf{X}^\top\mathbf{X}=(UDV^\top)^\top(UDV^\top)=VD^\top U^\top UDV^\top=VD^\top DV^\top.\]
By analogy to the eigendecomposition \(S=V \Lambda V^\top\):
\(V\) from SVD of \(\mathbf{X}\) is the same as from eigendecomposition of \(S\), and thus the columns in \(V\) are the PCs;
\(\displaystyle \Lambda=\frac{1}{n-1}D^\top D\) where \(D_{ii}=\sqrt{(n-1)\lambda_i}\) for all \(i \leq \min(n, p)\), and if \(p>n\), \(\lambda_i=0\) for all \(i>n\).
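The correspondence can be checked numerically; the following sketch (synthetic data and names are illustrative) compares the SVD route with the eigendecomposition of \(S\):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                             # centre the data
n = Xc.shape[0]

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # thin SVD; d holds the singular values
lam_svd = d**2 / (n - 1)                            # Λ = D^T D / (n - 1)

S = Xc.T @ Xc / (n - 1)
lam_eig = np.sort(np.linalg.eigvalsh(S))[::-1]      # eigenvalues of S, descending
assert np.allclose(lam_svd, lam_eig)
# The rows of Vt are the PCs; they match the eigenvectors of S up to sign flips.
```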
5. PCA Projection from Gram Matrix
Recall that \(\mathbf{K}=\mathbf{X}\mathbf{X}^\top\), where \(\mathbf{K}_{ij}=\mathbf{x}_i^\top\mathbf{x}_j\) is called the Gram matrix of dataset \(\mathbf{X}\).
\(\mathbf{K}\) and \((n-1)S=\mathbf{X}^\top\mathbf{X}\) have the same nonzero eigenvalues, equal to the nonzero squared singular values of \(\mathbf{X}\).
We have \[\mathbf{X}\mathbf{X}^\top=(UDV^\top)(UDV^\top)^\top=UDV^\top VD^\top U^\top=UDD^\top U^\top.\] Note that the columns of \(U\) are the eigenvectors of \(\mathbf{K}\); they are different from the eigenvectors of \(\mathbf{X}^\top\mathbf{X}\) (which are the columns of \(V\)).
If we consider projection to all PCs, the transformed data matrix is \[\mathbf{Z}=\mathbf{X}V=UDV^\top V=UD.\] Hence, \(\mathbf{Z}\) can be obtained from the eigendecomposition of Gram matrix \(\mathbf{K}\), where eigendecomposition gives us \(U\) directly, and \(D_{ii}=\sqrt{(n-1)\lambda_i}\). When \(p \gg n\), eigendecomposition of \(\mathbf{K}\) requires much less computation (\(\mathcal{O}(n^3)\)) than the eigendecomposition of the covariance matrix (\(\mathcal{O}(p^3)\)).
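A sketch of this route in the \(p \gg n\) regime (all names, shapes, and data below are illustrative assumptions): the scores are recovered from the \(n \times n\) Gram matrix alone and compared against the scores obtained from the SVD of the centred data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2000                                     # p >> n
X = rng.normal(size=(n, p))
Xc = X - X.mean(axis=0)                             # centre the data

# Route 1: eigendecomposition of the n x n Gram matrix K (O(n^3))
K = Xc @ Xc.T
eigvals_K, U = np.linalg.eigh(K)
order = np.argsort(eigvals_K)[::-1]
U, eigvals_K = U[:, order], eigvals_K[order]
D = np.sqrt(np.clip(eigvals_K, 0.0, None))          # singular values of the centred data
Z_gram = U * D                                      # scores Z = U D (= X V for centred X)

# Route 2 (for comparison only): thin SVD gives the same scores up to sign flips
U2, d2, Vt2 = np.linalg.svd(Xc, full_matrices=False)
k = 5
assert np.allclose(np.abs(Z_gram[:, :k]), np.abs((U2 * d2)[:, :k]))
```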