1. Notation
Assume the data consists of $n$ datapoints of $p$ variables. Let $X \in \mathbb{R}^{n \times p}$, with $x_{ij}$ being the $j$th variable for the $i$th datapoint. Denote the $i$th datapoint by $x_i \in \mathbb{R}^p$ and assume $x_1, \dots, x_n$ are i.i.d. samples of a random vector $X$ over $\mathbb{R}^p$. The $j$th dimension of $X$ will be denoted $X^{(j)}$.
2. Data Representation
A key class of unsupervised methods converts (input) data into alternative forms by learning some mapping $\phi: \mathbb{R}^p \to \mathbb{R}^k$, $x_i \mapsto z_i = \phi(x_i)$, where $x_i \in \mathbb{R}^p$ and $z_i \in \mathbb{R}^k$. Sometimes we map to a higher-dimensional space ($k > p$) to introduce additional features (e.g., kernel methods). More commonly, we perform dimensionality reduction by mapping to a lower-dimensional space ($k < p$, typically $k \ll p$).
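As a rough illustration of the two regimes, the NumPy sketch below (not from the original notes; the feature maps and array sizes are made up for illustration) contrasts a map into a higher-dimensional feature space with a map onto a lower-dimensional projection.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # n = 100 datapoints, p = 2 variables

def phi_expand(x):
    """Map R^2 -> R^5: append quadratic features (k > p)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

def phi_project(x, v):
    """Map R^2 -> R^1: project onto a unit direction v (k < p)."""
    return np.array([v @ x])

v = np.array([1.0, 1.0]) / np.sqrt(2)
Z_high = np.stack([phi_expand(x) for x in X])     # shape (100, 5)
Z_low = np.stack([phi_project(x, v) for x in X])  # shape (100, 1)
print(Z_high.shape, Z_low.shape)
```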
3. Principal Component Analysis (PCA)
PCA is a linear dimensionality reduction technique: it finds a new (orthonormal) basis for the data. We want to choose a $k$-dimensional representation that captures the most information about the variability in $X$. PCA uses variance as a proxy for information: it finds an orthogonal basis that maximizes the variance of the projected data. PCA is often a useful data preprocessing step (particularly with high-dimensional data) or a tool for exploratory data analysis.
3.1. Definition
PCA is a technique to find an orthogonal basis $v_1, \dots, v_p$ of $\mathbb{R}^p$ such that:
- the first principal component (PC) $v_1$ is the direction of greatest variance of the data:
$$v_1 = \arg\max_{\|v\|_2 = 1} \frac{1}{n-1} \sum_{i=1}^n \big( v^\top (x_i - \bar{x}) \big)^2;$$
- the $j$th PC $v_j$ is the direction orthogonal to $v_1, \dots, v_{j-1}$ of greatest variance, for $j = 2, \dots, p$:
$$v_j = \arg\max_{\|v\|_2 = 1,\; v \perp v_1, \dots, v_{j-1}} \frac{1}{n-1} \sum_{i=1}^n \big( v^\top (x_i - \bar{x}) \big)^2.$$
3.2. Derivation
Recall that the sample covariance matrix is
$$S = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top,$$
so the sample variance of the projection of the data onto a unit vector $v$ is $v^\top S v$, and the first PC solves
$$v_1 = \arg\max_{\|v\|_2 = 1} v^\top S v.$$
Since this is a constrained optimization problem, introduce a Lagrange multiplier $\lambda$ for the constraint $v^\top v = 1$; setting the gradient of $v^\top S v - \lambda (v^\top v - 1)$ to zero gives $S v = \lambda v$, so any stationary point $v$ is an eigenvector of $S$ with objective value $v^\top S v = \lambda$. The maximum is attained at the eigenvector with the largest eigenvalue, and repeating the argument under the orthogonality constraints yields the remaining eigenvectors in order of decreasing eigenvalue. The PCA components can therefore be found by the eigendecomposition
$$S = V \Lambda V^\top,$$
where the columns of $V = [v_1, \dots, v_p]$ are the PCs and $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p)$ with $\lambda_1 \ge \dots \ge \lambda_p \ge 0$.
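As a sketch of this eigendecomposition route in NumPy (synthetic data and variable names are illustrative; `numpy.linalg.eigh` is used because $S$ is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # n x p data matrix

Xc = X - X.mean(axis=0)                     # center the data
S = (Xc.T @ Xc) / (Xc.shape[0] - 1)         # sample covariance matrix

lam, V = np.linalg.eigh(S)                  # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]               # sort into decreasing order
lam, V = lam[order], V[:, order]            # columns of V are the PCs

print(lam)                                  # eigenvalues lambda_1 >= ... >= lambda_p
print(V[:, 0])                              # first principal component v_1
```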
3.3. Optimal Linear Reconstruction
The $k$-dimensional representation of data item $x_i$ is the vector of projections of $x_i$ onto the first $k$ PCs:
$$z_i = V_{1:k}^\top x_i = (v_1^\top x_i, \dots, v_k^\top x_i)^\top \in \mathbb{R}^k,$$
where $V_{1:k} = [v_1, \dots, v_k] \in \mathbb{R}^{p \times k}$ (assuming the data matrix $X$ has been centered). The transformed data matrix, a.k.a. the scores matrix, is
$$Z = X V_{1:k} \in \mathbb{R}^{n \times k}.$$
The reconstruction of $x_i$ is
$$\hat{x}_i = V_{1:k} z_i = V_{1:k} V_{1:k}^\top x_i.$$
PCA gives the optimal linear reconstruction of the original data based on a $k$-dimensional compression, i.e., it minimizes the total squared reconstruction error $\sum_{i=1}^n \|x_i - \hat{x}_i\|_2^2$ over all choices of $k$ orthonormal directions.
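A minimal sketch of the projection and reconstruction steps, assuming a centered data matrix and the eigendecomposition route from Section 3.2 (names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)                     # centered n x p data

S = (Xc.T @ Xc) / (Xc.shape[0] - 1)
lam, V = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

k = 2
V_k = V[:, :k]                              # first k PCs, p x k
Z = Xc @ V_k                                # scores matrix, n x k
X_hat = Z @ V_k.T                           # reconstruction, n x p

err = np.sum((Xc - X_hat) ** 2)             # total squared reconstruction error
print(Z.shape, X_hat.shape, err)
```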
3.4. Properties
$S$ is a real symmetric matrix, and thus its eigenvectors (the PCs) are orthogonal. The projection to the $j$th PC, $X v_j$, has sample variance $\lambda_j$, since (for centered $X$)
$$\frac{1}{n-1} (X v_j)^\top (X v_j) = v_j^\top S v_j = \lambda_j v_j^\top v_j = \lambda_j.$$
Projections to distinct PCs are uncorrelated, since for $j \neq \ell$,
$$\frac{1}{n-1} (X v_j)^\top (X v_\ell) = v_j^\top S v_\ell = \lambda_\ell\, v_j^\top v_\ell = 0.$$
The total sample variance is given by
$$\frac{1}{n-1} \sum_{i=1}^n \|x_i - \bar{x}\|_2^2 = \operatorname{tr}(S) = \lambda_1 + \dots + \lambda_p,$$
so the proportion of total variance explained by the $j$th PC is $\lambda_j / (\lambda_1 + \dots + \lambda_p)$.
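These properties can be checked numerically; the sketch below (synthetic data, illustrative names) verifies that the projection variances equal the eigenvalues, that the scores are uncorrelated, and computes the proportion of variance explained by each PC.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

S = (Xc.T @ Xc) / (Xc.shape[0] - 1)
lam, V = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

Z = Xc @ V                                  # projections onto all PCs
proj_var = Z.var(axis=0, ddof=1)            # sample variance of each projection
print(np.allclose(proj_var, lam))           # variances equal the eigenvalues

C = np.cov(Z, rowvar=False)                 # covariance of scores: off-diagonals ~ 0
print(np.allclose(C - np.diag(np.diag(C)), 0.0, atol=1e-8))

print(lam / lam.sum())                      # proportion of variance explained per PC
```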
4. Singular Value Decomposition (SVD)
Recall that any real-valued $n \times p$ matrix $X$ can be written as $X = U D V^\top$, where $U$ is an $n \times n$ orthogonal matrix (i.e., $U U^\top = U^\top U = I_n$), $D$ is an $n \times p$ matrix with decreasing non-negative elements on the diagonal (the singular values) and zero off-diagonal elements, and $V$ is a $p \times p$ orthogonal matrix (i.e., $V V^\top = V^\top V = I_p$). The SVD always exists, even for non-square matrices.
Fast and numerically stable algorithms for SVD are available in most languages.
Let $X = U D V^\top$ be the SVD of the (centered) $n \times p$ data matrix $X$. By analogy to the eigendecomposition $S = V \Lambda V^\top$, note that
$$X^\top X = V D^\top U^\top U D V^\top = V (D^\top D) V^\top = (n-1)\, V \Lambda V^\top.$$
The matrix $V$ from the SVD of $X$ is the same as the $V$ from the eigendecomposition of $X^\top X = (n-1) S$, and thus the columns of $V$ are the PCs; $\Lambda = \frac{1}{n-1} D^\top D$, where $\lambda_j = d_j^2 / (n-1)$ for all $j \le \min(n, p)$, and if $p > n$, $\lambda_j = 0$ for all $j > n$.
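A sketch checking this correspondence with NumPy (synthetic data; note that `numpy.linalg.svd` returns $V^\top$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Eigendecomposition route
S = (Xc.T @ Xc) / (n - 1)
lam, V_eig = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]
lam, V_eig = lam[order], V_eig[:, order]

# SVD route
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
V_svd = Vt.T

print(np.allclose(lam, d**2 / (n - 1)))     # lambda_j = d_j^2 / (n - 1)
# Columns of V agree up to sign:
print(np.allclose(np.abs(V_eig.T @ V_svd), np.eye(5), atol=1e-8))
```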
5. PCA Projection from Gram Matrix
Recall that
$$X X^\top = U D V^\top V D^\top U^\top = U (D D^\top) U^\top,$$
where $B = X X^\top$ is called the Gram matrix of dataset $X$. The matrices $X X^\top$ and $X^\top X$ have the same nonzero eigenvalues, equal to the nonzero squared singular values of $X$.
We have
$$Z = X V = U D V^\top V = U D.$$
If we consider the projection to all PCs, the transformed data matrix is $Z = U D$, so the scores can be obtained from the eigendecomposition of the Gram matrix $B = U (D D^\top) U^\top$ alone: its eigenvectors give $U$ and the square roots of its nonzero eigenvalues give the singular values, without ever forming $X^\top X$.
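A sketch of computing the scores from the Gram matrix alone (synthetic data, illustrative names), checked against the usual projection $Z = X V$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Xc = X - X.mean(axis=0)

B = Xc @ Xc.T                               # n x n Gram matrix
mu, U = np.linalg.eigh(B)                   # eigendecomposition of B
order = np.argsort(mu)[::-1]
mu, U = mu[order], U[:, order]

r = np.sum(mu > 1e-10)                      # number of nonzero eigenvalues (rank)
D = np.sqrt(mu[:r])                         # nonzero singular values of Xc
Z_gram = U[:, :r] * D                       # scores Z = U D (all PCs)

# Compare with the direct projection Z = X V (columns agree up to sign)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z_direct = Xc @ Vt.T
print(np.allclose(np.abs(Z_gram), np.abs(Z_direct[:, :r]), atol=1e-8))
```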