Principal Component Analysis

1. Notation

  • Assume the data consists of $n$ datapoints of $p$ variables. Let $X = (x_{ij}) \in \mathbb{R}^{n \times p}$ with $x_{ij}$ being the $j$th variable for the $i$th datapoint.

  • Denote the $i$th datapoint by $x_i \in \mathbb{R}^p$ and assume $x_1, \dots, x_n$ are i.i.d. samples of a random vector $X$ over $\mathbb{R}^p$. The $j$th dimension of $X$ will be denoted $X^{(j)}$.

2. Data Representation

  • A key class of unsupervised methods converts (input) data into alternative representations by learning a mapping $x \mapsto z$, where $x \in \mathbb{R}^p$ and $z \in \mathbb{R}^k$.

  • Sometimes we map to a higher-dimensional space ($k > p$) to introduce additional features (e.g., kernel methods).

  • More commonly, we perform dimensionality reduction by mapping to a lower-dimensional space ($k < p$, typically $k \ll p$).

3. Principal Component Analysis (PCA)

PCA is a linear dimensionality reduction technique: it finds a new basis for the data, $z_i = V^\top x_i$.

  • We want to choose a $V$ that captures the most information about the variability in $X$.

  • PCA uses variance as a proxy: it finds an orthogonal basis that maximizes the variance of $Z = V^\top X$.

  • PCA is often useful as a data preprocessing step (particularly with high-dimensional data) or for exploratory data analysis.

3.1. Definition

PCA is a technique to find an orthogonal basis $\{v_1, \dots, v_p\}$ for the data space such that

  • the first principal component (PC) $v_1$ is the direction of greatest variance of the data: $v_1 = \arg\max_{v : v^\top v = 1} \operatorname{Var}[v^\top X]$;

  • the $j$th PC $v_j$ is the direction orthogonal to $v_1, \dots, v_{j-1}$ of greatest variance, for $j = 2, \dots, p$: $v_j = \arg\max_{v : v^\top v = 1,\ v^\top v_i = 0 \ \forall i < j} \operatorname{Var}[v^\top X]$.
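As a minimal numerical sketch of this definition (assuming NumPy and a small synthetic 2-D dataset), one can scan unit directions $v = (\cos t, \sin t)$ and pick the one with the largest sample variance of $v^\top x$; the maximizer approximates the first PC.

import numpy as np

rng = np.random.default_rng(0)
# synthetic correlated 2-D data (illustration only)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500)
Xc = X - X.mean(axis=0)                           # centre the data

angles = np.linspace(0.0, np.pi, 1800)            # unit directions v(t) = (cos t, sin t)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
proj_var = np.var(Xc @ dirs.T, axis=0, ddof=1)    # sample variance of v^T x per direction

v1_approx = dirs[np.argmax(proj_var)]             # brute-force approximation of v_1
print("approximate first PC:", v1_approx)
print("largest projected variance:", proj_var.max())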

3.2. Derivation

Recall that $v_1 = \arg\max_v \operatorname{Var}[v^\top X]$ s.t. $v^\top v = 1$. Define the Lagrangian $L(v, \lambda) = \operatorname{Var}[v^\top X] - \lambda(v^\top v - 1) = v^\top \operatorname{Var}[X]\, v - \lambda(v^\top v - 1) \approx v^\top S v - \lambda(v^\top v - 1)$, where $S = \frac{1}{n-1} \sum_{i=1}^n x_i x_i^\top$ is the sample covariance matrix (of the centred data). Setting $\nabla_v L(v_1, \lambda) = 2 S v_1 - 2\lambda v_1 = 0$ gives $S v_1 = \lambda v_1$, i.e., $v_1$ is an eigenvector of $S$.

Since $v^\top S v = v^\top \lambda v = \lambda$, maximizing the variance means choosing the eigenvector corresponding to the largest eigenvalue. Similarly, $v_j$ is the eigenvector of $S$ with the $j$th largest eigenvalue.

The PCA components can be found by the eigendecomposition $S = \frac{1}{n-1} X^\top X = V \Lambda V^\top$, where $\Lambda$ is a diagonal matrix of eigenvalues (variances along each PC) $\lambda_1 \ge \dots \ge \lambda_p \ge 0$, and $V$ (the loadings matrix) is a $p \times p$ orthogonal matrix whose columns are the $p$ eigenvectors of $S$, i.e., the PCs $v_1, \dots, v_p$.
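As a concrete sketch of this result (assuming NumPy and a toy data matrix), the PCs and their variances can be read off from the eigendecomposition of the sample covariance matrix $S$:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # toy data, n = 200, p = 5

Xc = X - X.mean(axis=0)                   # centre the data so S below is the sample covariance
S = (Xc.T @ Xc) / (Xc.shape[0] - 1)       # S = X^T X / (n - 1)

eigvals, eigvecs = np.linalg.eigh(S)      # eigendecomposition of the symmetric matrix S
order = np.argsort(eigvals)[::-1]         # eigh returns ascending order; re-sort descending
lam = eigvals[order]                      # lambda_1 >= ... >= lambda_p (variances along PCs)
V = eigvecs[:, order]                     # columns of V are the PCs v_1, ..., v_p

print("variances along the PCs:", lam)
print("first PC v_1:", V[:, 0])

Note that np.linalg.eigh returns eigenvalues in increasing order, so they are re-sorted to match the convention $\lambda_1 \ge \dots \ge \lambda_p$.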

3.3. Optimal Linear Reconstruction

  • The $k$-dimensional representation of data item $x_i$ is the vector of projections of $x_i$ onto the first $k$ PCs: $z_i = V_{1:k}^\top x_i = [v_1^\top x_i, \dots, v_k^\top x_i]^\top \in \mathbb{R}^k$, where $V_{1:k} = [v_1\ \cdots\ v_k]$.

  • Transformed data matrix, a.k.a. scores matrix: $Z = X V_{1:k} \in \mathbb{R}^{n \times k}$.

  • Reconstruction of $x_i$: $\hat{x}_i = V_{1:k} V_{1:k}^\top x_i$.

  • PCA gives the optimal linear reconstruction of the original data based on a $k$-dimensional compression.
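A short self-contained sketch of the projection and reconstruction steps (again assuming NumPy and toy data; $k = 2$ is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # toy data
Xc = X - X.mean(axis=0)                                           # centred data

S = (Xc.T @ Xc) / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(S)
V = eigvecs[:, np.argsort(eigvals)[::-1]]        # PCs, sorted by decreasing variance

k = 2                        # arbitrary choice of the compression dimension
V_k = V[:, :k]               # V_{1:k}: the first k PCs as columns
Z = Xc @ V_k                 # scores matrix, n x k
X_hat = Z @ V_k.T            # reconstructions V_{1:k} V_{1:k}^T x_i (of the centred data)

err = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print("mean squared reconstruction error with k =", k, ":", err)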

3.4. Properties

  • $S$ is a real symmetric matrix, and thus its eigenvectors (the PCs) are orthogonal.

  • The projection onto the $j$th PC, $Z^{(j)} = v_j^\top X$, has sample variance $\lambda_j$, since $\widehat{\operatorname{Var}}[Z^{(j)}] = v_j^\top S v_j = \lambda_j v_j^\top v_j = \lambda_j$.

  • Projections onto the PCs are uncorrelated, since for $i \ne j$, $\widehat{\operatorname{Cov}}(Z^{(i)}, Z^{(j)}) = v_i^\top S v_j = \lambda_j v_i^\top v_j = 0$.

  • The total sample variance is given by $\operatorname{Tr}(S) = \sum_{i=1}^p S_{ii} = \lambda_1 + \dots + \lambda_p$, so the proportion of total variance explained by the $j$th PC is $\lambda_j / (\lambda_1 + \dots + \lambda_p)$.
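These properties are easy to check numerically; the sketch below (assuming NumPy and toy data) verifies that the score variances equal the eigenvalues, the scores are uncorrelated, and the eigenvalues sum to $\operatorname{Tr}(S)$.

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 4)) @ rng.standard_normal((4, 4))   # toy data
Xc = X - X.mean(axis=0)

S = (Xc.T @ Xc) / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, V = eigvals[order], eigvecs[:, order]

Z = Xc @ V                                 # scores on all p PCs
cov_Z = (Z.T @ Z) / (Z.shape[0] - 1)       # sample covariance of the scores

print(np.allclose(np.diag(cov_Z), lam))                               # Var[Z^(j)] = lambda_j
print(np.allclose(cov_Z - np.diag(np.diag(cov_Z)), 0.0, atol=1e-10))  # scores are uncorrelated
print(np.isclose(np.trace(S), lam.sum()))                             # total variance = sum of eigenvalues
print("proportion of variance explained:", lam / lam.sum())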

4. Singular Value Decomposition (SVD)

  • Recall that any real-valued $n \times p$ matrix $X$ can be written as $X = U D V^\top$, where $U$ is an $n \times n$ orthogonal matrix (i.e., $U^\top U = U U^\top = I_n$), $D$ is an $n \times p$ matrix with decreasing non-negative elements on the diagonal (the singular values) and zero off-diagonal elements, and $V$ is a $p \times p$ orthogonal matrix (i.e., $V^\top V = V V^\top = I_p$).

  • SVD always exists even for non-square matrices.

  • Fast and numerically stable algorithms for SVD are available in most languages.

Let $X = U D V^\top$ be the SVD of the $n \times p$ data matrix $X$. We have $(n-1) S = X^\top X = (U D V^\top)^\top (U D V^\top) = V D^\top U^\top U D V^\top = V D^\top D V^\top$.

By analogy to the eigendecomposition $S = V \Lambda V^\top$:

  • $V$ from the SVD of $X$ is the same as $V$ from the eigendecomposition of $S$, and thus the columns of $V$ are the PCs;

  • $\Lambda = \frac{1}{n-1} D^\top D$, where $D_{ii} = \sqrt{(n-1)\lambda_i}$ for all $i \le \min(n, p)$, and if $p > n$, $\lambda_i = 0$ for all $i > n$.
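A sketch of this correspondence (assuming NumPy and centred toy data): PCA computed via np.linalg.svd agrees, up to sign flips of the PCs, with the eigendecomposition of $S$.

import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 6
X = rng.standard_normal((n, p))            # toy data
Xc = X - X.mean(axis=0)

# PCA via the SVD of the centred data matrix
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U diag(d) Vt
V_svd = Vt.T                                        # columns are the PCs
lam_svd = d ** 2 / (n - 1)                          # lambda_i = D_ii^2 / (n - 1)

# PCA via the eigendecomposition of the sample covariance, for comparison
S = (Xc.T @ Xc) / (n - 1)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam_eig, V_eig = eigvals[order], eigvecs[:, order]

print(np.allclose(lam_svd, lam_eig))                          # same variances
print(np.allclose(np.abs(V_svd.T @ V_eig), np.eye(p)))        # same PCs, up to sign flips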

5. PCA Projection from Gram Matrix

  • Recall that $K = X X^\top$, where $K_{ij} = x_i^\top x_j$, is called the Gram matrix of the dataset $X$.

  • $K$ and $(n-1) S = X^\top X$ have the same nonzero eigenvalues, equal to the nonzero squared singular values of $X$.

We have $X X^\top = (U D V^\top)(U D V^\top)^\top = U D V^\top V D^\top U^\top = U D D^\top U^\top$. Note that the columns of $U$ are the eigenvectors of $K$, but they are different from the eigenvectors of $X^\top X$ (which are the columns of $V$).

If we consider the projection onto all PCs, the transformed data matrix is $Z = X V = U D V^\top V = U D$. Hence, $Z$ can be obtained from the eigendecomposition of the Gram matrix $K$, which gives us $U$ directly, together with $D_{ii} = \sqrt{(n-1)\lambda_i}$. When $p \gg n$, the eigendecomposition of $K$ requires much less computation ($O(n^3)$) than the eigendecomposition of the covariance matrix ($O(p^3)$).
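A sketch of the Gram-matrix route for the $p \gg n$ case (assuming NumPy and centred toy data); the scores from the eigendecomposition of $K$ match, up to sign, those computed as $XV$ from the SVD route.

import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 2000                            # p >> n: the case where the Gram route pays off
X = rng.standard_normal((n, p))
Xc = X - X.mean(axis=0)                    # centred data

K = Xc @ Xc.T                              # n x n Gram matrix
eigvals, U = np.linalg.eigh(K)             # O(n^3) instead of O(p^3)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

d = np.sqrt(np.clip(eigvals, 0.0, None))   # D_ii = sqrt((n-1) lambda_i) = singular values of Xc
Z_gram = U * d                             # Z = U D: scores on all PCs

# sanity check against Z = X V from the SVD route; centring makes the rank at most n - 1,
# so the last column is numerically zero and is skipped in the comparison
Z_direct = Xc @ np.linalg.svd(Xc, full_matrices=False)[2].T
print(np.allclose(np.abs(Z_gram[:, :n - 1]), np.abs(Z_direct[:, :n - 1]), atol=1e-6))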