
Computation with Large Matrices

1. Least Squares

  • Four ways to solve least squares (compared in a short sketch after this list):

         1. The SVD of \(A\) leads to its pseudo-inverse \(A^+\), then \(\mathbf{x}^+=A^+\mathbf{b}\).

         2. When \(A\) has independent columns, solve the normal equation \(A^\top A\widehat{\mathbf{x}}=A^\top \mathbf{b}\) to minimize \(\|A\mathbf{x}-\mathbf{b}\|^2\).

         3. When \(A\) has independent columns, Gram-Schmidt \(A=QR\) leads to \(\widehat{\mathbf{x}}=R^{-1}Q^\top \mathbf{b}\).

         4. Minimize \(\|A\mathbf{x}-\mathbf{b}\|^2+\delta^2\|\mathbf{x}\|^2\); the minimizer is \((A^\top A+\delta^2 I)^{-1}A^\top \mathbf{b}\), and the best \(\mathbf{x}\) is its limit as \(\delta \to 0\).
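The sketch below compares the four approaches above on a random tall matrix with independent columns (a minimal NumPy illustration; the matrix sizes and seed are arbitrary choices). In this full-column-rank case all four give the same \(\widehat{\mathbf{x}}\).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))   # tall matrix, independent columns (assumed)
b = rng.standard_normal(8)

# 1. Pseudo-inverse from the SVD
x_svd = np.linalg.pinv(A) @ b

# 2. Normal equation  A^T A x = A^T b
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

# 3. Gram-Schmidt / QR:  x = R^{-1} Q^T b
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# 4. Penalty (ridge) with a small delta, approaching the limit delta -> 0
delta = 1e-8
x_pen = np.linalg.solve(A.T @ A + delta**2 * np.eye(3), A.T @ b)

print(np.allclose(x_svd, x_ne), np.allclose(x_svd, x_qr), np.allclose(x_svd, x_pen))
```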

1.1. Pseudo-Inverse Method

  • If \(A\) is invertible, then \(A^+=A^{-1}\).

  • The pseudo-inverse of \(A=U\Sigma V^\top\) is \(A^+=V\Sigma^+U^\top\).

  • Sometimes \(A^+\) is also written as \(A^\dagger\).

Example 1.1    Consider \(\Sigma=\begin{bmatrix} \sigma_1 & 0 & 0 & 0 \\ 0 & \sigma_2 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}\), then \(\Sigma^+=\begin{bmatrix} 1/\sigma_1 & 0 & 0 \\ 0 & 1/\sigma_2 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\).

  • \(\mathbf{x}^+=A^+\mathbf{b}\) is the minimum norm least squares solution.

Example 1.2    Consider \(\begin{bmatrix} 3 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}=\begin{bmatrix} 6 \\ 8 \end{bmatrix}\), then \(\mathbf{x}^+=A^+\mathbf{b}=\begin{bmatrix} 1/3 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 6 \\ 8 \end{bmatrix}=\begin{bmatrix} 2 \\ 0 \end{bmatrix}\).
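A quick check of Example 1.2 in NumPy (a minimal sketch; `np.linalg.pinv` computes \(A^+\) through the SVD):

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [0.0, 0.0]])        # singular matrix from Example 1.2
b = np.array([6.0, 8.0])

x_plus = np.linalg.pinv(A) @ b    # minimum norm least squares solution x^+ = A^+ b
print(x_plus)                     # [2. 0.]
```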

1.2. Normal Equation Method

  • Suppose \(A\) has independent columns, then \(\widehat{\mathbf{x}}=(A^\top A)^{-1}A^\top \mathbf{b}\).

  • The projection of \(\mathbf{b}\) onto the column space of \(A\) is \(\mathbf{p}=A\widehat{\mathbf{x}}=A(A^\top A)^{-1}A^\top \mathbf{b}\).
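A small NumPy sketch of both formulas (the sizes and seed are arbitrary): the residual \(\mathbf{b}-\mathbf{p}\) should be orthogonal to every column of \(A\).

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 2))            # independent columns (assumed)
b = rng.standard_normal(6)

x_hat = np.linalg.solve(A.T @ A, A.T @ b)  # normal equation
p = A @ x_hat                              # projection of b onto the column space of A

print(np.allclose(A.T @ (b - p), 0))       # b - p is orthogonal to C(A)
```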

1.3. Gram-Schmidt Method

Suppose \(A\) has independent columns \(\mathbf{a}_1, \ldots, \mathbf{a}_n\).

Let \(\displaystyle \mathbf{q}_1=\frac{\mathbf{a}_1}{\|\mathbf{a}_1\|}\), \(\displaystyle A_k=\mathbf{a}_k-\sum_{i=1}^{k-1} (\mathbf{a}_k^\top \mathbf{q}_i)\mathbf{q}_i\), and \(\displaystyle \mathbf{q}_k=\frac{A_k}{\|A_k\|}\). Then each \(\mathbf{a}_k\) is a combination of \(\mathbf{q}_1\) to \(\mathbf{q}_k\): \(\mathbf{a}_1=\|\mathbf{a}_1\|\mathbf{q}_1\) and \(\displaystyle \mathbf{a}_k=\sum_{i=1}^{k-1} (\mathbf{a}_k^\top \mathbf{q}_i)\mathbf{q}_i+\|A_k\|\mathbf{q}_k\).

Hence, the matrix \(R=Q^\top A\) with entries \(r_{ij}=\mathbf{q}_i^\top \mathbf{a}_j\) is upper triangular, and \(A=QR\): \[\begin{bmatrix}& \\ \mathbf{a}_1 & \cdots & \mathbf{a}_n \\ & \end{bmatrix}=\begin{bmatrix}& \\ \mathbf{q}_1 & \cdots & \mathbf{q}_n \\ &\end{bmatrix}\begin{bmatrix}r_{11} & \cdots & r_{1n} \\ 0 & \ddots & \vdots \\0 & 0 & r_{nn}\end{bmatrix}.\]

The MATLAB command is [Q, R]=qr(A).

  • Suppose \(A\) has independent columns; then by Gram-Schmidt \(A=QR\) with \(Q^\top Q=I\), so \(A^\top A=R^\top Q^\top QR=R^\top R\), and thus \(\widehat{\mathbf{x}}=(R^\top R)^{-1}R^\top Q^\top \mathbf{b}=R^{-1}Q^\top \mathbf{b}\) (see the sketch after this list).

  • We can modify Gram-Schmidt with column pivoting.
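A minimal NumPy version of this least-squares route (the data are arbitrary; `np.linalg.qr` returns the reduced factorization):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 4))     # independent columns (assumed)
b = rng.standard_normal(10)

Q, R = np.linalg.qr(A)               # reduced (economy-size) QR
x_hat = np.linalg.solve(R, Q.T @ b)  # solve the triangular system R x = Q^T b

print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))
```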

1.4. Penalty Method

The penalty method regularizes a singular problem.

  • The new normal equation is \((A^*)^\top A^*\widehat{\mathbf{x}}=(A^*)^\top \mathbf{b}^*\), where \(A^*=\begin{bmatrix} A \\ \delta I \end{bmatrix}\) and \(\mathbf{b}^*=\begin{bmatrix} \mathbf{b} \\ \mathbf{0} \end{bmatrix}\), i.e., \[(A^\top A+\delta^2I)\widehat{\mathbf{x}}=A^\top \mathbf{b}.\]

  • For any \(A\), \((A^\top A+\delta^2I)^{-1}A^\top \to A^+\) as \(\delta \to 0\).

Proof. Suppose \(A=U\Sigma V^\top\); then, using \(U^\top U=I\) and \(VV^\top =I\), \[A^\top A+\delta^2I=V\Sigma^\top U^\top U\Sigma V^\top +\delta^2I=V\Sigma^\top \Sigma V^\top +\delta^2VV^\top =V(\Sigma^\top \Sigma+\delta^2I)V^\top. \] Therefore, \[(A^\top A+\delta^2I)^{-1}A^\top =V(\Sigma^\top \Sigma+\delta^2I)^{-1}V^\top V\Sigma^\top U^\top =V[(\Sigma^\top \Sigma+\delta^2I)^{-1}\Sigma^\top ]U^\top.\]

Since \((\Sigma^\top \Sigma+\delta^2I)^{-1}\Sigma^\top\) is diagonal with entries \(\displaystyle \frac{\sigma_i}{\sigma_i^2+\delta^2}\) and zeros elsewhere, each entry tends to \(1/\sigma_i\) when \(\sigma_i>0\) and stays \(0\) when \(\sigma_i=0\). Hence \((\Sigma^\top \Sigma+\delta^2I)^{-1}\Sigma^\top \to \Sigma^+\) as \(\delta \to 0\), and therefore \[V[(\Sigma^\top \Sigma+\delta^2I)^{-1}\Sigma^\top ]U^\top \to V\Sigma^+U^\top =A^+.\]

\(\square\)
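A quick numerical check of this limit on a rank-deficient matrix (a sketch; the construction of \(A\) and the values of \(\delta\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3)) @ np.diag([1.0, 1.0, 0.0])  # rank 2 by construction

for delta in [1e-1, 1e-3, 1e-6]:
    approx = np.linalg.solve(A.T @ A + delta**2 * np.eye(3), A.T)  # (A^T A + d^2 I)^{-1} A^T
    print(delta, np.linalg.norm(approx - np.linalg.pinv(A)))       # error shrinks with delta
```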

2. Numerical Linear Algebra

2.1. Krylov Subspace and Arnoldi Iteration

  • Matrix-vector multiplication is fast, especially if \(A\) is sparse. Starting from \(A\) and \(\mathbf{b}\), we can quickly compute the vectors \(\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}=A(A\mathbf{b}), \ldots, A^{j-1}\mathbf{b}\). Their combinations form the Krylov subspace \(\mathcal{K}_j\), and we look for the closest approximation \(\mathbf{x}_j \in \mathcal{K}_j\) to the desired solution \(\mathbf{x}\).

  • (Arnoldi Iteration) An orthonormal basis is better to work with than the original basis \(\mathbf{b}, A\mathbf{b}, \ldots, A^{j-1}\mathbf{b}\).

    • Let \(\displaystyle \mathbf{q}_1=\frac{\mathbf{b}}{\|\mathbf{b}\|}\), and suppose \(\mathbf{q}_2, \ldots, \mathbf{q}_k\) are already known;
    • Let \(\mathbf{v}=A\mathbf{q}_k\);
    • For \(j=1\) to \(k\):
      • \(h_{jk}=\mathbf{q}_j^\top \mathbf{v}\);
      • \(\mathbf{v}=\mathbf{v}-h_{jk}\mathbf{q}_j\);
    • Let \(h_{k+1, k}=\|\mathbf{v}\|\);
    • Let \(\displaystyle \mathbf{q}_{k+1}=\frac{\mathbf{v}}{h_{k+1, k}}\).

The Arnoldi iteration follows the Gram-Schmidt idea.
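Below is a minimal NumPy sketch of the iteration above (the function name, test sizes, and final checks are my own; it assumes no breakdown, i.e., \(h_{k+1,k}\neq 0\)):

```python
import numpy as np

def arnoldi(A, b, k):
    """k Arnoldi steps: Q (n x (k+1)) with orthonormal columns, Hessenberg H ((k+1) x k)."""
    n = len(b)
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(k):
        v = A @ Q[:, j]
        for i in range(j + 1):             # subtract the components along q_1, ..., q_{j+1}
            H[i, j] = Q[:, i] @ v
            v -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(v)
        Q[:, j + 1] = v / H[j + 1, j]      # assumes no breakdown (h_{k+1,k} != 0)
    return Q, H

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 50))
b = rng.standard_normal(50)
Q, H = arnoldi(A, b, 10)
print(np.allclose(A @ Q[:, :10], Q @ H))                     # A Q_j = Q_{j+1} H_{j+1, j}
print(np.allclose(Q[:, :10].T @ A @ Q[:, :10], H[:10, :]))   # Q_j^T A Q_j = H_j
```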

Assume we use the standard Gram-Schmidt process to obtain an orthonormal basis for \(\mathcal{K}_{j+1}\); then \[\mathbf{y}_j=A^j\mathbf{b}-\sum_{i=1}^j (\mathbf{q}_i^\top A^j\mathbf{b})\mathbf{q}_i\] and \[\mathbf{q}_{j+1}=\frac{\mathbf{y}_j}{\|\mathbf{y}_j\|}.\] Hence, \(\{\mathbf{q}_1, \ldots, \mathbf{q}_{j+1}\}\) is an orthonormal basis for \(\mathcal{K}_{j+1}\).

Since \(\displaystyle \mathbf{q}_1=\frac{\mathbf{b}}{\|\mathbf{b}\|}\) and \(A\mathbf{q}_1=c_1\mathbf{q}_1+c_2\mathbf{q}_2\), then \[\begin{aligned}\mathcal{K}_{j+1}&=\text{span}\{\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, \ldots, A^j\mathbf{b}\}\\&=\text{span}\{\mathbf{q}_1, A\mathbf{q}_1, A^2\mathbf{q}_1, \ldots, A^j\mathbf{q}_1\}\\&=\text{span}\{\mathbf{q}_1, c_1\mathbf{q}_1+c_2\mathbf{q}_2, A(c_1\mathbf{q}_1+c_2\mathbf{q}_2), \ldots, A^{j-1}(c_1\mathbf{q}_1+c_2\mathbf{q}_2)\}\\&=\text{span}\{\mathbf{q}_1, \mathbf{q}_2, A\mathbf{q}_2, \ldots, A^{j-1}\mathbf{q}_2\}\\&\ \ \vdots\\&=\text{span}\{\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_j, A\mathbf{q}_j\}.\end{aligned}\]

Applying Gram-Schmidt again, we have \[\mathbf{v}_j=A\mathbf{q}_j-\sum_{i=1}^j (\mathbf{q}_i^\top A\mathbf{q}_j)\mathbf{q}_i\] and we may equivalently take \[\mathbf{q}_{j+1}=\frac{\mathbf{v}_j}{\|\mathbf{v}_j\|}.\]

Since \[\mathbf{q}_{j+1}^\top \mathbf{v}_j=\mathbf{q}_{j+1}^\top A\mathbf{q}_j-\sum_{i=1}^j (\mathbf{q}_i^\top A\mathbf{q}_j)\mathbf{q}_{j+1}^\top \mathbf{q}_i=\mathbf{q}_{j+1}^\top A\mathbf{q}_j\] and \[\mathbf{q}_{j+1}^\top \mathbf{v}_j=\frac{\mathbf{v}_j^\top \mathbf{v}_j}{\|\mathbf{v}_j\|}=\|\mathbf{v}_j\|,\] then \(\|\mathbf{v}_j\|=\mathbf{q}_{j+1}^\top A\mathbf{q}_j\).

Therefore, \[\begin{aligned}A\mathbf{q}_j&=\mathbf{v}_j+\sum_{i=1}^j (\mathbf{q}_i^\top A\mathbf{q}_j)\mathbf{q}_i\\&=\|\mathbf{v}_j\|\mathbf{q}_{j+1}+\sum_{i=1}^j (\mathbf{q}_i^\top A\mathbf{q}_j)\mathbf{q}_i\\&=(\mathbf{q}_{j+1}^\top A\mathbf{q}_j)\mathbf{q}_{j+1}+\sum_{i=1}^j (\mathbf{q}_i^\top A\mathbf{q}_j)\mathbf{q}_i\\&=\sum_{i=1}^{j+1} (\mathbf{q}_i^\top A\mathbf{q}_j)\mathbf{q}_i\\&=\sum_{i=1}^{j+1} h_{ij}\mathbf{q}_i\\&=\begin{bmatrix}& \\ \mathbf{q}_1 & \cdots & \mathbf{q}_j \\ &\end{bmatrix}\begin{bmatrix}h_{1j} \\ \vdots \\ h_{jj}\end{bmatrix}+\mathbf{q}_{j+1}h_{j+1, j}.\end{aligned}\]

Hence, \[\begin{aligned}AQ_j&=\begin{bmatrix}& \\ A\mathbf{q}_1 & A\mathbf{q}_2 & \cdots & A\mathbf{q}_{j-1} & A\mathbf{q}_j \\ &\end{bmatrix}\\&=\begin{bmatrix}& \\ \mathbf{q}_1 & \mathbf{q}_2 & \cdots & \mathbf{q}_j & \mathbf{q}_{j+1} \\ &\end{bmatrix}\begin{bmatrix}h_{11} & h_{12} & h_{13} & \cdots & h_{1j} \\h_{21} & h_{22} & h_{23} & \cdots & h_{2j} \\ 0 & h_{32} & h_{33} & \cdots & h_{3j} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & h_{j, j-1} & h_{jj} \\ 0 & \cdots & \cdots & 0 & h_{j+1, j}\end{bmatrix}\\&=Q_{j+1}H_{j+1, j}\end{aligned}.\]

Besides, we have \[Q_j^\top AQ_j=Q_j^\top Q_{j+1}H_{j+1, j}=\begin{bmatrix} I_{j \times j} & \mathbf{0}_{j \times 1} \end{bmatrix}\begin{bmatrix}H_j \\ (j+1)^\text{th}\ \text{row} \end{bmatrix}=H_j,\] where \(H_j\) is a Hessenberg matrix (upper triangular plus one nonzero sub-diagonal). In addition, \(H_j=Q_j^\top AQ_j\) is the projection of \(A\) onto the Krylov subspace, using the basis of \(\mathbf{q}\)'s.

2.2. Computation of Eigenvalue and Singular Value

2.2.1. QR Algorithm for Eigenvalue

  • QR algorithm with shifts:

         Step 1: Reduce \(A\) to a similar Hessenberg form \(A_0\);

         Step 2: QR with shift:

             (1) Choose a shift \(s_k\) at step \(k\), where \(k=0, \ldots\);

             (2) Factor \(A_k-s_kI=Q_kR_k\);

             (3) Reverse factors and shift back: \(A_{k+1}=R_kQ_k+s_kI\).

  • Well-chosen shifts \(s_k\) will greatly speed up the approach of the matrices \(A_k\) to a diagonal matrix \(\Lambda\).

  • If the matrix \(S\) is symmetric, then we can reduce \(S\) to a symmetric Hessenberg matrix, i.e., a tridiagonal matrix (a sketch of the shifted iteration follows this list).
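Below is a simplified sketch of the shifted QR algorithm for a symmetric matrix (the function name, tolerance, and the basic deflation step that shrinks the active block are my own additions; `scipy.linalg.hessenberg` performs Step 1):

```python
import numpy as np
from scipy.linalg import hessenberg

def qr_with_shifts(S, tol=1e-12, max_iter=1000):
    """Shifted QR iteration on a symmetric matrix S (simplified sketch with deflation)."""
    A = hessenberg(S)                  # Step 1: similar Hessenberg form (tridiagonal, S = S^T)
    n = A.shape[0]
    eigs = []
    m = n                              # size of the active (not yet deflated) block
    for _ in range(max_iter):
        if m == 1:
            eigs.append(A[0, 0])
            break
        if abs(A[m - 1, m - 2]) < tol:          # last off-diagonal ~ 0: eigenvalue found
            eigs.append(A[m - 1, m - 1])
            m -= 1
            continue
        s = A[m - 1, m - 1]                     # (1) shift: bottom-right entry of the block
        Q, R = np.linalg.qr(A[:m, :m] - s * np.eye(m))   # (2) factor A_k - s_k I
        A[:m, :m] = R @ Q + s * np.eye(m)                # (3) reverse factors and shift back
    return np.sort(np.array(eigs))

S = np.random.default_rng(5).standard_normal((6, 6))
S = S + S.T
print(np.allclose(qr_with_shifts(S), np.linalg.eigvalsh(S)))
```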

2.2.2. Golub-Kahan Algorithm for Singular Value

  • Eigenvalues are the same for \(S\) and \(Q^{-1}SQ=Q^\top SQ\), so we have limited freedom to create zeros in \(Q^{-1}SQ\). The best achievable \(Q^{-1}SQ\) is a tridiagonal matrix.

  • Singular values are the same for \(A\) and \(Q_1AQ_2^\top\) even if \(Q_1\) is different from \(Q_2\), so we have more freedom to create zeros in \(Q_1AQ_2^\top\). With the right \(Q\)'s, \(Q_1AQ_2^\top\) can be made bidiagonal.

  • Golub-Kahan algorithm:

         Step 1: Find \(Q_1\) and \(Q_2\) so that \(Q_1AQ_2^\top\) is bidiagonal;

         Step 2: Apply the shifted QR algorithm to find the eigenvalues of \[(Q_1AQ_2^\top )^\top (Q_1AQ_2^\top )=Q_2A^\top AQ_2^\top,\] which is tridiagonal; the singular values of \(A\) are the square roots of these eigenvalues.
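The sketch below does not implement the bidiagonalization itself; it only checks the two facts the algorithm rests on, using random orthogonal \(Q_1, Q_2\) obtained from QR factorizations (my own construction): orthogonal factors preserve singular values, and the singular values of \(A\) are the square roots of the eigenvalues of \((Q_1AQ_2^\top)^\top(Q_1AQ_2^\top)\).

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((7, 4))

# Random orthogonal Q1, Q2 from QR factorizations of random matrices
Q1, _ = np.linalg.qr(rng.standard_normal((7, 7)))
Q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B = Q1 @ A @ Q2.T

# Orthogonal factors leave the singular values unchanged
print(np.allclose(np.linalg.svd(A, compute_uv=False),
                  np.linalg.svd(B, compute_uv=False)))

# Singular values of A are the square roots of the eigenvalues of B^T B
print(np.allclose(np.sort(np.linalg.svd(A, compute_uv=False)),
                  np.sort(np.sqrt(np.linalg.eigvalsh(B.T @ B)))))
```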

3. Random Matrix Multiplication

  • Let \(\displaystyle p_j=\frac{\|\mathbf{a}_j\|\|\mathbf{b}_j^\top \|}{C}\), where \(\mathbf{a}_j\) is the column \(j\) of \(A\), \(\mathbf{b}_j^\top\) is the row \(j\) of \(B\), and \(\displaystyle C=\sum_{j=1}^n \|\mathbf{a}_j\|\|\mathbf{b}_j^\top \|\).

  • Take \(\mathbf{a}_j\) and \(\mathbf{b}_j^\top\) with probability \(p_j\); each of the \(s\) trials produces a matrix \(\displaystyle X_j=\frac{\mathbf{a}_j\mathbf{b}_j^\top }{sp_j}\). The mean of each trial is \[\mathbb{E}[X]=\sum_{j=1}^n p_jX_j=\frac{1}{s}\sum_{j=1}^n \mathbf{a}_j\mathbf{b}_j^\top =\frac{1}{s}AB.\] The approximation of \(AB\) is the sum of the \(s\) trials, and we have \(s\mathbb{E}[X]=AB\) (a sampling sketch follows the proof at the end of this section).

  • The variance of each trial is \[\begin{aligned} \text{Var}[X]&=\mathbb{E}[X^2]-(\mathbb{E}[X])^2 \\&=\sum_{j=1}^n p_j\frac{\|\mathbf{a}_j\|^2\|\mathbf{b}_j^\top \|^2}{s^2p_j^2}-\frac{1}{s^2}\|AB\|_F^2 \\&=\frac{1}{s^2}(C^2-\|AB\|_F^2), \end{aligned}\] where the last step uses \(\displaystyle p_j=\frac{\|\mathbf{a}_j\|\|\mathbf{b}_j^\top \|}{C}\). The variance of the sum of \(s\) samples is \(\displaystyle s\text{Var}[X]=\frac{1}{s}(C^2-\|AB\|_F^2)\).

  • Norm-squared sampling uses the optimal probabilities \(p_j\) for minimum variance.

Proof. We want to solve \[\min \sum_{j=1}^n \frac{\|\mathbf{a}_j\|^2\|\mathbf{b}_j^\top \|^2}{sp_j}-\frac{1}{s}\|AB\|_F^2\ \ \text{s.t.}\ \ \sum_{j=1}^n p_j=1.\] By Lagrange, \[L(p_1, \ldots, p_n, \lambda)=\sum_{j=1}^n \frac{\|\mathbf{a}_j\|^2\|\mathbf{b}_j^\top \|^2}{sp_j}+\lambda\left(\sum_{j=1}^n p_j-1\right).\]

Take the partial derivatives w.r.t. \(p_j\) and \(\lambda\), let \[\frac{\partial L}{\partial p_j}=-\frac{\|\mathbf{a}_j\|^2\|\mathbf{b}_j^\top \|^2}{sp_j^2}+\lambda=0\] and \[\frac{\partial L}{\partial \lambda}=\sum_{j=1}^n p_j-1=0,\] then \(\displaystyle p_j=\frac{\|\mathbf{a}_j\|\|\mathbf{b}_j^\top \|}{\sqrt{s\lambda}}\) and \(\displaystyle \sum_{j=1}^n p_j=1\).

Therefore, \(\displaystyle \sum_{j=1}^n \frac{\|\mathbf{a}_j\|\|\mathbf{b}_j^\top \|}{\sqrt{s\lambda}}=1\) gives \(\sqrt{s\lambda}=C\). Hence, \(\displaystyle p_j=\frac{\|\mathbf{a}_j\|\|\mathbf{b}_j^\top \|}{C}\) solves the optimization problem, i.e., norm-squared sampling uses the optimal probabilities \(p_j\) for minimum variance.

\(\square\)
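A small NumPy sketch of norm-squared sampling (the function name, matrix sizes, and sample counts are arbitrary choices); the relative error should shrink roughly like \(1/\sqrt{s}\).

```python
import numpy as np

def sampled_product(A, B, s, rng):
    """Approximate AB by s sampled rank-one terms a_j b_j^T / (s p_j), norm-squared sampling."""
    col_norms = np.linalg.norm(A, axis=0)          # ||a_j||
    row_norms = np.linalg.norm(B, axis=1)          # ||b_j^T||
    p = col_norms * row_norms
    p = p / p.sum()                                # p_j = ||a_j|| ||b_j^T|| / C
    idx = rng.choice(A.shape[1], size=s, p=p)
    approx = np.zeros((A.shape[0], B.shape[1]))
    for j in idx:
        approx += np.outer(A[:, j], B[j, :]) / (s * p[j])
    return approx

rng = np.random.default_rng(7)
A = rng.standard_normal((100, 400))
B = rng.standard_normal((400, 80))
for s in [50, 200, 800]:
    err = np.linalg.norm(sampled_product(A, B, s, rng) - A @ B, 'fro')
    print(s, err / np.linalg.norm(A @ B, 'fro'))   # relative error shrinks as s grows
```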