
Kernel Method

1. Kernel Method

  • Kernel methods are a powerful class of non-linear machine learning algorithms, based on employing linear methods in a nonlinearly transformed feature space.

2. Feature Map and Feature Space

Definition 2.1    Let \(\mathcal{H}\) be a vector space over \(\mathbb{R}\). A function \(\langle \cdot, \cdot \rangle_\mathcal{H}: \mathcal{H} \times \mathcal{H} \to \mathbb{R}\) is said to be an inner product on \(\mathcal{H}\) if

    (1) \(\langle \alpha_1f_1+\alpha_2f_2, g \rangle_\mathcal{H}=\alpha_1 \langle f_1, g \rangle_\mathcal{H}+\alpha_2 \langle f_2, g \rangle_\mathcal{H}\);

    (2) \(\langle f, g \rangle_\mathcal{H}=\langle g, f \rangle_\mathcal{H}\);

    (3) \(\langle f, f \rangle_\mathcal{H} \geq 0\);

    (4) \(\langle f, f \rangle_\mathcal{H}=0\) iff \(f=0\).

The norm is \(\|f\|_\mathcal{H}=\sqrt{\langle f, f \rangle_\mathcal{H}}\).

Definition 2.2    A Hilbert space is a vector space on which we define an inner product, \(\langle \cdot, \cdot \rangle_\mathcal{H}\), and which is complete w.r.t. the norm induced by \(\langle \cdot, \cdot \rangle_\mathcal{H}\).

Definition 2.3    A function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is called a kernel if there exists a Hilbert space \(\mathcal{H}\) and a map \(\varphi: \mathcal{X} \to \mathcal{H}\) s.t. for all \(x, x' \in \mathcal{X}\), \[k(x, x'):=\langle \varphi(x), \varphi(x') \rangle_\mathcal{H}.\] \(\mathcal{H}\) is known as a feature space and \(\varphi\) is known as a feature map of \(k\).

There are no conditions on \(\mathcal{X}\): it does not need to be numeric. For example, we can define features and kernels for text.
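
For instance, on \(\mathcal{X}=\mathbb{R}^2\) the homogeneous quadratic kernel \(k(x, x')=(x^\top x')^2\) admits the explicit feature map \(\varphi(x)=(x_1^2, x_2^2, \sqrt{2}x_1x_2)\) with feature space \(\mathcal{H}=\mathbb{R}^3\). The following minimal NumPy sketch (illustrative only; all names are our own) checks \(k(x, x')=\langle \varphi(x), \varphi(x') \rangle\) numerically:

```python
import numpy as np

# Explicit feature map for the homogeneous quadratic kernel on R^2:
# k(x, x') = (x^T x')^2  with  phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
def phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def k(x, xp):
    return float(np.dot(x, xp)) ** 2

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

# The kernel value agrees with the inner product in the feature space R^3.
assert np.isclose(k(x, xp), np.dot(phi(x), phi(xp)))
```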

Definition 2.4    A symmetric function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is positive definite iff for all \(n \geq 1\), \(a_i \in \mathbb{R}\), \(x_i \in \mathcal{X}\), \[\sum_{i=1}^n \sum_{j=1}^n a_ia_jk(x_i, x_j) \geq 0.\] The function \(k(\cdot, \cdot)\) is strictly positive definite if for mutually distinct \(x_i\), the equality holds only when all the \(a_i\) are zero.

Theorem 2.1    Let \(\mathcal{H}\) be any Hilbert space, \(\mathcal{X}\) a non-empty set, and \(\varphi: \mathcal{X} \to \mathcal{H}\). Then \[k(x, x')=\langle \varphi(x), \varphi(x') \rangle_\mathcal{H}\] is a positive definite function.

Proof. We have \[\sum_{i=1}^n \sum_{j=1}^n a_ia_jk(x_i, x_j)=\sum_{i=1}^n \sum_{j=1}^n \langle a_i\varphi(x_i), a_j\varphi(x_j) \rangle_\mathcal{H}=\left\langle \sum_{i=1}^n a_i\varphi(x_i), \sum_{j=1}^n a_j\varphi(x_j) \right\rangle_\mathcal{H}=\left\| \sum_{i=1}^n a_i\varphi(x_i) \right\|_\mathcal{H}^2 \geq 0.\]

\(\square\)
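
In matrix form, Theorem 2.1 says that every Gram matrix \(K_{ij}=k(x_i, x_j)\) built from a kernel is positive semi-definite. A small numerical sketch (NumPy, Gaussian kernel with an arbitrary bandwidth) checks this through the eigenvalues of a Gram matrix:

```python
import numpy as np

# Every Gram matrix of a kernel is positive semi-definite, so its eigenvalues
# are >= 0 up to numerical error.  Checked here for the Gaussian kernel with
# bandwidth gamma = 1 (an arbitrary choice) on random points.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                            # 50 points in R^3
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-0.5 * sq_dists)                             # Gaussian kernel Gram matrix
print(np.linalg.eigvalsh(K).min())                      # smallest eigenvalue, ~ >= 0
```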

We can consider an infinite-dimensional Hilbert space as a space of functions. By definition, \(\varphi(x) \in \mathcal{H}\), so we can think of \(\varphi\) as mapping each \(x\) to a function, i.e. \(\varphi: \mathcal{X} \to (\mathcal{X} \to \mathbb{R})\).

3. Reproducing Kernel Hilbert Space

A particularly important class of Hilbert spaces is reproducing kernel Hilbert space (RKHS).

Definition 3.1    Let \(\mathcal{H}\) be a Hilbert space of functions \(f: \mathcal{X} \to \mathbb{R}\). A function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is called a reproducing kernel of \(\mathcal{H}\) if it satisfies:

    (1) for all \(x \in \mathcal{X}\), \(k_x:=k(\cdot, x) \in \mathcal{H}\);

    (2) for all \(x \in \mathcal{X}\) and \(f \in \mathcal{H}\), \(\langle f, k(\cdot, x) \rangle_\mathcal{H}=f(x)\) (reproducing property).

If \(\mathcal{H}\) has a reproducing kernel, it is called a reproducing kernel Hilbert space (RKHS).

Definition 3.1'    \(\mathcal{H}\) is an RKHS if the evaluation functional \(\delta_x: \mathcal{H} \to \mathbb{R}\), \(\delta_xf=f(x)\) is continuous for all \(x \in \mathcal{X}\). Equivalently, \(\mathcal{H}\) is an RKHS if each \(\delta_x\) is a bounded linear functional, i.e. for every \(x \in \mathcal{X}\) there exists \(C_x>0\) s.t. \[|\delta_xf|=|f(x)| \leq C_x\|f\|_\mathcal{H}\] for all \(f \in \mathcal{H}\).

If \(\|f-g\|_\mathcal{H}=0\), then \(f(x)=g(x)\) for all \(x \in \mathcal{X}\), since \(|f(x)-g(x)|=|\delta_x(f-g)| \leq C_x\|f-g\|_\mathcal{H}=0\).

Theorem 3.1    A reproducing kernel is a valid kernel with the feature map \(\varphi: x \mapsto k(\cdot, x)\), known as the canonical feature map.

Proof. Let \(k\) be the reproducing kernel, then \(k(\cdot, x') \in \mathcal{H}\) and \(\langle k(\cdot, x'), k(\cdot, x) \rangle_\mathcal{H}=k(\cdot, x')(x)=k(x, x')\). Hence, we have \[k(x, x')=\langle k(\cdot, x'), k(\cdot, x) \rangle_\mathcal{H}=\langle \varphi(x'), \varphi(x) \rangle_\mathcal{H},\] i.e. \(k\) is a kernel.

\(\square\)

Theorem 3.2 (Moore-Aronszajn Theorem)    Every positive definite function \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) is also a reproducing kernel with a unique corresponding RKHS.

Proof. Let \(k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) be a (symmetric) positive definite function.

Define a function space \(\mathcal{H}_0\) as \(\text{span}\{k(\cdot, x): x \in \mathcal{X}\}\), i.e. \[\mathcal{H}_0:=\left\{\sum_{i=1}^r \alpha_ik(\cdot, x_i): r \in \mathbb{N}, \alpha_i \in \mathbb{R}, x_i \in \mathcal{X}\right\},\] the set of finite linear combinations of the functions \(k(\cdot, x)\).

Define a functional \(h: (\mathcal{X} \to \mathbb{R}) \times (\mathcal{X} \to \mathbb{R}) \to \mathbb{R}\) by \[h(k(\cdot, x_i), k(\cdot, x_j))=k(x_i, x_j)\] for \(k(\cdot, x_i), k(\cdot, x_j) \in \mathcal{H}_0\). This is well defined: the function \(k(\cdot, x_i)\) determines the values \(k(x, x_i)\) for all \(x \in \mathcal{X}\), and likewise \(k(\cdot, x_j)\) determines the values \(k(x, x_j)\), so there is only one value \(k(x_i, x_j)\) compatible with both \(k(\cdot, x_i)\) and \(k(\cdot, x_j)\).

Define an inner product on \(\mathcal{H}_0\), \(\langle \cdot, \cdot \rangle_\mathcal{H}: \mathcal{H}_0 \times \mathcal{H}_0 \to \mathbb{R}\), as follows. Consider an arbitrary pair \(f, g \in \mathcal{H}_0\), i.e. \[f=\sum_{i=1}^r \alpha_ik(\cdot, x_i) \quad \text{and} \quad g=\sum_{j=1}^s \beta_jk(\cdot, x_j').\]

Define \[\langle f, g \rangle_\mathcal{H}:=\sum_{i=1}^r \sum_{j=1}^s \alpha_i\beta_j h(k(\cdot, x_i), k(\cdot, x_j'))=\sum_{i=1}^r \sum_{j=1}^s \alpha_i\beta_jk(x_i, x_j').\]

Define \(\mathcal{H}\) to be the completion of \(\mathcal{H}_0\) w.r.t. the norm \(\|f\|_\mathcal{H}=\sqrt{\langle f, f \rangle_\mathcal{H}}\).

We now need to show that \(\langle \cdot, \cdot \rangle_\mathcal{H}\) is a valid inner product.

  • Linearity: Let \(f_1, f_2, g \in \mathcal{H}_0\) with \(f_1=\sum_{i=1}^{r_1} \alpha_{i, 1}k(\cdot, x_{i, 1})\), \(f_2=\sum_{i=1}^{r_2} \alpha_{i, 2}k(\cdot, x_{i, 2})\) and \(g=\sum_{j=1}^s \beta_jk(\cdot, x_j')\). We have \[af_1+bf_2=a\sum_{i=1}^{r_1} \alpha_{i, 1}k(\cdot, x_{i, 1})+b\sum_{i=1}^{r_2} \alpha_{i, 2}k(\cdot, x_{i, 2})=\sum_{i=1}^{r_1+r_2} \alpha_ik(\cdot, x_i)\] where \(\alpha_i=a\alpha_{i, 1}\), \(x_i=x_{i, 1}\) for \(i \leq r_1\) and \(\alpha_i=b\alpha_{i-r_1, 2}\), \(x_i=x_{i-r_1, 2}\) for \(i>r_1\). Then \[\begin{aligned} \langle af_1+bf_2, g \rangle_\mathcal{H}&=\sum_{i=1}^{r_1+r_2} \sum_{j=1}^s \alpha_i\beta_jk(x_i, x_j') \\&=\sum_{j=1}^s\beta_j \left(a\sum_{i=1}^{r_1} \alpha_{i, 1}k(x_{i, 1}, x_j')+b\sum_{i=1}^{r_2} \alpha_{i, 2}k(x_{i, 2}, x_j')\right) \\&=a\sum_{i=1}^{r_1}\sum_{j=1}^s\alpha_{i, 1}\beta_jk(x_{i, 1}, x_j')+b\sum_{i=1}^{r_2}\sum_{j=1}^s\alpha_{i, 2}\beta_jk(x_{i, 2}, x_j') \\&=a\langle f_1, g \rangle_\mathcal{H}+b\langle f_2, g \rangle_\mathcal{H}. \end{aligned}\]

  • Symmetry: Let \(f, g \in \mathcal{H}_0\) as above. Since \(k\) is symmetric, we have \[\langle f, g \rangle_\mathcal{H}=\sum_{i=1}^r\sum_{j=1}^s \alpha_i\beta_jk(x_i, x_j')=\sum_{j=1}^s\sum_{i=1}^r \beta_j\alpha_ik(x_j', x_i)=\langle g, f \rangle_\mathcal{H}.\]

  • Positivity: We have \[\langle f, f \rangle_\mathcal{H}=\sum_{i=1}^r\sum_{j=1}^r \alpha_i\alpha_jk(x_i, x_j) \geq 0\] since \(k\) is positive definite.

  • Positive definiteness: Suppose \(f=0\), then \[\langle f, f \rangle_\mathcal{H}=\sum_{i=1}^r \alpha_i\sum_{j=1}^r \alpha_jk(x_i, x_j)=\sum_{i=1}^r \alpha_if(x_i)=0.\] Conversely, suppose \(\langle f, f \rangle_\mathcal{H}=0\), i.e. \[\sum_{i=1}^r \sum_{j=1}^r \alpha_i\alpha_jk(x_i, x_j)=0\] for the given \(\{\alpha_i\}_{i=1}^r\) and \(\{x_i\}_{i=1}^r\). Suppose there exists \(x_{r+1} \in \mathcal{X}\) s.t. \[f(x_{r+1})=\sum_{i=1}^r \alpha_i k(x_i, x_{r+1}) \neq 0.\] Since \(k\) is positive definite, \[\begin{aligned} \sum_{i=1}^{r+1} \sum_{j=1}^{r+1} \alpha_i\alpha_jk(x_i, x_j)&=\sum_{i=1}^r \sum_{j=1}^r \alpha_i\alpha_jk(x_i, x_j)+2\alpha_{r+1}f(x_{r+1})+\alpha_{r+1}^2k(x_{r+1}, x_{r+1}) \\&=2\alpha_{r+1}f(x_{r+1})+\alpha_{r+1}^2k(x_{r+1}, x_{r+1}) \geq 0 \end{aligned}\] holds for any \(\alpha_{r+1}\). If \(k(x_{r+1}, x_{r+1})=0\), take \(\alpha_{r+1}=-f(x_{r+1})\); then \(-2f^2(x_{r+1}) \geq 0\), implying \(f(x_{r+1})=0\), a contradiction. If \(k(x_{r+1}, x_{r+1})>0\), take \[\alpha_{r+1}=\frac{-f(x_{r+1})}{k(x_{r+1}, x_{r+1})};\] then \[\frac{-f^2(x_{r+1})}{k(x_{r+1}, x_{r+1})} \geq 0,\] implying \(f(x_{r+1})=0\), a contradiction. Therefore, \(f(x)=0\) for all \(x \in \mathcal{X}\), i.e. \(f=0\). In conclusion, \(\langle f, f \rangle_\mathcal{H}=0\) iff \(f=0\).

Therefore, \(\langle \cdot, \cdot \rangle_\mathcal{H}\) is an inner product, and thus \(\mathcal{H}\) is a Hilbert space. Since \(\langle k(\cdot, x), k(\cdot, x') \rangle_\mathcal{H}=k(x, x')\) by construction, then \(k(x, x')=\langle \varphi(x), \varphi(x') \rangle_\mathcal{H}\) where \(\varphi(x)=k(\cdot, x)\). Hence, \(k(x, x')\) is a kernel.

We now need to show that \(k\) is a reproducing kernel.

  • \(k(\cdot, x) \in \mathcal{H}\) by definition.

  • For any \(f \in \mathcal{H}\), we have \[\begin{aligned} \langle f, k(\cdot, x) \rangle_\mathcal{H}&=\left\langle \sum_{i=1}^r \alpha_ik(\cdot, x_i), k(\cdot, x) \right\rangle_\mathcal{H} \\&=\sum_{i=1}^r \alpha_i\langle k(\cdot, x_i), k(\cdot, x) \rangle_\mathcal{H} \\&=\sum_{i=1}^r \alpha_ik(x_i, x) \\&=f(x). \end{aligned}\]

Hence \(k\) is a reproducing kernel of \(\mathcal{H}\).

\(\square\)

Combining Theorems 2.1, 3.1 and 3.2, we know the following concepts are equivalent: (1) kernel, (2) positive definite function, and (3) reproducing kernel.

4. RKHS as Hypothesis Class

4.1. Uniqueness of RKHS

Theorem 4.1    Each RKHS has a unique corresponding reproducing kernel.

Proof. Assume that an RKHS \(\mathcal{H}\) has two reproducing kernels \(k_1\) and \(k_2\). Using a combination of linearity and the reproducing property, we have for all \(f \in \mathcal{H}\) and \(x \in \mathcal{X}\), \[\langle f, k_1(\cdot, x)-k_2(\cdot, x) \rangle_\mathcal{H}=\langle f, k_1(\cdot, x) \rangle_\mathcal{H}-\langle f, k_2(\cdot, x) \rangle_\mathcal{H}=f(x)-f(x)=0.\] Let \(f=k_1(\cdot, x)-k_2(\cdot, x)\), then \(\|k_1(\cdot, x)-k_2(\cdot, x)\|_\mathcal{H}^2=0\) for all \(x \in \mathcal{X}\), which implies \(k_1=k_2\).

\(\square\)

The converse is also true, i.e. the RKHS for any kernel is unique, and thus we can denote the RKHS for kernel \(k\) as \(\mathcal{H}_k\).

4.2. Representer Theorem

We can consider an RKHS as a hypothesis class for empirical risk minimization (ERM). A typical and general setup is that we look for the function \(f^*\) in the RKHS \(\mathcal{H}_k\) which solves the regularized ERM problem \[\mathop{\arg\min}_{f \in \mathcal{H}_k} \widehat{R}(f)+\Omega(\|f\|_{\mathcal{H}_k}^2)\] for an empirical risk of the form \(\displaystyle \widehat{R}(f)=\frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i), x_i)\), which depends on \(f\) only through its values at the training points, and any non-decreasing function \(\Omega\).

Theorem 4.2 (Representer Theorem)    There is always a solution to \[f^*=\mathop{\arg\min}_{f \in \mathcal{H}_k} \widehat{R}(f)+\Omega(\|f\|_{\mathcal{H}_k}^2)\] that takes the form \[f^*=\sum_{i=1}^n a_ik(\cdot, x_i)\] where \(a_i \in \mathbb{R}\) and the \(x_i\) are the input datapoints. If \(\Omega\) is strictly increasing, all solutions have the form above.

Proof. Let \(f\) be a minimizer of \(g(f)=\widehat{R}(f)+\Omega(\|f\|_{\mathcal{H}_k}^2)\). Let \[f_s=\sum_{i=1}^n \alpha_i k(\cdot, x_i)\] be the projection of \(f\) onto the subspace \(\text{span}\{k(\cdot, x_i): i=1, \ldots, n\}\), and let \(f_\perp\) be the orthogonal component of \(f\), so that \(f=f_s+f_\perp\). Then \(\|f\|_{\mathcal{H}_k}^2=\|f_s\|_{\mathcal{H}_k}^2+\|f_\perp\|_{\mathcal{H}_k}^2 \geq \|f_s\|_{\mathcal{H}_k}^2\) and thus \(\Omega(\|f\|_{\mathcal{H}_k}^2) \geq \Omega(\|f_s\|_{\mathcal{H}_k}^2)\).

Moreover, \[f(x_i)=\langle f, k(\cdot, x_i) \rangle_{\mathcal{H}_k}=\langle f_s+f_\perp, k(\cdot, x_i) \rangle_{\mathcal{H}_k}=\langle f_s, k(\cdot, x_i) \rangle_{\mathcal{H}_k}=f_s(x_i).\] Then \(L(y_i, f(x_i), x_i)=L(y_i, f_s(x_i), x_i)\) and thus \(\widehat{R}(f)=\widehat{R}(f_s)\). Therefore, \(g(f_s) \leq g(f)\) and \(f_s\) is also a minimizer.

Now suppose \(\Omega\) is strictly increasing and \(\|f_\perp\|_{\mathcal{H}_k}^2>0\). Then \(\|f\|_{\mathcal{H}_k}^2>\|f_s\|_{\mathcal{H}_k}^2\), so \(\Omega(\|f\|_{\mathcal{H}_k}^2)>\Omega(\|f_s\|_{\mathcal{H}_k}^2)\) and \(g(f_s)<g(f)\), i.e. \(f\) is not a minimizer, which is a contradiction. Hence \(\|f_\perp\|_{\mathcal{H}_k}^2=0\), i.e. \(f=f_s\), and all solutions have the form \(\displaystyle \sum_{i=1}^n \alpha_ik(\cdot, x_i)\).

\(\square\)
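
As a concrete illustration of the representer theorem, take the squared loss and \(\Omega(t)=\lambda t\) (kernel ridge regression). For this choice, plugging \(f=\sum_i a_ik(\cdot, x_i)\) into the objective and setting the gradient to zero gives (when \(K\) is invertible) coefficients \(a=(K+n\lambda I)^{-1}y\), where \(K_{ij}=k(x_i, x_j)\). Below is a minimal NumPy sketch with a Gaussian kernel (bandwidth, regularization, and data are arbitrary choices for illustration):

```python
import numpy as np

# Kernel ridge regression: minimize (1/n) sum_i (y_i - f(x_i))^2 + lam ||f||_H^2.
# By the representer theorem f* = sum_i a_i k(., x_i) with a = (K + n lam I)^{-1} y.
def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * gamma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)         # noisy 1d regression data

lam = 1e-2
K = gaussian_kernel(X, X)
a = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), y)   # representer coefficients

X_test = np.linspace(-3, 3, 200)[:, None]
f_test = gaussian_kernel(X_test, X) @ a                 # f*(x) = sum_i a_i k(x, x_i)
```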

5. Kernel

5.1. Operations with Kernels

Theorem 5.1    Given kernels \(k_1\) and \(k_2\) on \(\mathcal{X}\) and constants \(\alpha_1, \alpha_2>0\), then \(k=\alpha_1k_1+\alpha_2k_2\) is a kernel on \(\mathcal{X}\).

Proof. For all \(n \geq 1\), \(a_i \in \mathbb{R}\) and \(x_i \in \mathcal{X}\), we have \[\sum_{i=1}^n \sum_{j=1}^n a_ia_jk(x_i, x_j)=\alpha_1\sum_{i=1}^n \sum_{j=1}^n a_ia_j k_1(x_i, x_j)+\alpha_2\sum_{i=1}^n \sum_{j=1}^n a_ia_j k_2(x_i, x_j) \geq 0\] and thus \(k\) is positive definite.

\(\square\)

\(k_1-k_2\) need not be a kernel.

Theorem 5.2    Given a map \(A: \mathcal{X} \to \widetilde{\mathcal{X}}\) and kernel \(k\) on \(\widetilde{\mathcal{X}}\), then \(k(A(x), A(x'))\) is a kernel on \(\mathcal{X}\).

Proof. If \(k\) is a kernel, then \(k(A(x), A(x'))=\langle \varphi(A(x)), \varphi(A(x')) \rangle_\mathcal{H}\), which is a kernel with feature map \(x \mapsto \varphi(A(x))\).

\(\square\)

Theorem 5.3    Given \(k_1\) on \(\mathcal{X}\) and \(k_2\) on \(\mathcal{Y}\), then \(k((x, y), (x', y'))=k_1(x, x')k_2(y, y')\) is a kernel on \(\mathcal{X} \times \mathcal{Y}\). Moreover, if \(\mathcal{X}=\mathcal{Y}\), then \(k(x, x')=k_1(x, x')k_2(x, x')\) is a kernel on \(\mathcal{X}\).
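
A quick numerical sanity check of Theorems 5.1 and 5.3 (a NumPy sketch with the linear and Gaussian kernels and arbitrary positive weights): the Gram matrices of the weighted sum and of the elementwise product remain positive semi-definite.

```python
import numpy as np

# Sums (Theorem 5.1) and products (Theorem 5.3) of kernels are kernels, so the
# corresponding Gram matrices stay positive semi-definite.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

K_lin = X @ X.T                                          # linear kernel
d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K_rbf = np.exp(-0.5 * d2)                                # Gaussian kernel

K_sum = 2.0 * K_lin + 0.5 * K_rbf                        # alpha_1 k_1 + alpha_2 k_2
K_prod = K_lin * K_rbf                                   # elementwise (product) kernel

print(np.linalg.eigvalsh(K_sum).min(), np.linalg.eigvalsh(K_prod).min())  # both ~ >= 0
```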

5.2. Matérn Kernel

Using these operations on kernels, we can show that the Gaussian kernel (also known as the RBF kernel, squared exponential kernel, or exponentiated quadratic kernel) is a valid kernel on \(\mathbb{R}^p\): \[k(x, x'):=\exp\left(-\frac{1}{2\gamma^2}\|x-x'\|^2\right).\] The RKHS of the Gaussian kernel is infinite-dimensional. Moreover, if the domain \(\mathcal{X}\) is a compact subset of \(\mathbb{R}^p\), its RKHS is dense in the space of all bounded continuous functions w.r.t. the uniform norm. All functions in its RKHS are infinitely differentiable, and the Gaussian kernel is often considered to be excessively smooth.

A less smooth alternative is the Matérn kernel: \[k(x, x')=\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}}{\gamma}\|x-x'\|\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}}{\gamma}\|x-x'\|\right)\] where \(\nu>0\), \(\gamma>0\), and \(K_\nu\) is the modified Bessel function of the second kind of order \(\nu\). The Matérn kernel corresponding to the value \(\displaystyle \nu=s+\frac{1}{2}\) for a non-negative integer \(s\) takes a simpler form; the corresponding functions are \(s\) times differentiable (differentiability of order \(s\) holds iff \(\nu>s\)). For example:

  • when \(\displaystyle \nu=\frac{1}{2}\), \(\displaystyle k(x, x')=\exp\left(-\frac{1}{\gamma}\|x-x'\|\right)\);

  • when \(\displaystyle \nu=\frac{3}{2}\), \(\displaystyle k(x, x')=\left(1+\frac{\sqrt{3}}{\gamma}\|x-x'\|\right)\exp\left(-\frac{\sqrt{3}}{\gamma}\|x-x'\|\right)\).

Moreover, the RKHS norm \(\|f\|_{\mathcal{H}_k}^2\) directly penalizes the derivatives of \(f\). For example, when \(\displaystyle \nu=\frac{3}{2}\), \[\|f\|_{\mathcal{H}_k}^2 \propto \int_\mathcal{X} f''(x)^2\text{d}x+\frac{6}{\gamma^2}\int_\mathcal{X} f'(x)^2\text{d}x+\frac{9}{\gamma^4}\int_\mathcal{X} f(x)^2\text{d}x.\]

As \(\nu \to \infty\), the Matérn kernel converges to the Gaussian (RBF) kernel.
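
The closed forms above are easy to compare directly. The following sketch (NumPy, arbitrary length-scale \(\gamma=1\)) evaluates the Matérn kernels for \(\nu=\frac{1}{2}\) and \(\nu=\frac{3}{2}\) alongside the Gaussian kernel they approach as \(\nu \to \infty\):

```python
import numpy as np

# Closed-form Matern kernels for nu = 1/2 and nu = 3/2, compared with the
# Gaussian (RBF) kernel.  r denotes the distance ||x - x'||, gamma the length-scale.
def matern_half(r, gamma=1.0):
    return np.exp(-r / gamma)

def matern_three_halves(r, gamma=1.0):
    s = np.sqrt(3.0) * r / gamma
    return (1.0 + s) * np.exp(-s)

def rbf(r, gamma=1.0):
    return np.exp(-r**2 / (2 * gamma**2))

r = np.linspace(0.0, 3.0, 7)
print(matern_half(r))           # nu = 1/2: exponential (Laplace-type) kernel
print(matern_three_halves(r))   # nu = 3/2: once-differentiable behaviour
print(rbf(r))                   # infinitely differentiable limit
```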

5.3. Other Kernels

  • Constant: \(k(x, x')=c\).

  • Linear: \(k(x, x')=x^\top x'\).

  • Polynomial: \(k(x, x')=(c+x^\top x')^m\), where \(c \in \mathbb{R}\) and \(m \in \mathbb{N}\). If \(m=1\), this is the affine kernel.

  • Periodic (1d): \(\displaystyle k(x, x')=\exp\left(-\frac{2\sin^2(\pi|x-x'|/p)}{\gamma^2}\right)\), where \(p\) is the period and \(\gamma>0\).

  • Laplace: \(\displaystyle k(x, x')=\exp\left(-\frac{1}{\gamma}\|x-x'\|\right)\), where \(\gamma>0\).

  • Rational quadratic: \(\displaystyle k(x, x')=\left(1+\frac{\|x-x'\|^2}{2\alpha\gamma^2}\right)^{-\alpha}\), where \(\alpha>0\) and \(\gamma>0\).

The rational quadratic kernel is a scale mixture of Gaussian kernels. In particular, consider the Gaussian parametrization \[k_\theta(x, x')=\exp(-\theta\|x-x'\|^2)\] and a Gamma density placed on \(\theta\), i.e. \[p(\theta)=\frac{\beta^\alpha}{\Gamma(\alpha)}\theta^{\alpha-1}\exp(-\beta\theta)\] with shape \(\alpha\) and rate \(\beta\). We then define \[\begin{aligned}\kappa(x, x')&=\int_0^\infty k_\theta(x, x')p(\theta)\text{d}\theta\\&=\frac{\beta^\alpha}{\Gamma(\alpha)}\int_0^\infty \exp(-\theta(\|x-x'\|^2+\beta))\theta^{\alpha-1}\text{d}\theta\\&=\frac{\beta^\alpha}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha)}{(\|x-x'\|^2+\beta)^\alpha}\\&=\left(1+\frac{\|x-x'\|^2}{\beta}\right)^{-\alpha}.\end{aligned}\]

Setting \(\beta=2\alpha\gamma^2\) and letting \(\alpha \to \infty\), we again recover the Gaussian kernel.
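
The scale-mixture identity can also be checked numerically. The sketch below (NumPy; \(\alpha\), \(\beta\) and the squared distance are arbitrary test values) approximates the integral over \(\theta\) with a Riemann sum and compares it with the closed form \(\left(1+\|x-x'\|^2/\beta\right)^{-\alpha}\):

```python
import numpy as np
from math import gamma as gamma_fn

# Numerical check that the rational quadratic kernel is a Gamma-scale mixture of
# Gaussian kernels k_theta(x, x') = exp(-theta ||x - x'||^2).
alpha, beta, d2 = 2.0, 3.0, 1.7                          # shape, rate, squared distance

theta = np.linspace(1e-8, 60.0, 300_000)                 # integration grid over theta
gamma_pdf = beta**alpha / gamma_fn(alpha) * theta**(alpha - 1) * np.exp(-beta * theta)
mixture = np.sum(np.exp(-theta * d2) * gamma_pdf) * (theta[1] - theta[0])  # Riemann sum

closed_form = (1.0 + d2 / beta) ** (-alpha)
print(mixture, closed_form)                              # the two values agree closely
```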

6. Representation of Probability in RKHS

We can represent a probability distribution \(P\) in the RKHS by considering the kernel mean embedding \[\mu_k(P)=\mathbb{E}_{X \sim P}k(\cdot, X) \in \mathcal{H}_k.\] Using the reproducing property, taking an inner product with the kernel mean embedding maps a function \(f \in \mathcal{H}_k\) to its mean w.r.t. \(P\): \[\begin{aligned} \langle \mu_k(P), f \rangle_{\mathcal{H}_k}&=\langle \mathbb{E}_{X \sim P}k(\cdot, X), f \rangle_{\mathcal{H}_k} \\&=\mathbb{E}_{X \sim P}\langle k(\cdot, X), f \rangle_{\mathcal{H}_k} \\&=\mathbb{E}_{X \sim P}f(X) \end{aligned}\] for all \(f \in \mathcal{H}_k\).

Inner products between kernel mean embeddings can be computed as \[\langle \mu_k(P), \mu_k(Q) \rangle_{\mathcal{H}_k}=\langle \mathbb{E}_{X \sim P}k(\cdot, X), \mathbb{E}_{Y \sim Q}k(\cdot, Y) \rangle_{\mathcal{H}_k}=\mathbb{E}_{X \sim P, Y \sim Q}k(X, Y).\]

Here \(X \sim P\) and \(Y \sim Q\) are drawn independently from the respective marginal distributions, not from a joint distribution.
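
As a small Monte Carlo illustration of the reproducing-property identity above, take \(f=k(\cdot, z)\), so that \(\langle \mu_k(P), f \rangle_{\mathcal{H}_k}=\mathbb{E}_{X \sim P}k(X, z)\). For a Gaussian kernel with bandwidth \(\gamma\) and \(P=N(0, 1)\), this expectation equals \(\frac{\gamma}{\sqrt{\gamma^2+1}}\exp\left(-\frac{z^2}{2(\gamma^2+1)}\right)\) (a standard Gaussian integral), which the sketch below compares against the empirical embedding:

```python
import numpy as np

# Empirical kernel mean embedding: mu_hat(P) = (1/n) sum_i k(., x_i), so
# <mu_hat(P), k(., z)>_H = (1/n) sum_i k(x_i, z), compared with the exact
# expectation E_{X ~ N(0,1)} k(X, z) for the Gaussian kernel.
gamma, z = 1.0, 0.7
rng = np.random.default_rng(0)
X = rng.normal(size=200_000)                             # samples from P = N(0, 1)

empirical = np.mean(np.exp(-(X - z)**2 / (2 * gamma**2)))      # <mu_hat(P), k(., z)>
closed_form = gamma / np.sqrt(gamma**2 + 1) * np.exp(-z**2 / (2 * (gamma**2 + 1)))
print(empirical, closed_form)                            # the two values agree closely
```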

6.1. Maximum Mean Discrepancy

One of the most powerful uses of kernel mean embedding is to measure discrepancy between distributions by considering their distance in RKHS norm, which is called maximum mean discrepancy (MMD): \[\begin{aligned} \text{MMD}_k(P, Q)&=\|\mu_k(P)-\mu_k(Q)\|_{\mathcal{H}_k} \\&=\sqrt{\langle \mu_k(P), \mu_k(P) \rangle_{\mathcal{H}_k}+\langle \mu_k(Q), \mu_k(Q) \rangle_{\mathcal{H}_k}-2\langle \mu_k(P), \mu_k(Q) \rangle_{\mathcal{H}_k}} \\&=\sqrt{\mathbb{E}_{X \sim P, X' \sim P}k(X, X')+\mathbb{E}_{Y \sim Q, Y' \sim Q}k(Y, Y')-2\mathbb{E}_{X \sim P, Y \sim Q}k(X, Y)} \end{aligned}\] where \(X\) and \(X'\) are independent random samples from \(P\), and \(Y\) and \(Y'\) are independent random samples from \(Q\). For characteristic kernels, the MMD is a proper metric on probability distributions, i.e. \[\text{MMD}_k(P, Q)=0 \Leftrightarrow P=Q.\]

The name MMD comes from the following interpretation: it can also be written as the largest discrepancy between expectations of unit-norm RKHS functions w.r.t. the two distributions: \[\text{MMD}_k(P, Q)=\sup_{f \in \mathcal{H}_k: \|f\|_{\mathcal{H}_k} \leq 1} |\mathbb{E}_{X \sim P}f(X)-\mathbb{E}_{Y \sim Q}f(Y)|.\] The worst-case \(f\) (i.e. the \(f\) that attains the supremum) turns out to be proportional to the witness function \(\mu_k(P)-\mu_k(Q)\).

Given sets of independent samples \(\{x_i\}_{i=1}^{n_x} \overset{\text{i.i.d.}}{\sim} P\) and \(\{y_i\}_{i=1}^{n_y} \overset{\text{i.i.d.}}{\sim} Q\), a simple unbiased estimator of the squared MMD is given by \[\widehat{\text{MMD}_k^2}(P, Q)=\frac{1}{n_x(n_x-1)}\sum_{i \neq j} k(x_i, x_j)+\frac{1}{n_y(n_y-1)}\sum_{i \neq j} k(y_i, y_j)-\frac{2}{n_xn_y}\sum_{i=1}^{n_x} \sum_{j=1}^{n_y} k(x_i, y_j).\]
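
A minimal NumPy sketch of this unbiased estimator (Gaussian kernel, arbitrary bandwidth), applied to two Gaussian samples that differ by a mean shift:

```python
import numpy as np

# Unbiased estimator of the squared MMD, following the formula above
# (diagonal terms are excluded from the within-sample averages).
def gaussian_kernel(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * gamma**2))

def mmd2_unbiased(X, Y, gamma=1.0):
    Kxx = gaussian_kernel(X, X, gamma)
    Kyy = gaussian_kernel(Y, Y, gamma)
    Kxy = gaussian_kernel(X, Y, gamma)
    nx, ny = len(X), len(Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (nx * (nx - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (ny * (ny - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(500, 1))                 # samples from P
X2 = rng.normal(0.0, 1.0, size=(500, 1))                 # independent samples from P
Y = rng.normal(0.5, 1.0, size=(500, 1))                  # samples from a shifted Q
print(mmd2_unbiased(X1, X2))                             # close to zero (P vs. P)
print(mmd2_unbiased(X1, Y))                              # clearly positive (P vs. Q)
```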

6.2. Hilbert-Schmidt Independence Criterion

Another use of kernel embedding is in measuring dependence between r.v.s. taking values in some generic domains (e.g. random vectors, strings, or graphs).

For any kernels \(k_\mathcal{X}\) and \(k_\mathcal{Y}\) on the respective domains \(\mathcal{X}\) and \(\mathcal{Y}\), we can define \(k=k_\mathcal{X} \otimes k_\mathcal{Y}\) given by \[k((x, y), (x', y'))=k_\mathcal{X}(x, x')k_\mathcal{Y}(y, y'),\] which is a valid kernel on the product domain \(\mathcal{X} \times \mathcal{Y}\) by Theorem 5.3. The canonical feature map is \(\varphi(x, y)=k_\mathcal{X}(\cdot, x) \otimes k_\mathcal{Y}(\cdot, y)\), which is a function on \(\mathcal{X} \times \mathcal{Y}\), i.e. \[(\varphi(x, y))(x', y')=k_\mathcal{X}(x', x)k_\mathcal{Y}(y', y).\]

Definition 6.1    Let \(X\) and \(Y\) be r.v.s. on domains \(\mathcal{X}\) and \(\mathcal{Y}\) (non-empty topological spaces). Let \(k_\mathcal{X}\) and \(k_\mathcal{Y}\) be kernels on \(\mathcal{X}\) and \(\mathcal{Y}\) respectively. The Hilbert-Schmidt independence criterion (HSIC) \(\Xi_{k_\mathcal{X}, k_\mathcal{Y}}(X, Y)\) of \(X\) and \(Y\) is the squared MMD between the joint measure \(P_{XY}\) and the product of marginals \(P_XP_Y\), computed with the product kernel \(k=k_\mathcal{X} \otimes k_\mathcal{Y}\), i.e. \[\begin{aligned} \Xi_{k_\mathcal{X}, k_\mathcal{Y}}(X, Y)&=\text{MMD}_k^2(P_{XY}, P_XP_Y) \\&=\|\mu_k(P_{XY})-\mu_k(P_XP_Y)\|_{\mathcal{H}_k}^2 \\&=\|\mathbb{E}_{XY}[k_\mathcal{X}(\cdot, X) \otimes k_\mathcal{Y}(\cdot, Y)]-\mathbb{E}_Xk_\mathcal{X}(\cdot, X) \otimes \mathbb{E}_Yk_\mathcal{Y}(\cdot, Y)\|_{\mathcal{H}_k}^2. \end{aligned}\]

\(\Xi_{k_\mathcal{X}, k_\mathcal{Y}}(X, Y)=0\) if \(X\) and \(Y\) are independent, while for many common choices of kernel the converse also holds (i.e. \(\Xi_{k_\mathcal{X}, k_\mathcal{Y}}(X, Y)=0\) implies independence). The larger the HSIC is, the more dependent the variables \(X\) and \(Y\) are.

The HSIC can be estimated via the following process:

  1. Draw \(m\) independent samples \(\{(x_i, y_i)\}_{i=1}^m\) from the joint distribution \(P_{XY}\).

  2. Construct the kernel matrices \(K_{ij}=k_\mathcal{X}(x_i, x_j)\) and \(L_{ij}=k_\mathcal{Y}(y_i, y_j)\).

  3. Center the kernel matrices by computing \(\widetilde{K}=HKH\) and \(\widetilde{L}=HLH\) where \(\displaystyle H=I-\frac{1}{m}J\), \(J=\mathbf{1}\mathbf{1}^\top\), and \(\mathbf{1}\) is the \(m\)-dimensional vector of ones.

  4. Construct the estimate (note that \(\widetilde{K}\) and \(\widetilde{L}\) are symmetric): \[\widehat{\Xi_{k_\mathcal{X}, k_\mathcal{Y}}}(X, Y)=\frac{1}{m^2}\text{tr}(\widetilde{K}\widetilde{L})=\frac{1}{m^2}\sum_{i=1}^m \sum_{j=1}^m \widetilde{K}_{ij}\widetilde{L}_{ij}.\]
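
A minimal NumPy sketch of this procedure (Gaussian kernels on both domains with arbitrary bandwidths; the data and function names are illustrative):

```python
import numpy as np

# Empirical HSIC following the four steps above.
def gaussian_gram(X, gamma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * gamma**2))

def hsic(X, Y, gamma_x=1.0, gamma_y=1.0):
    m = len(X)
    K = gaussian_gram(X, gamma_x)                        # step 2: kernel matrices
    L = gaussian_gram(Y, gamma_y)
    H = np.eye(m) - np.ones((m, m)) / m                  # step 3: centering matrix
    K_t, L_t = H @ K @ H, H @ L @ H
    return np.trace(K_t @ L_t) / m**2                    # step 4: (1/m^2) tr(K~ L~)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
Y_dep = X + 0.1 * rng.normal(size=(300, 1))              # strongly dependent on X
Y_ind = rng.normal(size=(300, 1))                        # independent of X
print(hsic(X, Y_dep), hsic(X, Y_ind))                    # large vs. near zero
```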