
Support Vector Machine

1. Support Vector Machine (SVM)

  • SVM is a class of linear models mainly for classification.

  • The principle of SVM is to find a hyperplane that maximizes the margin of separation between points of different classes, while penalizing points that are misclassified or that fall inside the margin.

  • SVM is easy to train since its empirical risk minimization formulation is convex and satisfies strong duality.

2. Linearly Separable Case

The margin of the hyperplane is defined as twice the distance to the closest point: \[\text{margin}=2\min_i\left|\frac{\mathbf{w}^\top}{\|\mathbf{w}\|}\mathbf{x}_i+\frac{b}{\|\mathbf{w}\|}\right|.\] SVM is based on choosing the hyperplane that maximizes the margin, so that the points are as far as possible from the decision boundary and are therefore least sensitive to small perturbations of the data or the boundary.

The optimal hyperplane will always be halfway between the bounding points for each class, so the margin at the solution will be equal to the distance between the two closest points of each class in the direction of \(\mathbf{w}\).
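As a quick numerical illustration (toy data and a hypothetical hyperplane, not from the notes), the margin formula above can be evaluated directly in numpy:

```python
import numpy as np

# Hypothetical toy data and hyperplane parameters, for illustration only.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
w = np.array([1.0, 1.0])
b = 0.0

# margin = 2 * min_i | (w^T x_i + b) / ||w|| |
distances = np.abs(X @ w + b) / np.linalg.norm(w)
margin = 2 * distances.min()
print(margin)  # twice the distance to the closest point
```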

Proposition 2.1    The margin maximization problem can be expressed as the following quadratic program: \[\begin{aligned} \min_{\mathbf{w}, b} &\frac{1}{2} \|\mathbf{w}\|^2 \\ \text{s.t.}\ &y_i(\mathbf{w}^\top\mathbf{x}_i+b) \geq 1. \end{aligned}\]

Proof. Let \(\mathbf{x}_+\) and \(\mathbf{x}_-\) be the closest points of the two classes measured along \(\mathbf{w}\), so that \(\text{margin}=\|\mathbf{x}_+-\mathbf{x}_-\|\). Note that \((\beta\mathbf{w})^\top\mathbf{x}+\beta b=0\) defines the same hyperplane for every \(\beta \neq 0\), so we may rescale \((\mathbf{w}, b)\) and suppose w.l.o.g. that \[\min_i y_i(\mathbf{w}^\top\mathbf{x}_i+b)=1.\] Then \(\mathbf{w}^\top\mathbf{x}_++b=1\) and \(\mathbf{w}^\top\mathbf{x}_-+b=-1\), so \(\mathbf{w}^\top(\mathbf{x}_+-\mathbf{x}_-)=2\). Since \(\mathbf{x}_+-\mathbf{x}_-\) is parallel to \(\mathbf{w}\), this gives \(\|\mathbf{w}\| \cdot \|\mathbf{x}_+-\mathbf{x}_-\|=2\). Hence, \[\text{margin}=\frac{2}{\|\mathbf{w}\|}.\] Then we want to solve \[\begin{aligned} \max_{\mathbf{w}, b}\ &\frac{2}{\|\mathbf{w}\|} \\ \text{s.t.}\ &\min_i y_i(\mathbf{w}^\top\mathbf{x}_i+b)=1. \end{aligned}\]

Equivalently, we solve \[\begin{aligned} \min &\frac{1}{2}\|\mathbf{w}\|^2 \\ \text{s.t.}\ &y_i(\mathbf{w}^\top\mathbf{x}_i+b) \geq 1. \end{aligned}\]

\(\square\)
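The quadratic program of Proposition 2.1 can be handed to any off-the-shelf convex solver. Below is a minimal sketch using cvxpy, assuming X is an \(n \times p\) data matrix and y a vector of \(\pm 1\) labels for linearly separable data (names are illustrative):

```python
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve min 0.5 ||w||^2  s.t.  y_i (w^T x_i + b) >= 1 for all i."""
    n, p = X.shape
    w = cp.Variable(p)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()
    return w.value, b.value
```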

3. C-SVM

In general, the data will not be linearly separable, i.e., some points will lie on the wrong side of the decision boundary, so a measure of error is needed. SVM uses the hinge loss \[h(\alpha)=(1-\alpha)_+=\begin{cases} 1-\alpha,&1-\alpha>0 \\ 0, &\text{otherwise,} \end{cases}\] which induces sparse solutions, i.e., \(\mathbf{w}\) and \(b\) will be completely determined by a small number of data points known as the support vectors.
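For concreteness, a minimal numpy sketch of the hinge loss (the function name is an assumption):

```python
import numpy as np

def hinge(a):
    """Hinge loss h(a) = (1 - a)_+ = max(0, 1 - a), applied elementwise."""
    return np.maximum(0.0, 1.0 - a)

# hinge(np.array([2.0, 0.5, -1.0]))  ->  array([0. , 0.5, 2. ])
```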

The optimization problem is given by \[\min_{\mathbf{w}, b} \left(\frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^n h(y_i(\mathbf{w}^\top\mathbf{x}_i+b))\right)\] where \(C\) controls the tradeoff between maximum margin and loss.

By applying a rescaling, we can see that the optimization problem can be viewed as a regularized empirical risk minimization problem: \[\min_{\mathbf{w}, b} \left(\frac{1}{2nC}\|\mathbf{w}\|^2+\frac{1}{n}\sum_{i=1}^n h(y_i(\mathbf{w}^\top\mathbf{x}_i+b))\right)\] where the second term is an empirical risk from margin errors, and the first term is a regularizer that encourages large margins with scaling \((2nC)^{-1}\).
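A short numerical sketch of the rescaling argument (helper names are hypothetical): the two objectives differ only by the constant factor \(nC\), so they share the same minimizers.

```python
import numpy as np

def csvm_objective(w, b, X, y, C):
    """Original C-SVM objective: 0.5 ||w||^2 + C * sum_i h(y_i (w^T x_i + b))."""
    losses = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * losses.sum()

def erm_objective(w, b, X, y, C):
    """Rescaled form: (2nC)^{-1} ||w||^2 + (1/n) * sum_i h(y_i (w^T x_i + b))."""
    n = len(y)
    losses = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return (w @ w) / (2 * n * C) + losses.mean()

# For any (w, b): erm_objective == csvm_objective / (n * C), so the minimizers coincide.
```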

Using the substitution \(\xi_i=h(y_i(\mathbf{w}^\top\mathbf{x}_i+b))\), we equivalently solve \[\begin{aligned} \min_{\mathbf{w}, b, \boldsymbol{\xi}}&\left(\frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^n \xi_i\right)\\ \text{s.t.}\ &\xi_i \geq 0, \\&y_i(\mathbf{w}^\top\mathbf{x}_i+b) \geq 1-\xi_i, \end{aligned}\] which is known as the C-SVM. The Lagrangian is \[\mathcal{L}(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\lambda})=\frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^n \xi_i+\sum_{i=1}^n \alpha_i (1-\xi_i-y_i(\mathbf{w}^\top\mathbf{x}_i+b))+\sum_{i=1}^n \lambda_i(-\xi_i)\] with dual variable constraints \(\alpha_i \geq 0\) and \(\lambda_i \geq 0\). Let \(\displaystyle \frac{\partial \mathcal{L}}{\partial \mathbf{w}}=0\), \(\displaystyle \frac{\partial \mathcal{L}}{\partial b}=0\) and \(\displaystyle \frac{\partial \mathcal{L}}{\partial \xi_i}=0\), then \(\displaystyle \mathbf{w}=\sum_{i=1}^n \alpha_iy_i\mathbf{x}_i\), \(\displaystyle \sum_{i=1}^n \alpha_iy_i=0\) and \(\alpha_i=C-\lambda_i\).
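Before deriving the dual, note that the C-SVM primal with explicit slack variables is itself a quadratic program and can be solved directly; a minimal cvxpy sketch under the same assumptions as before:

```python
import cvxpy as cp

def c_svm_primal(X, y, C=1.0):
    """Solve min 0.5 ||w||^2 + C * sum(xi)  s.t.  xi >= 0, y_i (w^T x_i + b) >= 1 - xi_i."""
    n, p = X.shape
    w = cp.Variable(p)
    b = cp.Variable()
    xi = cp.Variable(n)
    constraints = [xi >= 0, cp.multiply(y, X @ w + b) >= 1 - xi]
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return w.value, b.value, xi.value
```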

The derivatives w.r.t. \(b\) and \(\xi_i\) yield expressions that are independent of the primal variables (\(\mathbf{w}\), \(b\) and \(\boldsymbol{\xi}\)), which form equality constraints for the dual variables to be feasible. If those equality constraints do not hold, \[g(\boldsymbol{\alpha}, \boldsymbol{\lambda})=\inf_{\mathbf{w}, b, \boldsymbol{\xi}} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\lambda})=-\infty.\] For example, if \(\displaystyle \sum_{i=1}^n \alpha_iy_i \neq 0\), then sending \(b\) to \(+\infty\) or \(-\infty\) (with the sign chosen according to the sign of \(\sum_{i=1}^n \alpha_iy_i\)) drives \(\mathcal{L}\) to \(-\infty\).

We therefore assume equality constraints hold when calculating the dual function form, before adding them back in as constraints to the dual problem itself.

Therefore, the dual function is \[\begin{aligned} g(\boldsymbol{\alpha}, \boldsymbol{\lambda})&=\inf_{\mathbf{w}, b, \boldsymbol{\xi}}\left(\frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^n \xi_i-\mathbf{w}^\top\sum_{i=1}^n \alpha_iy_i\mathbf{x}_i+\sum_{i=1}^n \alpha_i(1-\xi_i-y_ib)-\sum_{i=1}^n \lambda_i\xi_i\right) \\&=\inf_{b, \boldsymbol{\xi}}\left(\frac{1}{2}\|\mathbf{w}\|^2+C\sum_{i=1}^n\xi_i-\mathbf{w}^\top\mathbf{w}+\sum_{i=1}^n \alpha_i-\sum_{i=1}^n \alpha_i\xi_i-\sum_{i=1}^n(C-\alpha_i)\xi_i\right) \\&=\inf_{b, \boldsymbol{\xi}}\left(\sum_{i=1}^n \alpha_i-\frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i\alpha_jy_iy_j\mathbf{x}_i^\top\mathbf{x}_j\right) \\&=\sum_{i=1}^n \alpha_i-\frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i\alpha_jy_iy_j\mathbf{x}_i^\top\mathbf{x}_j. \end{aligned}\] The goal is to maximize the dual \[g(\boldsymbol{\alpha})=\sum_{i=1}^n \alpha_i-\frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i\alpha_jy_iy_j\mathbf{x}_i^\top\mathbf{x}_j\] s.t. \(\displaystyle \sum_{i=1}^n \alpha_iy_i=0\) and \(0 \leq \alpha_i \leq C\).

We can then derive the variables for our hyperplane by taking \[\mathbf{w}=\sum_{i=1}^n \alpha_iy_i\mathbf{x}_i \quad \text{and} \quad b=y_{i_{\text{margin}}}-\mathbf{w}^\top\mathbf{x}_{i_{\text{margin}}}\] where \(i_\text{margin}\) is any index of a margin support vector, i.e. any \(i\) s.t. \(0<\alpha_i<C\).
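The dual is a quadratic program in \(\boldsymbol{\alpha}\) alone. A hedged cvxpy sketch that solves it and recovers \((\mathbf{w}, b)\) from a margin support vector (tolerances and names are illustrative):

```python
import cvxpy as cp
import numpy as np

def c_svm_dual(X, y, C=1.0, tol=1e-6):
    """Maximize sum_i alpha_i - 0.5 ||sum_i alpha_i y_i x_i||^2
    subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    n, p = X.shape
    alpha = cp.Variable(n)
    # ||sum_i alpha_i y_i x_i||^2 equals the double sum alpha_i alpha_j y_i y_j x_i^T x_j.
    quad = cp.sum_squares(X.T @ cp.multiply(alpha, y))
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * quad)
    constraints = [alpha >= 0, alpha <= C, cp.sum(cp.multiply(alpha, y)) == 0]
    cp.Problem(objective, constraints).solve()
    a = alpha.value

    # Primal recovery: w from the stationarity condition, b from any margin support vector.
    w = X.T @ (a * y)
    i_margin = np.where((a > tol) & (a < C - tol))[0][0]
    b = y[i_margin] - X[i_margin] @ w
    return w, b, a
```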

3.1. Support Vectors

Using complementary slackness, we can show that every data point falls into one of the following categories (a short code sketch follows the list):

  • Non-support vector: Suppose \(\alpha_i=0\) (unused data point). Then \(\lambda_i=C-\alpha_i>0\) and thus \(\xi_i=0\) (as \(\lambda_i\xi_i=0\)). Hence, \(y_i(\mathbf{w}^\top\mathbf{x}_i+b) \geq 1\), i.e., \(\mathbf{x}_i\) is on the correct side of the margin.

  • Margin support vector: Suppose \(0<\alpha_i<C\), then \(y_i(\mathbf{w}^\top\mathbf{x}_i+b)=1-\xi_i\). Since \(\lambda_i=C-\alpha_i>0\), then \(\xi_i=0\), and thus \(y_i(\mathbf{w}^\top\mathbf{x}_i+b)=1\), i.e., \(\mathbf{x}_i\) is on the margin boundary.

  • Non-margin support vector (margin error): Suppose \(\alpha_i=C>0\), then \(y_i(\mathbf{w}^\top\mathbf{x}_i+b)=1-\xi_i\). Since \(\lambda_i=C-\alpha_i=0\), then \(\xi_i \geq 0\), and thus \(y_i(\mathbf{w}^\top\mathbf{x}_i+b) \leq 1\), i.e., \(\mathbf{x}_i\) contributes to margin error.
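A small sketch (thresholds are an assumption, since solvers return \(\alpha_i\) only up to numerical tolerance) that sorts points into the three categories above from the dual solution:

```python
import numpy as np

def categorize_points(alpha, C, tol=1e-6):
    """Split indices by dual coefficient: non-SV, margin SV, or non-margin SV (margin error)."""
    alpha = np.asarray(alpha)
    non_sv = np.where(alpha <= tol)[0]                          # alpha_i = 0
    margin_sv = np.where((alpha > tol) & (alpha < C - tol))[0]  # 0 < alpha_i < C
    margin_error = np.where(alpha >= C - tol)[0]                # alpha_i = C
    return non_sv, margin_sv, margin_error
```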

3.2. Insight

  • The solution is sparse, i.e., points that neither lie on the margin nor contribute to a margin error have \(\alpha_i=0\). If \(\alpha_i \neq 0\), then the point is a support vector.

  • Since \(\alpha_i \geq 0\), points in class \(y_i=+1\) enter \(\mathbf{w}=\sum_{i=1}^n \alpha_iy_i\mathbf{x}_i\) with non-negative coefficients \(\alpha_iy_i\), and points in class \(y_i=-1\) with non-positive coefficients.

  • The influence of any single data point is bounded, since no weight \(\alpha_i\) can exceed \(C\).

  • Even if \(p>n\) (more features than observations), \(\mathbf{w} \in \text{span}\{\mathbf{x}_i: i=1, \ldots, n\}\).

4. Multi-Class Classification

SVM does not directly generalize to multi-class classification because it is based on learning a single hyperplane. However, SVM can still be applied to multi-class classification problems by reducing them to a series of binary classification problems.

For example, given \(K\) classes, we perform a one-versus-the-rest binary classification for each class \(k \in \{1, \ldots, K\}\), yielding \(\mathbf{w}_k\) and \(b_k\), and then classify according to \[\widehat{y}(\mathbf{x})=\mathop{\arg\max}_k (\mathbf{w}_k^\top\mathbf{x}+b_k).\]
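A hedged sketch of the one-versus-the-rest reduction, reusing the hypothetical c_svm_primal helper sketched earlier (any binary SVM trainer would do):

```python
import numpy as np

def one_vs_rest_svm(X, y, classes, C=1.0):
    """Train one binary C-SVM per class (class k vs. the rest)."""
    models = []
    for k in classes:
        y_k = np.where(y == k, 1.0, -1.0)      # relabel: class k -> +1, the rest -> -1
        w_k, b_k, _ = c_svm_primal(X, y_k, C)  # hypothetical binary trainer sketched in Section 3
        models.append((w_k, b_k))
    return models

def ovr_predict(models, classes, x):
    """Classify x by the largest score w_k^T x + b_k."""
    scores = [w @ x + b for (w, b) in models]
    return classes[int(np.argmax(scores))]
```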

Alternatively, we can train \(K(K-1)/2\) one-versus-one classifiers, one for each pair of classes, and then use a max-wins voting strategy to choose the class.