
Duality in Convex Optimization

1. Background

Much of machine learning requires performing optimization subject to constraints on the variables. Every constrained optimization problem has an associated dual problem, which is always convex; this dual formulation is the basis of the support vector machine (SVM).

2. Lagrangian

Consider a general constrained optimization problem with objective function \(f_0: \mathbb{R}^n \to \mathbb{R}\), \(m\) inequality constraints and \(r\) equality constraints: \[\begin{align}\begin{aligned} \min\ &f_0(\mathbf{x}) \\ \text{s.t.}\ &f_i(\mathbf{x}) \leq 0, & i=1, \ldots, m, \\ &h_j(\mathbf{x})=0, & j=1, \ldots, r, \end{aligned}\tag{P}\end{align}\] which is known as the primal problem. We denote its optimal value by \(p^*=f_0(\mathbf{x}^*)\), and let \[\mathcal{D}=\bigcap_{i=0}^m \text{dom}(f_i) \cap \bigcap_{j=1}^r \text{dom}(h_j) \subseteq \mathbb{R}^n\] be the domain of the problem, i.e. the set on which \(f_0\), all \(f_i\) and all \(h_j\) are defined; we assume \(\mathcal{D}\) is nonempty. Any \(\mathbf{x} \in \mathcal{D}\) with \(f_i(\mathbf{x}) \leq 0\) for all \(i\) and \(h_j(\mathbf{x})=0\) for all \(j\) is known as a primal feasible point.

The Lagrangian \(\mathcal{L}: \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^r \to \mathbb{R}\) is \[\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})=f_0(\mathbf{x})+\sum_{i=1}^m \lambda_if_i(\mathbf{x})+\sum_{j=1}^r \nu_jh_j(\mathbf{x})\] where \(\boldsymbol{\lambda} \in \mathbb{R}^m\) and \(\boldsymbol{\nu} \in \mathbb{R}^r\) are Lagrange multipliers, a.k.a. dual variables.
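As a simple running illustration (not part of the general development), consider the one-dimensional problem \(\min_x\ (x-2)^2\) subject to \(x-1 \leq 0\). Its Lagrangian is \[\mathcal{L}(x, \lambda)=(x-2)^2+\lambda(x-1), \qquad \lambda \in \mathbb{R}.\]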

The Lagrange dual function (dual function) is \[g(\boldsymbol{\lambda}, \boldsymbol{\nu})=\inf_{\mathbf{x} \in \mathcal{D}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}).\]

The domain of \(g\), \(\text{dom}(g)\), is the set of values \((\boldsymbol{\lambda}, \boldsymbol{\nu})\) for which the Lagrangian is bounded from below, i.e. \(g(\boldsymbol{\lambda}, \boldsymbol{\nu})>-\infty\). A dual feasible pair \((\boldsymbol{\lambda}, \boldsymbol{\nu})\) is a pair where \(\boldsymbol{\lambda} \succeq 0\) and \((\boldsymbol{\lambda}, \boldsymbol{\nu}) \in \text{dom}(g)\).
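For the running example above, setting \(\partial \mathcal{L}/\partial x=2(x-2)+\lambda=0\) gives \(x=2-\lambda/2\), so \[g(\lambda)=\inf_{x} \left((x-2)^2+\lambda(x-1)\right)=\lambda-\frac{\lambda^2}{4},\] which is finite for every \(\lambda\); hence \(\text{dom}(g)=\mathbb{R}\) and the dual feasible points are simply \(\lambda \geq 0\).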

In principle (but not in practice), we could convert \((\text{P})\) to the unconstrained problem \[\min\ \widetilde{f}(\mathbf{x}):=f_0(\mathbf{x})+\sum_{i=1}^m I_-(f_i(\mathbf{x}))+\sum_{j=1}^r I_0(h_j(\mathbf{x}))\] where \(I_-(u)=\begin{cases} 0, &u \leq 0 \\ \infty, &u>0 \end{cases}\) and \(I_0(u)=\begin{cases} 0, &u=0 \\ \infty, &u \neq 0 \end{cases}\).

If \(\boldsymbol{\lambda} \succeq 0\), then the Lagrangian is a lower bound on \(\widetilde{f}(\mathbf{x})\), i.e. for all \(\mathbf{x} \in \mathbb{R}^n\), \(\boldsymbol{\lambda} \in \mathbb{R}^m\) with \(\boldsymbol{\lambda} \succeq 0\), and \(\boldsymbol{\nu} \in \mathbb{R}^r\), we have \(\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}) \leq \widetilde{f}(\mathbf{x})\). More concretely, we have \[I_-(f_i(\mathbf{x}))=\sup_{\lambda_i \in \mathbb{R}^+} \lambda_if_i(\mathbf{x})=\begin{cases}0, &f_i(\mathbf{x}) \leq 0 \\\infty, &f_i(\mathbf{x})>0\end{cases}\] and \[I_0(h_j(\mathbf{x}))=\sup_{\nu_j \in \mathbb{R}} \nu_jh_j(\mathbf{x})=\begin{cases}0, &h_j(\mathbf{x})=0 \\\infty, &h_j(\mathbf{x}) \neq 0\end{cases}\] and thus \[\begin{aligned}\widetilde{f}(\mathbf{x})&=f_0(\mathbf{x})+\sum_{i=1}^m \sup_{\lambda_i \in \mathbb{R}^+} \lambda_if_i(\mathbf{x})+\sum_{j=1}^r \sup_{\nu_j \in \mathbb{R}} \nu_jh_j(\mathbf{x})\\&=\sup_{\forall i, j: \lambda_i \in \mathbb{R}^+, \nu_j \in \mathbb{R}} \left(f_0(\mathbf{x})+\sum_{i=1}^m \lambda_if_i(\mathbf{x})+\sum_{j=1}^r \nu_jh_j(\mathbf{x})\right)\\&=\sup_{\boldsymbol{\lambda} \succeq 0, \boldsymbol{\nu}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}).\end{aligned}\]

3. Dual Problem

The dual problem attempts to find the best lower bound \(g(\boldsymbol{\lambda}, \boldsymbol{\nu})\) on the optimal value \(p^*\) of \((\text{P})\):\[\begin{align}\begin{aligned} \max\ &g(\boldsymbol{\lambda}, \boldsymbol{\nu}) \\ \text{s.t.}\ &\boldsymbol{\lambda} \succeq 0 \end{aligned}\tag{D}\end{align}\] and we denote its optimal value as \(d^*=g(\boldsymbol{\lambda}^*, \boldsymbol{\nu}^*)\).

\((\text{D})\) is a convex optimization problem regardless of whether \((\text{P})\) is convex, since the function being maximized, \(g\), is concave (it is a pointwise infimum of functions affine in \((\boldsymbol{\lambda}, \boldsymbol{\nu})\)) and the constraint set is convex.
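For the running example, the dual problem is \[\max_{\lambda \geq 0}\ \lambda-\frac{\lambda^2}{4},\] a concave maximization attaining its optimum at \(\lambda^*=2\) with \(d^*=1\). The primal optimum is attained at \(x^*=1\) with \(p^*=(1-2)^2=1\), so the duality gap is zero in this example.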

Theorem 3.1    If \(\boldsymbol{\lambda} \succeq 0\), then for all \(\boldsymbol{\nu}\) we have \(g(\boldsymbol{\lambda}, \boldsymbol{\nu}) \leq p^*\).

Proof. Assume \(\mathbf{x} \in \mathcal{D}\) is primal feasible and \(\boldsymbol{\lambda} \succeq 0\). Then \[\sum_{i=1}^m \lambda_if_i(\mathbf{x})+\sum_{j=1}^r \nu_jh_j(\mathbf{x}) \leq 0,\] since each \(\lambda_if_i(\mathbf{x}) \leq 0\) and each \(h_j(\mathbf{x})=0\), and so \[\begin{aligned} g(\boldsymbol{\lambda}, \boldsymbol{\nu})&=\inf_{\mathbf{x} \in \mathcal{D}} \left(f_0(\mathbf{x})+\sum_{i=1}^m \lambda_if_i(\mathbf{x})+\sum_{j=1}^r \nu_jh_j(\mathbf{x})\right) \\&\leq f_0(\mathbf{x})+\sum_{i=1}^m \lambda_if_i(\mathbf{x})+\sum_{j=1}^r \nu_jh_j(\mathbf{x}) \\&\leq f_0(\mathbf{x}). \end{aligned}\] Since this holds for every feasible \(\mathbf{x}\), we conclude \(g(\boldsymbol{\lambda}, \boldsymbol{\nu}) \leq p^*\).

\(\square\)

Property 3.2 (Weak Duality)    \(d^* \leq p^*\).

Proof. From Theorem 3.1, we immediately have \[d^*=\sup_{\boldsymbol{\lambda} \succeq 0, \boldsymbol{\nu}} g(\boldsymbol{\lambda}, \boldsymbol{\nu}) \leq p^*.\]

\(\square\)
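As a small numerical sanity check, the following sketch (assuming NumPy; the closed-form \(g(\lambda)\) is the one derived for the running example) compares the best lower bound found over a grid of dual-feasible \(\lambda\) with \(p^*\):

```python
import numpy as np

# Illustrative sketch: verify weak duality d* <= p* for the running example
#   min (x-2)^2  s.t.  x - 1 <= 0,
# whose dual function is g(lambda) = lambda - lambda^2 / 4 (derived above).

def f0(x):
    return (x - 2.0) ** 2

def g(lam):
    # Dual function: the Lagrangian minimized over x in closed form.
    return lam - lam ** 2 / 4.0

p_star = f0(1.0)                      # primal optimum, attained at x* = 1
lams = np.linspace(0.0, 10.0, 10001)  # grid over the dual-feasible set lambda >= 0
d_star = g(lams).max()                # best lower bound found on the grid

print(f"p* = {p_star:.4f}, best g(lambda) on the grid = {d_star:.4f}")
assert d_star <= p_star + 1e-9        # weak duality
```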

The difference \(p^*-d^*\) is called the optimal duality gap. If the duality gap is zero, i.e. \(d^*=p^*\), then strong duality holds. The most common (but not the only) sufficient condition for strong duality is Slater's condition for a convex optimization problem, i.e.

  • each \(f_i(\mathbf{x})\) is a convex function and each \(h_j(\mathbf{x})\) is affine (\(h_j(\mathbf{x})=\mathbf{a}_j^\top\mathbf{x}-b_j=0\) and thus we can represent the equality constraints as \(A\mathbf{x}=\mathbf{b}\));

  • there exists a strictly feasible point \(\widetilde{\mathbf{x}} \in \text{relint}(\mathcal{D})\) s.t. for all \(i\), \(f_i(\widetilde{\mathbf{x}})<0\), and \(A\widetilde{\mathbf{x}}=\mathbf{b}\), where \(\text{relint}(\mathcal{D})\) is the relative interior of \(\mathcal{D}\).

A condition under which strong duality holds is called a constraint qualification.
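In the running example, \(f_0(x)=(x-2)^2\) and \(f_1(x)=x-1\) are convex, there are no equality constraints, and \(\widetilde{x}=0\) satisfies \(f_1(\widetilde{x})=-1<0\), so Slater's condition holds; this is consistent with the zero duality gap observed above.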

3.1. Saddle Point Characterization of Weak and Strong Duality

The primal problem can be solved via the unconstrained saddle point problem \[p^*=\inf_{\mathbf{x} \in \mathcal{D}} \widetilde{f}(\mathbf{x})=\inf_{\mathbf{x} \in \mathcal{D}} \sup_{\boldsymbol{\lambda} \succeq 0, \boldsymbol{\nu}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}).\] Deriving the dual form of the problem amounts to switching the order of the two optimizations: \[d^*=\sup_{\boldsymbol{\lambda} \succeq 0, \boldsymbol{\nu}} \inf_{\mathbf{x} \in \mathcal{D}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu}).\] The max-min inequality guarantees that \(d^* \leq p^*\).
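The max-min inequality itself follows from a short argument: for any \(\bar{\mathbf{x}} \in \mathcal{D}\), \(\bar{\boldsymbol{\lambda}} \succeq 0\) and \(\bar{\boldsymbol{\nu}}\), \[\inf_{\mathbf{x} \in \mathcal{D}} \mathcal{L}(\mathbf{x}, \bar{\boldsymbol{\lambda}}, \bar{\boldsymbol{\nu}}) \leq \mathcal{L}(\bar{\mathbf{x}}, \bar{\boldsymbol{\lambda}}, \bar{\boldsymbol{\nu}}) \leq \sup_{\boldsymbol{\lambda} \succeq 0, \boldsymbol{\nu}} \mathcal{L}(\bar{\mathbf{x}}, \boldsymbol{\lambda}, \boldsymbol{\nu}),\] and taking the supremum over \((\bar{\boldsymbol{\lambda}}, \bar{\boldsymbol{\nu}})\) on the left and the infimum over \(\bar{\mathbf{x}}\) on the right gives \(d^* \leq p^*\).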

3.2. Optimality Condition

Suppose strong duality holds, and let \(\mathbf{x}^*\) be an optimal solution of \((\text{P})\) and \((\boldsymbol{\lambda}^*, \boldsymbol{\nu}^*)\) an optimal solution of \((\text{D})\). Then \[\begin{aligned} f_0(\mathbf{x}^*)&=g(\boldsymbol{\lambda}^*, \boldsymbol{\nu}^*) \\&=\inf_{\mathbf{x} \in \mathcal{D}} \left(f_0(\mathbf{x})+\sum_{i=1}^m \lambda_i^*f_i(\mathbf{x})+\sum_{j=1}^r \nu_j^*h_j(\mathbf{x})\right) \\&\leq f_0(\mathbf{x}^*)+\sum_{i=1}^m \lambda_i^*f_i(\mathbf{x}^*)+\sum_{j=1}^r \nu_j^*h_j(\mathbf{x}^*) \\&\leq f_0(\mathbf{x}^*). \end{aligned}\] Hence both inequalities hold with equality, and in particular \[\sum_{i=1}^m \lambda_i^*f_i(\mathbf{x}^*)=0.\] Since every term \(\lambda_i^*f_i(\mathbf{x}^*)\) is nonpositive, each term must vanish, which is the condition of complementary slackness, i.e. \[\begin{aligned} \lambda_i^*>0 &\Rightarrow f_i(\mathbf{x}^*)=0, \\ f_i(\mathbf{x}^*)<0 &\Rightarrow \lambda_i^*=0. \end{aligned}\]
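In the running example, \(\lambda^*=2>0\) and \(f_1(x^*)=x^*-1=0\): the constraint is active at the optimum and complementary slackness holds.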

Suppose \(f_i\) and \(h_j\) are differentiable and the duality gap is zero. Then, since \(\mathbf{x}^*\) minimizes \(\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}^*, \boldsymbol{\nu}^*)\) over \(\mathbf{x}\), we have \[\nabla f_0(\mathbf{x}^*)+\sum_{i=1}^m \lambda_i^* \nabla f_i(\mathbf{x}^*)+\sum_{j=1}^r \nu_j^* \nabla h_j(\mathbf{x}^*)=\mathbf{0}.\]

Gathering the various conditions for optimality, the KKT conditions for the primal and dual variables \((\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\nu})\) are \[\begin{aligned} \nabla f_0(\mathbf{x})+\sum_{i=1}^m \lambda_i \nabla f_i(\mathbf{x})+\sum_{j=1}^r \nu_j \nabla h_j(\mathbf{x}) &=\mathbf{0}, \\ f_i(\mathbf{x}) &\leq 0, i=1, \ldots, m, \\ h_j(\mathbf{x}) &=0, j=1, \ldots, r, \\ \lambda_i &\geq 0, i=1, \ldots, m, \\ \lambda_if_i(\mathbf{x}) &=0, i=1, \ldots, m. \end{aligned}\]
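As a quick check, in the running example the KKT conditions at \((x^*, \lambda^*)=(1, 2)\) read \[2(x^*-2)+\lambda^*=2(1-2)+2=0, \qquad f_1(x^*)=0 \leq 0, \qquad \lambda^*=2 \geq 0, \qquad \lambda^*f_1(x^*)=0,\] so all of them are satisfied.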

If a convex optimization problem with differentiable objective and constraint functions satisfies Slater's condition, then the KKT conditions are necessary and sufficient for global optimality.
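As a final sketch (assuming the third-party cvxpy package, which is not used elsewhere in this note), one can solve the running example numerically and read off the dual variable that the solver attaches to the inequality constraint:

```python
import cvxpy as cp

# Hedged illustrative sketch: solve  min (x-2)^2  s.t.  x <= 1  with cvxpy and
# inspect the optimal dual variable of the inequality constraint.
x = cp.Variable()
constraint = x <= 1
prob = cp.Problem(cp.Minimize(cp.square(x - 2)), [constraint])
prob.solve()

print("p* =", prob.value)                  # expected 1.0 (zero duality gap here)
print("x* =", x.value)                     # expected 1.0
print("lambda* =", constraint.dual_value)  # expected 2.0; lambda* > 0 and f_1(x*) = 0
```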