
Optimization Algorithms for Training DNNs

1. DNN Loss Function

Consider a fully connected \(L\)-layer deep network given by \[\mathbf{h}^{(\mathscr{l})}=W^{(\mathscr{l})}\mathbf{z}^{(\mathscr{l}-1)}+\mathbf{b}^{(\mathscr{l})}, \quad \mathbf{z}^{(\mathscr{l})}=\phi(\mathbf{h}^{(\mathscr{l})}), \quad \mathscr{l}=1, \ldots, L,\] with input \(\mathbf{z}^{(0)}=\mathbf{x}\), nonlinear activation \(\phi(\cdot)\) applied entrywise, and \(W^{(\mathscr{l})} \in \mathbb{R}^{n_\mathscr{l} \times n_{\mathscr{l}-1}}\). The trainable parameters of the DNN are \(\theta:=\{W^{(\mathscr{l})}, \mathbf{b}^{(\mathscr{l})}\}_{\mathscr{l}=1}^L\), which are learned by minimizing a high-dimensional (\(|\theta| \sim n^2L\) when the layer widths are of order \(n\)) loss function \(\mathcal{L}\).

The shape of \(\mathcal{L}(\theta; X, Y)\) and our knowledge of a good initial guess \(\theta^{(0)}\) strongly influence our ability to learn the parameters \(\theta\) of the DNN.

Let \(\displaystyle \delta_\mathscr{l}:=\frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(\mathscr{l})}}\) and let \(D^{(\mathscr{l})}\) be the diagonal matrix with \(D_{ii}^{(\mathscr{l})}=\phi'(h_i^{(\mathscr{l})})\). Backpropagation gives \[\delta_\mathscr{l}=D^{(\mathscr{l})}(W^{(\mathscr{l}+1)})^T\delta_{\mathscr{l}+1}, \quad \mathscr{l}=1, \ldots, L-1, \quad \text{and} \quad \delta_L=D^{(L)}\nabla_{\mathbf{z}^{(L)}} \mathcal{L},\] which gives the formula for computing \(\delta_\mathscr{l}\) at each layer as \[\delta_\mathscr{l}=\left(\prod_{k=\mathscr{l}}^{L-1} D^{(k)}(W^{(k+1)})^T\right)D^{(L)}\nabla_{\mathbf{z}^{(L)}}\mathcal{L},\] and the resulting gradient \(\nabla_\theta \mathcal{L}\) has entries \[\frac{\partial \mathcal{L}}{\partial W^{(\mathscr{l})}} = \delta_{\mathscr{l}} \cdot (\mathbf{z}^{(\mathscr{l}-1)})^T \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(\mathscr{l})}} = \delta_{\mathscr{l}}.\]
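A minimal NumPy sketch of these forward/backward recursions, assuming a tanh activation and a squared-error loss \(\mathcal{L}=\frac{1}{2}\|\mathbf{z}^{(L)}-\mathbf{y}\|^2\) (both choices, and all parameter values, are illustrative rather than fixed by the notes):

```python
import numpy as np

def forward(x, Ws, bs, phi=np.tanh):
    """Forward pass: h^(l) = W^(l) z^(l-1) + b^(l), z^(l) = phi(h^(l)), z^(0) = x."""
    zs, hs = [x], []
    for W, b in zip(Ws, bs):
        h = W @ zs[-1] + b
        hs.append(h)
        zs.append(phi(h))
    return hs, zs

def backward(hs, zs, y, Ws, dphi=lambda h: 1.0 - np.tanh(h) ** 2):
    """Backward pass for the squared-error loss L = 0.5 * ||z^(L) - y||^2."""
    L = len(Ws)
    grads_W, grads_b = [None] * L, [None] * L
    delta = dphi(hs[-1]) * (zs[-1] - y)          # delta_L = D^(L) dL/dz^(L)
    for l in reversed(range(L)):                 # 0-based layer index l = L-1, ..., 0
        grads_W[l] = np.outer(delta, zs[l])      # dL/dW^(l+1) = delta_{l+1} (z^(l))^T
        grads_b[l] = delta                       # dL/db^(l+1) = delta_{l+1}
        if l > 0:
            delta = dphi(hs[l - 1]) * (Ws[l].T @ delta)  # delta_l = D^(l) (W^(l+1))^T delta_{l+1}
    return grads_W, grads_b

# tiny usage example on random data
rng = np.random.default_rng(0)
sizes = [3, 4, 2]                                # n_0, n_1, n_2
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
bs = [rng.standard_normal(sizes[i + 1]) for i in range(len(sizes) - 1)]
x, y = rng.standard_normal(3), rng.standard_normal(2)
hs, zs = forward(x, Ws, bs)
gW, gb = backward(hs, zs, y, Ws)
```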

2. Stochastic Gradient Descent (SGD)

Given a loss function \(\mathcal{L}(\theta; X, Y)\), gradient descent is given by \[\theta^{(k+1)} = \theta^{(k)} - \alpha \cdot \nabla_\theta \mathcal{L}(\theta; X, Y),\] where \(\alpha\) is called the learning rate. In deep learning, \(\mathcal{L}(\theta; X, Y)\) is the average of \(m\) individual loss functions, one per data point: \[\mathcal{L}(\theta; X, Y)=m^{-1}\sum_{\mu=1}^m l(\theta; \mathbf{x}_\mu, \mathbf{y}_\mu).\] For \(m \gg 1\), gradient descent is computationally too costly; instead one can break the \(m\) loss functions into mini-batches and repeatedly solve \[\theta^{(k+1)} = \theta^{(k)} - \alpha |\mathcal{S}_k|^{-1} \nabla_\theta \sum_{\mu \in \mathcal{S}_k} l(\theta; \mathbf{x}_\mu, \mathbf{y}_\mu),\] which is called stochastic gradient descent (SGD) because the mini-batch \(\mathcal{S}_k\) is typically chosen in some randomized fashion, usually from a partition of \([m]\); a sequence of mini-batches \(\mathcal{S}_k\) that covers \([m]\) is called an epoch.
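A minimal sketch of the mini-batch SGD loop above, where `grad_l` is an assumed user-supplied function returning the averaged gradient over a batch; the least-squares usage at the end is purely illustrative:

```python
import numpy as np

def sgd(grad_l, theta0, X, Y, alpha=0.1, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD: each epoch visits a random partition of the m samples.

    grad_l(theta, Xb, Yb) must return |S_k|^{-1} * sum_{mu in S_k} grad l(theta; x_mu, y_mu).
    """
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    m = X.shape[0]
    for _ in range(epochs):
        perm = rng.permutation(m)                    # random partition of [m]
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]     # the mini-batch S_k
            theta -= alpha * grad_l(theta, X[idx], Y[idx])
    return theta

# illustrative usage: least squares l(theta; x, y) = 0.5 * (x^T theta - y)^2
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
Y = X @ rng.standard_normal(5)
grad_ls = lambda th, Xb, Yb: Xb.T @ (Xb @ th - Yb) / len(Yb)
theta_hat = sgd(grad_ls, np.zeros(5), X, Y)
```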

2.1. Challenges and Benefits

  • SGD is preferable for large \(m\) since the per-iteration computational cost scales with \(|\mathcal{S}_k|\) rather than with \(m\).

  • SGD and gradient descent require selection of a learning rate, which in deep learning is typically done using costly trial-and-error heuristics.

  • When the learning rate is decreased adaptively, it is typically chosen so that \[\sum_{k=1}^\infty \alpha_k=\infty \quad \text{and} \quad \sum_{k=1}^\infty \alpha_k^2<\infty,\] for example \(\alpha_k \sim k^{-1}\).

  • The optimal selection of the learning rate and of \(\mathcal{S}_k\) depends on the unknown local Lipschitz constants \(L_\mu\) in \[\| \nabla l(\theta_1; \mathbf{x}_\mu, \mathbf{y}_\mu) - \nabla l(\theta_2; \mathbf{x}_\mu, \mathbf{y}_\mu) \| \leq L_\mu \| \theta_1 - \theta_2 \|.\]

2.2. Global Convergence of Gradient Descent

Lemma 2.1 (Overestimation Property)    Let \(\mathcal{L}(\theta) \in \mathcal{C}^1(\mathbb{R}^n)\) with \(\nabla \mathcal{L}\) Lipschitz continuous with constant \(L\). Then for any \(\theta\), \(\mathbf{d} \in \mathbb{R}^n\) and \(\alpha \in \mathbb{R}\): \[\mathcal{L}(\theta + \alpha \mathbf{d}) \leq \mathcal{L}(\theta) + \alpha \nabla \mathcal{L}(\theta)^T \mathbf{d} + \alpha^2 \frac{L}{2} \|\mathbf{d}\|^2.\] In particular, if \(\mathbf{d}=-\nabla \mathcal{L}(\theta)\), then \[\begin{aligned} \mathcal{L}(\theta - \alpha \nabla \mathcal{L}(\theta)) &\leq \mathcal{L}(\theta) - \alpha \| \nabla \mathcal{L}(\theta) \|^2 + \frac{L}{2} \alpha^2 \| \nabla \mathcal{L}(\theta) \|^2 \\ &=\mathcal{L}(\theta) - \alpha \left(1 - \frac{L}{2}\alpha \right) \| \nabla \mathcal{L}(\theta) \|^2. \end{aligned}\]

Proof. By Taylor's theorem in integral form, we have \[\begin{aligned} \mathcal{L}(\theta+\alpha\mathbf{d}) &= \mathcal{L}(\theta)+\int_0^1 \nabla \mathcal{L}(\theta+\alpha t \mathbf{d})^T (\alpha\mathbf{d}) \,\text{d}t \\&=\mathcal{L}(\theta) + \alpha \nabla \mathcal{L}(\theta)^T\mathbf{d} + \alpha \int_0^1 [\nabla \mathcal{L}(\theta+\alpha t \mathbf{d})-\nabla \mathcal{L}(\theta)]^T\mathbf{d} \,\text{d}t \\&\leq \mathcal{L}(\theta) + \alpha \nabla \mathcal{L}(\theta)^T\mathbf{d} + \alpha \int_0^1 \| \nabla \mathcal{L}(\theta+\alpha t \mathbf{d})-\nabla \mathcal{L}(\theta) \| \cdot \| \mathbf{d} \| \,\text{d}t \\&\leq \mathcal{L}(\theta) + \alpha \nabla \mathcal{L}(\theta)^T\mathbf{d} + \alpha L \|\mathbf{d}\| \int_0^1 \|\theta+\alpha t \mathbf{d}-\theta \|\,\text{d}t \\&\leq \mathcal{L}(\theta) + \alpha \nabla \mathcal{L}(\theta)^T\mathbf{d} + \alpha^2 L \|\mathbf{d}\|^2 \int_0^1 t\,\text{d}t \\&= \mathcal{L}(\theta) + \alpha \nabla \mathcal{L}(\theta)^T\mathbf{d} + \frac{\alpha^2 L}{2} \|\mathbf{d}\|^2. \end{aligned}\]

\(\square\)
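As a numerical sanity check of Lemma 2.1, the sketch below evaluates both sides of the overestimation bound for an illustrative quadratic loss, whose gradient Lipschitz constant equals its largest Hessian eigenvalue (all data are synthetic assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A.T @ A + np.eye(5)                  # SPD Hessian of the quadratic loss
L = np.linalg.eigvalsh(H).max()          # Lipschitz constant of the gradient

loss = lambda th: 0.5 * th @ H @ th
grad = lambda th: H @ th

theta = rng.standard_normal(5)
d = -grad(theta)                          # descent direction of the lemma
for alpha in [0.01, 0.1, 1.0 / L, 2.0 / L]:
    lhs = loss(theta + alpha * d)
    rhs = loss(theta) + alpha * grad(theta) @ d + 0.5 * L * alpha**2 * d @ d
    assert lhs <= rhs + 1e-12             # overestimation property holds
```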

2.3. Global Convergence of SGD

Suppose for convenience that \(|\mathcal{S}_k|=1\). The expectation of the stochastic gradient \(\displaystyle G^k:=\nabla_\theta \sum_{\mu \in \mathcal{S}_k} l(\theta^k; \mathbf{x}_\mu, \mathbf{y}_\mu)\) over the random choice of data point is \[\mathbb{E}_{\mathcal{S}_k}[G^k]=\sum_{i=1}^m \mathbb{E}[G^k \mid \mathcal{S}_k=\{i\}]\,P(\mathcal{S}_k=\{i\})=\sum_{i=1}^m \nabla l_i(\theta^k) \cdot \frac{1}{m}=\nabla \mathcal{L}(\theta^k).\]

We also assume (\(|\mathcal{S}_k|=1\)):

    (1) for all \(i \leq m\), \(\nabla l_i\) is Lipschitz continuous with constant \(L\);

    (2) there exists \(M>0\) s.t. \(\text{Var}[G^k \mid \mathcal{S}_k]:=\mathbb{E}_{\mathcal{S}_k}[(G^k-\nabla \mathcal{L}(\theta^k))^T(G^k- \nabla \mathcal{L}(\theta^k))] \leq M\) for all \(k\).

Bounded total variance can usually be guaranteed in a neighborhood of \(\theta^*\), but not globally for strongly convex \(\mathcal{L}(\cdot)\).

Recall that \(G^k\) is an unbiased estimator of the true gradient; this holds here (and also for \(|\mathcal{S}_k|>1\)), but in a more general stochastic framework it would have to be assumed.

Lemma 2.2 (Overestimation Property in Expectation)    Assume assumption (1) holds. When applying SGD to \(\mathcal{L}(\theta)\) with \(|\mathcal{S}_k|=1\), we have \[\mathbb{E}_{\mathcal{S}_k}[\mathcal{L}(\theta^{k+1})] \leq \mathcal{L}(\theta^k)-\alpha \nabla\mathcal{L}(\theta^k)^T \mathbb{E}_{\mathcal{S}_k}[G^k] + \frac{L\alpha^2}{2}\mathbb{E}_{\mathcal{S}_k}[\|G^k\|^2].\] If assumption (2) also holds, then \[\mathbb{E}_{\mathcal{S}_k}[\mathcal{L}(\theta^{k+1})] \leq \mathcal{L}(\theta^k)+\alpha^k\left(\frac{L\alpha^k}{2}-1\right) \|\nabla \mathcal{L}(\theta^k)\|^2+\frac{ML(\alpha^k)^2}{2}.\]

Proof. Apply lemma 2.1 to \(\mathcal{L}\) with \(\theta=\theta^k\), \(\mathbf{d}=-G^k\) and \(\alpha=\alpha^k\). Using \(\theta^{k+1}=\theta^k-\alpha^kG^k\), we have \[\mathcal{L}(\theta^{k+1}) \leq \mathcal{L}(\theta^k)-\alpha^k \nabla\mathcal{L}(\theta^k)^TG^k+\frac{L}{2}(\alpha^k)^2\|G^k\|^2.\]

Applying expectation on both sides w.r.t. \(\mathcal{S}_k\): \[\mathbb{E}_{\mathcal{S}_k}[\mathcal{L}(\theta^{k+1})] \leq \mathcal{L}(\theta^k)-\alpha^k \nabla\mathcal{L}(\theta^k)^T\mathbb{E}_{\mathcal{S}_k}[G^k]+\frac{L}{2}(\alpha^k)^2 \mathbb{E}_{\mathcal{S}_k}[\|G^k\|^2],\] where \(\mathcal{L}(\theta^k)\) and \(\nabla \mathcal{L}(\theta^k)\) do not depend on \(\mathcal{S}_k\). Since \(\mathbb{E}_{\mathcal{S}_k}[G^k]=\nabla \mathcal{L}(\theta^k)\), we have \[\begin{aligned} \text{Var}[G^k \mid \mathcal{S}_k]&=\mathbb{E}_{\mathcal{S}_k}[\|G^k\|^2] - 2\nabla \mathcal{L}(\theta^k)^T \mathbb{E}_{\mathcal{S}_k}[G^k] + \|\nabla \mathcal{L}(\theta^k)\|^2 \\&=\mathbb{E}_{\mathcal{S}_k}[\|G^k\|^2]-\|\nabla \mathcal{L}(\theta^k)\|^2, \end{aligned}\] and thus \(\mathbb{E}_{\mathcal{S}_k}[\|G^k\|^2] \leq M + \|\nabla \mathcal{L}(\theta^k)\|^2\). Substituting this bound into the first inequality gives the second claim.

\(\square\)

Theorem 2.3    Let \(\mathcal{L}\) be smooth and strongly convex, i.e., for some \(\mu>0\), \(\displaystyle \mathcal{L}(\mathbf{x}+\mathbf{s}) \geq \mathcal{L}(\mathbf{x})+\mathbf{s}^T \nabla\mathcal{L}(\mathbf{x})+\frac{\mu}{2}\|\mathbf{s}\|^2\) for all \(\mathbf{x}\) and \(\mathbf{s}\), and satisfying assumptions (1) and (2). Let SGD with fixed learning rate \(\displaystyle \alpha^k=\alpha=\frac{\eta}{L}\), \(\eta \in (0, 1]\), be applied to minimize \(\mathcal{L}\). Then SGD converges linearly to within a residual error: for all \(k \geq 0\), \[\mathbb{E}[\mathcal{L}(\theta^k)]-\mathcal{L}(\theta^*)-\frac{\eta M}{2\mu} \leq \left(1-\frac{\eta\mu}{L}\right)^k \cdot \left[\mathcal{L}(\theta^0)-\mathcal{L}(\theta^*)-\frac{\eta M}{2 \mu}\right].\]

Proof. Lemma 2.2 and \(\displaystyle \frac{L\alpha}{2}-1=\frac{\eta}{2}-1\leq-\frac{1}{2}\) give \[\mathbb{E}_{\mathcal{S}_k}[\mathcal{L}(\theta^{k+1})] \leq \mathcal{L}(\theta^k)-\frac{\alpha}{2} \|\nabla \mathcal{L}(\theta^k)\|^2+\frac{ML\alpha^2}{2}.\] Now take the expectation also w.r.t. the past draws \(\mathcal{S}_0, \ldots, \mathcal{S}_{k-1}\), writing \(\mathbb{E}_k\) for the total expectation over \(\mathcal{S}_0, \ldots, \mathcal{S}_k\); since \(\theta^k\) depends only on \(\mathcal{S}_0, \ldots, \mathcal{S}_{k-1}\), the tower property gives:

\[\mathbb{E}_k[\mathcal{L}(\theta^{k+1})]-\mathcal{L}(\theta^*) \leq \mathbb{E}_{k-1}[\mathcal{L}(\theta^k)]-\mathcal{L}(\theta^*)-\frac{\alpha}{2}\mathbb{E}_{k-1}[\|\nabla \mathcal{L}(\theta^k)\|^2]+\frac{ML\alpha^2}{2}.\] A consequence of strong convexity (the Polyak–Łojasiewicz inequality) is that the global minimizer \(\theta^*\) is unique and \[\mathcal{L}(\theta^k)-\mathcal{L}(\theta^*) \leq \frac{1}{2\mu} \|\nabla \mathcal{L}(\theta^k)\|^2,\] and thus \[\mathbb{E}_{k-1}[\mathcal{L}(\theta^k)-\mathcal{L}(\theta^*)] \leq \frac{1}{2\mu} \mathbb{E}_{k-1}[\| \nabla \mathcal{L}(\theta^k)\|^2].\]

We deduce \[\mathbb{E}_k[\mathcal{L}(\theta^{k+1})]-\mathcal{L}(\theta^*) \leq (1-\mu\alpha)(\mathbb{E}_{k-1}[\mathcal{L}(\theta^k)]-\mathcal{L}(\theta^*))+\frac{ML\alpha^2}{2}\] or equivalently \[\mathbb{E}_k[\mathcal{L}(\theta^{k+1})]-\mathcal{L}(\theta^*)-\frac{\alpha ML}{2\mu} \leq (1-\mu\alpha)\left(\mathbb{E}_{k-1}[\mathcal{L}(\theta^k)]-\mathcal{L}(\theta^*)-\frac{\alpha ML}{2\mu}\right).\]

Note that \(\displaystyle \alpha=\frac{\eta}{L} \leq \frac{1}{L} \leq \frac{1}{\mu}\). Replacing \(\alpha\) gives \[\mathbb{E}_k[\mathcal{L}(\theta^{k+1})]-\mathcal{L}(\theta^*)-\frac{M\eta}{2\mu} \leq \left(1-\frac{\eta\mu}{L}\right)\left(\mathbb{E}_{k-1}[\mathcal{L}(\theta^k)]-\mathcal{L}(\theta^*)-\frac{M\eta}{2\mu}\right).\] By induction, the claim holds.

\(\square\)

Thus \[\limsup_{k \to \infty}\, (\mathbb{E}[\mathcal{L}(\theta^k)]-\mathcal{L}(\theta^*)) \leq \frac{\alpha ML}{2\mu}=\frac{\eta M}{2 \mu}.\] Convergence is obtained, in expectation, up to the level \(\displaystyle \frac{\eta M}{2\mu}\) (the noise level), which can be decreased in various ways. The ratio \(\displaystyle \frac{L}{\mu}\) is the condition number of \(\mathcal{L}\).
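A sketch illustrating Theorem 2.3 on an assumed noisy least-squares problem (all data below are synthetic): with a fixed stepsize, the optimality gap decays quickly at first and then stalls near the noise level \(\frac{\eta M}{2\mu}\):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
X = rng.standard_normal((m, n))
Y = X @ rng.standard_normal(n) + 0.5 * rng.standard_normal(m)   # noisy labels -> M > 0

loss = lambda th: 0.5 * np.mean((X @ th - Y) ** 2)
H = X.T @ X / m
L_smooth = np.linalg.eigvalsh(H).max()
theta_opt = np.linalg.solve(H, X.T @ Y / m)

theta = np.zeros(n)
alpha = 1.0 / L_smooth                            # eta = 1 (illustrative choice)
for k in range(2000):
    i = rng.integers(m)                           # |S_k| = 1
    g = (X[i] @ theta - Y[i]) * X[i]              # gradient of a single l_i
    theta -= alpha * g
    if k % 500 == 0:
        print(k, loss(theta) - loss(theta_opt))   # gap plateaus at the noise floor
```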

2.4. Decreasing the SGD Noise Level

Though not always desirable (due to the need for a small generalization error), the SGD noise level \(\displaystyle \frac{\eta M}{2\mu}\) can be removed so that \[\lim_{k \to \infty} \mathbb{E}[\mathcal{L}(\theta^k)]=\mathcal{L}(\theta^*).\]

2.4.1. Dynamic Stepsize Reduction

We dynamically reduce \(\displaystyle \alpha^k=\frac{\eta_k}{L}\). Note that \(\eta_k \to 0\) makes the residual \(\displaystyle \frac{\eta_kM}{2\mu} \to 0\), but it also means that the contraction factor \(\displaystyle \left(1-\frac{\eta_k\mu}{L}\right) \to 1\), so the price is that we lose linear convergence.

Theorem 2.4 (Dynamic Stepsize Stochastic Gradient Descent, DS-SGD)    Let \(\displaystyle \alpha^k=\frac{2}{2L+k\mu}\) for all \(k \geq 0\). Then SGD satisfies \[0 \leq \mathbb{E}[\mathcal{L}(\theta^k)]-\mathcal{L}(\theta^*) \leq \frac{v}{2L/\mu+k}\] for all \(k \geq 0\), where \(\displaystyle v:=\frac{2L}{\mu}\max\left\{\frac{M}{\mu}, \mathcal{L}(\theta^0)-\mathcal{L}(\theta^*)\right\}\). Thus \[\lim_{k \to \infty} \mathbb{E}[\mathcal{L}(\theta^k)]=\mathcal{L}(\theta^*).\]

The rate is sublinear, \(\displaystyle \mathcal{O}\left(\frac{1}{k}\right)\).

Proof. Note that \(\displaystyle \alpha^k \leq \alpha^0 = \frac{1}{L} \leq \frac{1}{\mu}\), so all arguments in the proof of theorem 2.3 continue to hold up to and including the recursion, and for all \(k \geq 0\), \[\mathbb{E}_k[\mathcal{L}(\theta^{k+1})]-\mathcal{L}(\theta^*)-\frac{\alpha^kML}{2\mu} \leq (1-\mu\alpha^k)\left(\mathbb{E}_{k-1}[\mathcal{L}(\theta^k)]-\mathcal{L}(\theta^*)-\frac{\alpha^kML}{2\mu}\right).\]

We prove the conclusion by induction. Clearly the conclusion holds at \(k=0\). Assume it holds at some \(k \geq 0\); substituting it into the above inequality, we obtain \[\mathbb{E}_k[\mathcal{L}(\theta^{k+1})]-\mathcal{L}(\theta^*)-\frac{\alpha^kML}{2\mu} \leq (1-\mu\alpha^k)\left(\frac{v}{2L/\mu+k}-\frac{\alpha^kML}{2\mu}\right).\]

Using the expression for \(\alpha^k\) in the above and simplifying yields the conclusion with \(k\) replaced by \(k+1\).

\(\square\)
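For comparison with the fixed-stepsize sketch in Section 2.3, the following sketch runs the same kind of synthetic least-squares problem with the dynamic stepsize of Theorem 2.4; the optimality gap now decays like \(\mathcal{O}(1/k)\) with no noise floor (all data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
X = rng.standard_normal((m, n))
Y = X @ rng.standard_normal(n) + 0.5 * rng.standard_normal(m)
H = X.T @ X / m
eigs = np.linalg.eigvalsh(H)
L_smooth, mu = eigs.max(), eigs.min()             # smoothness and strong convexity constants
theta_opt = np.linalg.solve(H, X.T @ Y / m)
loss = lambda th: 0.5 * np.mean((X @ th - Y) ** 2)

theta = np.zeros(n)
for k in range(50000):
    alpha_k = 2.0 / (2.0 * L_smooth + k * mu)     # stepsize of Theorem 2.4
    i = rng.integers(m)
    theta -= alpha_k * (X[i] @ theta - Y[i]) * X[i]
print(loss(theta) - loss(theta_opt))              # decays like O(1/k): no noise floor
```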

2.4.2. Increasing Mini-batch Size

We increase the mini-batch size from \(|\mathcal{S}_k|=1\) to \(|\mathcal{S}_k|=p \geq 1\). Let \[G^k=\frac{1}{p}\sum_{j \in \mathcal{S}_k} \nabla l_j(\theta^k),\] where the indices \(j \in \mathcal{S}_k\) are drawn i.i.d. from \(\mathcal{U}(\{1, \ldots, m\})\). Then \[\begin{aligned} \text{Var}[G^k \mid \mathcal{S}_k]&=\sum_{j \in \mathcal{S}_k} \frac{1}{p^2} \mathbb{E}_{\mathcal{S}_k} [\| \nabla l_j(\theta^k)-\nabla \mathcal{L}(\theta^k)\|^2] \\&\ \ \ \ +2\sum_{j<i} \frac{1}{p^2} \mathbb{E}_{\mathcal{S}_k}[\nabla l_j(\theta^k)-\nabla \mathcal{L}(\theta^k)]^T \mathbb{E}_{\mathcal{S}_k}[\nabla l_i(\theta^k)-\nabla \mathcal{L}(\theta^k)] \\&=\frac{1}{p^2}\sum_{j \in \mathcal{S}_k} \text{Var}[\nabla l_j(\theta^k)]+0 \leq \frac{M}{p}, \end{aligned}\] where we used \(|\mathcal{S}_k|=p\), the independence of the indices \(i\) and \(j\) in \(\mathcal{S}_k\) in the first equality, and the lack of bias \(\mathbb{E}_{\mathcal{S}_k}[\nabla l_j(\theta^k)]=\nabla \mathcal{L}(\theta^k)\). We also have \(\mathbb{E}_{\mathcal{S}_k}[G^k]=\nabla \mathcal{L}(\theta^k)\), i.e., the batch gradient is unbiased. Then as in theorem 2.3, we deduce, under the same assumptions, \[\mathbb{E}[\mathcal{L}(\theta^k)]-\mathcal{L}(\theta^*)-\frac{\eta M}{2\mu p} \leq \left(1-\frac{\eta\mu}{L}\right)^k \cdot \left( \mathcal{L}(\theta^0)-\mathcal{L}(\theta^*)-\frac{\eta M}{2\mu p}\right).\] Thus the noise level is decreased by the batch size \(p\), without impacting the convergence factor.
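A short sketch estimating \(\text{Var}[G^k \mid \mathcal{S}_k]\) empirically on a synthetic least-squares problem (illustrative data only); the estimate shrinks roughly like \(1/p\):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 5
X = rng.standard_normal((m, n))
Y = X @ rng.standard_normal(n) + rng.standard_normal(m)
theta = rng.standard_normal(n)
full_grad = X.T @ (X @ theta - Y) / m             # nabla L(theta)

def batch_grad(p):
    idx = rng.integers(m, size=p)                 # i.i.d. uniform indices, |S_k| = p
    return X[idx].T @ (X[idx] @ theta - Y[idx]) / p

for p in (1, 4, 16, 64):
    samples = np.array([batch_grad(p) for _ in range(5000)])
    var = np.mean(np.sum((samples - full_grad) ** 2, axis=1))
    print(p, var)                                 # variance shrinks roughly like 1/p
```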

2.4.3. Momentum for Gradient Variance Reduction

We use acceleration by momentum to reduce \(\text{Var}[G^k \mid \mathcal{S}_k]\), which yields \(\mathbb{E}[\mathcal{L}(\theta^k)] \to \mathcal{L}(\theta^*)\) with a linear convergence rate, at a much smaller cost per iteration than increasing the mini-batch size.

2.4.4. Other Techniques

  • Variance reduction (SVRG)

  • SAG

  • SAGA

2.4.5. Conclusion

Each of the three approaches (2.4.1 to 2.4.3) for accelerating SGD has merit, and often all are used at once. In particular, once SGD appears to stagnate, one both reduces the stepsize and increases the batch size, though this is stopped once the validation error begins to increase.

2.5. Global Convergence of SGD: Non-convex

Theorem 2.5 (SGD with Fixed Stepsize)    Let \(\mathcal{L} \in \mathcal{C}^1(\mathbb{R}^n)\) be bounded below by \(\mathcal{L}_\text{low}\), with \(\nabla \mathcal{L}\) Lipschitz continuous with constant \(L\) (assumption (1)), and assume assumption (2) holds. Apply the SGD method with fixed stepsize \(\displaystyle \alpha=\frac{\eta}{L}\), where \(\eta \in (0, 1]\), and \(|\mathcal{S}_k|=1\), to minimize \(\mathcal{L}\). Then \[\min_{0 \leq i \leq k} \mathbb{E}[\|\nabla \mathcal{L}(\theta^i)\|^2] \leq \alpha LM + \frac{2(\mathcal{L}(\theta^0)-\mathcal{L}_\text{low})}{k\alpha}=\eta M + \frac{2L(\mathcal{L}(\theta^0)-\mathcal{L}_\text{low})}{k\eta},\] and therefore the SGD method takes at most \(\displaystyle k \leq \frac{2L(\mathcal{L}(\theta^0)-\mathcal{L}_\text{low})}{\eta\varepsilon}\) iterations/gradient evaluations to generate an iterate \(\theta^i\), \(0 \leq i \leq k\), with \(\mathbb{E}[\|\nabla \mathcal{L}(\theta^i)\|^2] \leq \varepsilon+\eta M\).

Proof. As in the proof of theorem 2.3, Lemma 2.2 gives \[\mathbb{E}_k[\mathcal{L}(\theta^{k+1})] \leq \mathbb{E}_{k-1}[\mathcal{L}(\theta^k)]-\frac{\alpha}{2}\mathbb{E}_{k-1}[\|\nabla \mathcal{L}(\theta^k)\|^2]+\frac{ML\alpha^2}{2}.\] We need to connect the per-iteration decrease with the gradient. For all \(k \geq 0\), \[\mathbb{E}_{k-1}[\mathcal{L}(\theta^k)]-\mathbb{E}_k[\mathcal{L}(\theta^{k+1})] \geq \frac{\alpha}{2} \mathbb{E}_{k-1}[\|\nabla \mathcal{L}(\theta^k)\|^2]-\frac{ML\alpha^2}{2}.\] Summing up the above bound from \(i=0\) to \(k\), we deduce \[\begin{aligned} \mathcal{L}(\theta^0)-\mathcal{L}_\text{low} &\geq \mathcal{L}(\theta^0)-\mathbb{E}_k[\mathcal{L}(\theta^{k+1})] \\&\geq \frac{\alpha}{2}\sum_{i=0}^k \mathbb{E}_{i-1}[\|\nabla \mathcal{L}(\theta^i)\|^2]-(k+1)\frac{ML\alpha^2}{2} \\&\geq \frac{\alpha}{2}(k+1)\left[\min_{0 \leq i \leq k} \mathbb{E}[\|\nabla \mathcal{L}(\theta^i)\|^2]-ML\alpha\right]. \end{aligned}\] Dividing by \(\frac{\alpha}{2}(k+1)\), rearranging, and using \(k+1>k\) gives the stated bound.

\(\square\)

As in the strongly convex case, there is a noise floor (here \(\eta M\)) that limits the attainable accuracy. Momentum-based acceleration is more difficult to analyze in the nonconvex case.

3. SGD Improvements

There are many improvements to SGD that are typically used in practice for deep learning.

3.1. Momentum

  • Polyak momentum: \[\theta^{(k+1)}=\theta^{(k)} + \beta(\theta^{(k)} - \theta^{(k-1)}) - \alpha \nabla_\theta \mathcal{L}(\theta^{(k)}).\]

  • Nesterov's accelerated gradient: \[\begin{aligned} \widehat{\theta}^{(k)} &= \theta^{(k)}+\beta(\theta^{(k)}-\theta^{(k-1)}) \\ \theta^{(k+1)} &= \widehat{\theta}^{(k)} - \alpha\nabla_\theta\mathcal{L}(\widehat{\theta}^{(k)}) \end{aligned}\]

These acceleration methods give substantial improvements in the linear convergence rate for strongly convex problems with condition number \(\kappa=L/\mu\). Note that the linear convergence rates are: normal GD \(\displaystyle \frac{\kappa-1}{\kappa+1}\), Polyak \(\displaystyle \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\), and NAG \(\displaystyle \sqrt{\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}}}\).
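Minimal sketches of the two momentum updates above; the quadratic test problem and the values of \(\alpha\) and \(\beta\) in the usage line are illustrative, not tuned:

```python
import numpy as np

def polyak(grad, theta0, alpha, beta, iters):
    """Heavy-ball / Polyak momentum."""
    theta, theta_prev = theta0.copy(), theta0.copy()
    for _ in range(iters):
        theta, theta_prev = (theta + beta * (theta - theta_prev)
                             - alpha * grad(theta), theta)
    return theta

def nesterov(grad, theta0, alpha, beta, iters):
    """Nesterov's accelerated gradient: gradient taken at the look-ahead point."""
    theta, theta_prev = theta0.copy(), theta0.copy()
    for _ in range(iters):
        look_ahead = theta + beta * (theta - theta_prev)
        theta, theta_prev = look_ahead - alpha * grad(look_ahead), theta
    return theta

# illustrative usage on an ill-conditioned quadratic (kappa = 100)
H = np.diag(np.linspace(1.0, 100.0, 10))
grad = lambda th: H @ th
theta = nesterov(grad, np.ones(10), alpha=1.0 / 100.0, beta=0.9, iters=500)
print(np.linalg.norm(theta))
```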

3.2. Adaptive Sub-Gradient

Preconditioning improves the convergence rate of line-search methods.

Let \(\mathbf{g}^{(k)}(\theta^{(k)}):=\nabla_\theta \mathcal{L}(\theta^{(k)})\) be the gradient of the training loss function at iteration \(k\) and \[B_k(i, i)=\left(\sum_{j=1}^k (\mathbf{g}^{(j)}(\theta^{(j)})(i))^2\right)^{1/2}\] be the square root of the diagonal of the sum of prior gradient outer products. Adaptive sub-gradient (AdaGrad) is preconditioned GD via the diagonal matrix \(B_k\): \[\theta^{(k+1)} = \theta^{(k)} - \eta|\Lambda_k|^{-1}(B_k+\varepsilon I)^{-1}\nabla_\theta\sum_{\mu \in \Lambda_k} l(\theta; \mathbf{x}_\mu, \mathbf{y}_\mu),\] where \(\eta\) is the learning rate, the mini-batch \(\Lambda_k\) is chosen in some randomized fashion, and \(\varepsilon I\) with \(\varepsilon>0\) is added to avoid poor scaling when \(B_k(i, i)\) is small.
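A minimal sketch of the diagonal AdaGrad update above, where `grad_batch` stands for an assumed, user-supplied mini-batch gradient \(|\Lambda_k|^{-1}\nabla_\theta\sum_{\mu\in\Lambda_k}l\); the quadratic usage is illustrative:

```python
import numpy as np

def adagrad(grad_batch, theta0, eta=0.01, eps=1e-8, iters=1000):
    """Diagonal AdaGrad: scale each coordinate by the root of the accumulated
    squared gradients, as in B_k(i, i) above."""
    theta = theta0.copy()
    accum = np.zeros_like(theta)                  # sum_j g^(j)(i)^2
    for _ in range(iters):
        g = grad_batch(theta)                     # stochastic or full gradient
        accum += g ** 2
        theta -= eta * g / (np.sqrt(accum) + eps)
    return theta

# illustrative usage on a simple quadratic
grad_q = lambda th: 2.0 * th
theta = adagrad(grad_q, np.ones(4), eta=0.5)
```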

3.2.1. AdaGrad Improvement: RMSProp and AdaDelta

RMSProp gives more weight to recent gradients: \[B_k^\text{RMS}(i, i)=\gamma B_{k-1}^\text{RMS}(i, i) + (1-\gamma)(\mathbf{g}^{(k)}(\theta^{(k)})(i))^2\] for some \(\gamma \in [0, 1]\), and updates as \[\theta^{(k+1)}=\theta^{(k)} - \eta|\Lambda_k|^{-1}(B_k^\text{RMS}+\varepsilon I)^{-1/2}\nabla_\theta\sum_{\mu \in \Lambda_k} l(\theta; \mathbf{x}_\mu, \mathbf{y}_\mu).\]
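The corresponding sketch for RMSProp, with the same assumed `grad_batch` convention as in the AdaGrad sketch above:

```python
import numpy as np

def rmsprop(grad_batch, theta0, eta=0.001, gamma=0.9, eps=1e-8, iters=1000):
    """RMSProp: exponential moving average of squared gradients."""
    theta = theta0.copy()
    avg_sq = np.zeros_like(theta)                 # B_k^RMS(i, i)
    for _ in range(iters):
        g = grad_batch(theta)
        avg_sq = gamma * avg_sq + (1.0 - gamma) * g ** 2
        theta -= eta * g / (np.sqrt(avg_sq) + eps)
    return theta

# illustrative usage on a simple quadratic
grad_q = lambda th: 2.0 * th
theta = rmsprop(grad_q, np.ones(4), eta=0.1)
```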

AdaDelta (Zeiler, 2012) extended AdaGrad using a preconditioner similar to \(B_k^\text{RMS}\), but also estimated the stepsize using a running average of the parameter differences \(\theta^{(k)}-\theta^{(k-1)}\).

3.2.2. Scalar AdaGrad

3.2.2.1. Iteration Complexity

Initialize with \(\theta^{(0)}\) and \(b_0>0\), and iterate \[\begin{aligned} b_k^2 &= b_{k-1}^2+\|\nabla_\theta\mathcal{L}(\theta^{(k-1)})\|^2, \\ \theta^{(k)} &= \theta^{(k-1)} - b_k^{-1}\nabla_\theta\mathcal{L}(\theta^{(k-1)}). \end{aligned}\]
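A minimal sketch of scalar (batch) AdaGrad, where `grad` is an assumed full-gradient oracle for \(\nabla_\theta\mathcal{L}\); the quadratic usage is illustrative:

```python
import numpy as np

def scalar_adagrad(grad, theta0, b0=0.1, iters=1000):
    """Scalar (norm) AdaGrad: one adaptive stepsize 1/b_k shared by all coordinates."""
    theta, b_sq = theta0.copy(), b0 ** 2
    for _ in range(iters):
        g = grad(theta)
        b_sq += g @ g                        # b_k^2 = b_{k-1}^2 + ||grad||^2
        theta -= g / np.sqrt(b_sq)           # theta^(k) = theta^(k-1) - b_k^{-1} grad
    return theta

# illustrative usage on a simple quadratic
grad_q = lambda th: 2.0 * th
theta = scalar_adagrad(grad_q, np.ones(4))
```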

For \(\mathcal{L}(\theta) \in \mathcal{C}_L^1\), that is, with \(L\) the minimal constant for which \(\|\nabla_\theta\mathcal{L}(\theta_1)-\nabla_\theta\mathcal{L}(\theta_2)\| \leq L\|\theta_1-\theta_2\|\) for all \(\theta_1\) and \(\theta_2\), scalar batch AdaGrad satisfies \[\min_{k=1, \ldots, T-1} \|\nabla_\theta\mathcal{L}(\theta^{(k)})\|^2 \leq \varepsilon\] for either \[T=1+\lceil 2\varepsilon^{-1}\mathcal{L}(\theta^{(0)})(b_0+2\mathcal{L}(\theta^{(0)})) \rceil\ \ \text{if}\ \ b_0 \geq L\] or \[T=1+\lceil \varepsilon^{-1}(L^2-b_0^2+4(\mathcal{L}(\theta^{(0)})+(3/4+\ln(L/b_0))L)^2) \rceil\ \ \text{if}\ \ b_0<L.\] In contrast, if \(b_k\) is a fixed constant \(b\), then for \(b<L/2\) GD can diverge, while for \(b \geq L\) one has \(T=2b\varepsilon^{-1}\mathcal{L}(\theta^{(0)})\).

The iteration complexity of scalar batch AdaGrad follows from the properties that, for any non-negative values \(a_1, \ldots, a_T\) with \(a_1>0\), \[\sum_{\mathscr{l}=1}^T \frac{a_\mathscr{l}}{\sum_{i=1}^\mathscr{l} a_i} \leq \ln\left(\sum_{i=1}^T a_i\right)+1 \quad \text{and} \quad \sum_{\mathscr{l}=1}^T \frac{a_\mathscr{l}}{\sqrt{\sum_{i=1}^\mathscr{l} a_i}} \leq 2\sqrt{\sum_{i=1}^T a_i}.\]

In addition, for any fixed \(\varepsilon \in (0, 1]\) and \(L, b_0>0\), the iteration \(b_{k+1}^2=b_k^2+a_k\) has the property that after \(N=\lceil \varepsilon^{-1}(L^2-b_0^2) \rceil+1\) iterations, either \(\displaystyle \min_{0 \leq k \leq N-1} a_k \leq \varepsilon\) or \(b_N \geq L\).

Let \(k_0\) be the first index s.t. \(b_{k_0} \geq L\). Then for all \(k \geq k_0\), \[b_k \leq b_{k_0-1}+2\mathcal{L}(\theta^{(k_0-1)}) \quad \text{(bounded above)}\] and \[\mathcal{L}(\theta^{(k_0-1)}) \leq \frac{L}{2}(1+2\ln(b_{k_0-1} / b_0)) \quad \text{(not diverged)}.\]

3.2.2.2. Influence of Mini-batch or Other Gradient Approximation

Let \(\mathbf{g}^{(k)}\) be an unbiased estimator of the gradient \(\nabla_\theta \mathcal{L}(\theta^{(k-1)})\) of the training loss function at iteration \(k\), i.e., \(\mathbb{E}[\mathbf{g}^{(k)}]=\nabla_\theta\mathcal{L}(\theta^{(k-1)})\). Moreover, suppose there is a uniform bound \(\mathbb{E}[\|\mathbf{g}^{(k)}\|^2] \leq c_\mathbf{g}^2\). Then consider the stochastic scalar AdaGrad update \[\begin{aligned} b_k^2 &= b_{k-1}^2+\|\mathbf{g}^{(k)}\|^2, \\ \theta^{(k)} &= \theta^{(k-1)}-b_k^{-1}\mathbf{g}^{(k)}. \end{aligned}\]

Unlike in the batch version of AdaGrad, where the stepsize \(b_k^{-1}\) converges to a fixed value, in the stochastic version \(b_k\) grows roughly like \(b_k \approx c_\mathbf{g}k^{1/2}\). Moreover, Ward et al. (2019) showed that \[\min_{\mathscr{l}=0, \ldots, N-1}(\mathbb{E} \|\nabla_\theta\mathcal{L}(\theta^{(\mathscr{l})})\|^{4/3})^{3/2} \leq \mathcal{O}\left(\frac{b_0+c_\mathbf{g}}{N}+\frac{c_\mathbf{g}}{N^{1/2}}\right)\ln(Nc_\mathbf{g}^2/b_0^2).\]

4. Reference

Rachel Ward, Xiaoxia Wu, and Leon Bottou. AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes. In Proceedings of the 36th International Conference on Machine Learning, 2019.
 
Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv Preprint arXiv:1212.5701, 2012.