
Loss Landscape

1. Topology of Loss Landscape

Consider the loss function \[\mathcal{L}(\theta; X, Y)=n^{-1}\sum_{\mu=1}^n l(\theta; \mathbf{x}_\mu, \mathbf{y}_\mu)\] and its associated level set \[\Omega_\mathcal{L}(\lambda)=\{\theta: \mathcal{L}(\theta; X, Y) \leq \lambda\}.\]

Let \(N_\lambda\) denote the number of connected components of \(\Omega_\mathcal{L}(\lambda)\). If \(N_\lambda=1\) for all \(\lambda\), then \(\mathcal{L}(\theta; X, Y)\) has no isolated local minima and any descent method can reach a global minimum. If \(N_\lambda>1\), there may be spurious valleys: connected components whose minima do not achieve the global minimum.
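
As a concrete probe of this picture, the sketch below (a toy two-layer ReLU model on synthetic data; the model, the data, and the stand-in parameter vectors \(\theta_1, \theta_2\) are all assumptions made for illustration) evaluates the loss along a straight-line path between two parameter vectors. If the loss stays below \(\lambda\) along some path, the endpoints lie in the same connected component of \(\Omega_\mathcal{L}(\lambda)\); a barrier on every path one tries is (weak) numerical evidence of separate components.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, N = 5, 20, 40                        # input dim, hidden width, data count
    X = rng.standard_normal((N, n))
    Y = np.tanh(X @ rng.standard_normal(n))    # synthetic targets

    def unpack(theta):
        W1 = theta[: m * n].reshape(m, n)
        w2 = theta[m * n :]
        return W1, w2

    def loss(theta):
        W1, w2 = unpack(theta)
        pred = np.maximum(X @ W1.T, 0.0) @ w2  # two-layer ReLU network
        return np.mean((pred - Y) ** 2)

    # Stand-ins for two low-loss parameter vectors (in practice: two trained nets).
    theta_1 = 0.1 * rng.standard_normal(m * n + m)
    theta_2 = 0.1 * rng.standard_normal(m * n + m)

    # Loss along the straight-line path gamma(t) = (1-t) theta_1 + t theta_2.
    ts = np.linspace(0.0, 1.0, 101)
    barrier = max(loss((1 - t) * theta_1 + t * theta_2) for t in ts)
    print("worst loss along the straight path:", barrier)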

1.1. Linear Network: Single Component

Theorem (Freeman et al., 2016)     Let \(H(\mathbf{x}; \theta)\) be an \(L\)-layer linear network given by \(\mathbf{h}^{(\mathscr{l})}=W^{(\mathscr{l})}\mathbf{h}^{(\mathscr{l}-1)}\) with \(W^{(\mathscr{l})} \in \mathbb{R}^{n_\mathscr{l} \times n_{\mathscr{l}-1}}\). If \(n_\mathscr{l}>\min(n_0, n_L)\) for all \(0<\mathscr{l}<L\), then every level set \(\Omega_\mathcal{L}(\lambda)\) of the sum of squares loss has a single connected component.

1.2. ReLU Network: Multiple Components

Theorem (Freeman et al., 2016)     Let \(H(\mathbf{x}; \theta)\) be an \(L\)-layer net given by \(\mathbf{h}^{(\mathscr{l})}=\phi(W^{(\mathscr{l})}\mathbf{h}^{(\mathscr{l}-1)})\) with \(W^{(\mathscr{l})} \in \mathbb{R}^{n_\mathscr{l} \times n_{\mathscr{l}-1}}\) and \(\phi(\cdot)=\max(0, \cdot)\). Then for any choice of the widths \(n_\mathscr{l}\) there is a distribution of data \((X, Y)\) such that some level set \(\Omega_\mathcal{L}(\lambda)\) has more than one connected component.
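
The data distributions in the theorem are constructed case by case; as a simpler self-contained toy (my own construction, not the one from the paper), the sketch below fits a single-ReLU model \(f(x)=a\,\phi(wx+b)\) from a "dead" initialization, where the gradient vanishes identically and gradient descent remains stuck at positive loss even though zero-loss parameters exist, illustrating the kind of bad region a ReLU landscape can contain.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = x.copy()                      # targets fit exactly by (w, b, a) = (1, 0, 1)

    def loss_and_grad(w, b, a):
        z = w * x + b
        h = np.maximum(z, 0.0)        # ReLU
        r = a * h - y                 # residuals
        act = (z > 0).astype(float)   # ReLU derivative
        gw = np.mean(2 * r * a * act * x)
        gb = np.mean(2 * r * a * act)
        ga = np.mean(2 * r * h)
        return np.mean(r ** 2), gw, gb, ga

    w, b, a = -1.0, -0.5, 1.0         # "dead" start: w*x + b < 0 on all the data
    for _ in range(1000):
        L, gw, gb, ga = loss_and_grad(w, b, a)
        w, b, a = w - 0.01 * gw, b - 0.01 * gb, a - 0.01 * ga
    print("loss after 1000 steps:", L, " (global minimum is 0)")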

1.3. ReLU Activation Network: Nearly Connected

Theorem (Venturi et al., 2016)     Consider a two-layer ReLU network \(H(\mathbf{x}; \theta)=W^{(2)}\phi(W^{(1)}\mathbf{x})\) with \(W^{(1)} \in \mathbb{R}^{m \times n}\) and \(W^{(2)} \in \mathbb{R}^m\). Then for any two parameters \(\theta_1\) and \(\theta_2\) with \(\mathcal{L}(\theta_i) \leq \lambda\) for \(i=1, 2\), there is a path \(\gamma(t)\) from \(\theta_1\) to \(\theta_2\) such that \(\mathcal{L}(\gamma(t)) \leq \max(\lambda, m^{-1/n})\) for all \(t\).

1.4. Quadratic Activation Network: Single Component

Theorem (Venturi et al., 2016)     Let \(H(\mathbf{x}; \theta)\) be an \(L\)-layer net given by \(\mathbf{h}^{(\mathscr{l})}=\phi(W^{(\mathscr{l})}\mathbf{h}^{(\mathscr{l}-1)})\) with \(W^{(\mathscr{l})} \in \mathbb{R}^{n_\mathscr{l} \times n_{\mathscr{l}-1}}\) and quadratic activation \(\phi(z)=z^2\). If the layer widths satisfy \(n_\mathscr{l} \geq 3N^{2^\mathscr{l}}\), where \(N\) is the number of data samples, then every level set \(\Omega_\mathcal{L}(\lambda)\) of the sum of squares loss has a single connected component. For the two-layer case with a single quadratic activation this simplifies to a hidden width \(n_1 > 2N\).
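
One informal way to see why the quadratic activation is benign (my own illustration on assumed toy data, not a proof of the theorem or of its constants): a two-layer quadratic network computes \(H(\mathbf{x})=\sum_i w^{(2)}_i (\mathbf{w}^{(1)}_i\cdot\mathbf{x})^2=\mathbf{x}^\top M\mathbf{x}\) with \(M=\sum_i w^{(2)}_i \mathbf{w}^{(1)}_i(\mathbf{w}^{(1)}_i)^\top\), so exact fitting is a linear problem in the symmetric matrix \(M\), and an eigendecomposition of any interpolating \(M\) recovers network weights.

    import numpy as np

    rng = np.random.default_rng(1)
    n, N = 4, 6                            # input dim, data count (N <= n(n+1)/2)
    X = rng.standard_normal((N, n))
    y = rng.standard_normal(N)             # arbitrary scalar targets

    # Exact fitting x_mu^T M x_mu = y_mu is linear in M.
    A = np.stack([np.outer(x, x).ravel() for x in X])       # N x n^2 design matrix
    M = np.linalg.lstsq(A, y, rcond=None)[0].reshape(n, n)
    M = 0.5 * (M + M.T)                    # symmetrize; x^T M x is unchanged

    # Factor M into two-layer quadratic-activation weights via eigendecomposition.
    evals, evecs = np.linalg.eigh(M)
    W1, w2 = evecs.T, evals                # rows of W1 are eigenvectors
    pred = ((X @ W1.T) ** 2) @ w2          # H(x) = sum_i w2_i (w1_i . x)^2
    print("max fit error:", np.max(np.abs(pred - y)))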

2. Manifold of Global Minimizers

Theorem (Cooper, 2021)     Let \(H(\mathbf{x}; \theta)\) be a DNN from \(\mathbb{R}^n\) to \(\mathbb{R}^r\) with smooth nonlinear activation \(\phi(\cdot)\) and \(d\) weight and bias trainable parameters, let the loss function over \(m\) distinct data elements be defined as \[\mathcal{L}=(2m)^{-1}\sum_{\mu=1}^m \|H(\mathbf{x}_\mu; \theta)-\mathbf{y}_\mu\|^2_2,\] and let \(\Omega_{\mathcal{L}}^*(0)=\{\theta \in \mathbb{R}^d: \mathcal{L}(\theta; X, Y)=0\}\) be the set of trainable parameters for which the DNN exactly fits the \(m\) data elements. Then, subject to a possibly arbitrarily small perturbation, the set \(\Omega_\mathcal{L}^*(0)\) is a smooth \((d-rm)\)-dimensional sub-manifold (possibly empty) of \(\mathbb{R}^d\).
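
The dimension count can be checked numerically on a small example (an assumed toy tanh architecture; finite differences stand in for automatic differentiation). Generically the Jacobian of the fitting map \(\theta \mapsto (H(\mathbf{x}_\mu;\theta))_{\mu} \in \mathbb{R}^{rm}\) has full row rank \(rm\), so its level sets, and in particular \(\Omega_\mathcal{L}^*(0)\) when nonempty, are locally of dimension \(d-rm\).

    import numpy as np

    rng = np.random.default_rng(2)
    n, width, r, m = 3, 8, 2, 4               # input dim, hidden width, output dim, data count
    X = rng.standard_normal((m, n))

    def forward(theta):
        """Two-layer tanh network with biases; output flattened into R^{r*m}."""
        W1 = theta[: width * n].reshape(width, n); i = width * n
        b1 = theta[i : i + width];                 i += width
        W2 = theta[i : i + r * width].reshape(r, width); i += r * width
        b2 = theta[i:]
        return (np.tanh(X @ W1.T + b1) @ W2.T + b2).ravel()

    d = width * n + width + r * width + r     # number of trainable parameters
    theta = rng.standard_normal(d)

    # Central finite-difference Jacobian of the fitting map, shape (r*m, d).
    h = 1e-5
    J = np.stack([(forward(theta + h * e) - forward(theta - h * e)) / (2 * h)
                  for e in np.eye(d)], axis=1)
    rank = np.linalg.matrix_rank(J)
    print("Jacobian rank:", rank, " expected r*m =", r * m,
          " local solution-set dimension d - r*m =", d - r * m)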

3. Summary

  • The loss landscape of a DNN can be non-convex and hence difficult to optimize.

  • The number of connected components of a loss landscape level set can be analyzed; in some settings there is a single component, which greatly aids optimization.

  • Increasing width of a DNN can improve the loss landscape.

  • The local shape of a random net can be analyzed, showing that near a minimum the Hessian has only non-negative eigenvalues.

  • When the number of trainable parameters exceeds the product of the number of data elements and the output dimension, DNNs with smooth non-linear activations which exactly fit the data have a smooth manifold of global minimizers of known dimension.

4. Improvement

  • Larger training batch size narrows the loss function, while weight decay (adding a \(\|\theta\|^2_2\) penalty to the loss) broadens it.

  • Adding skip connections through residual networks can greatly smooth the loss landscape; a 1-D loss-slice probe of the kind used to visualize this is sketched after this list.

  • Batch normalization can help train parameters in bulk and in doing so improve the rate of training (though it is superfluous for expressivity).

  • CNNs can even be convexified, which may limit overall accuracy but ensures ease of training regardless of initialization.
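
A minimal sketch of the 1-D loss-slice probe referenced above, of the kind used to visualize how skip connections or batch normalization change the landscape. Here loss_fn and theta_star are placeholders (a toy quadratic and random weights) so the sketch runs; in practice they would be a network's loss and its trained parameters, with the random direction often normalized filter-by-filter rather than globally.

    import numpy as np

    def loss_slice(loss_fn, theta_star, alphas, rng):
        """Evaluate loss_fn(theta_star + alpha * delta) along a random direction delta."""
        delta = rng.standard_normal(theta_star.shape)
        delta *= np.linalg.norm(theta_star) / np.linalg.norm(delta)   # global rescaling
        return np.array([loss_fn(theta_star + a * delta) for a in alphas])

    rng = np.random.default_rng(3)
    theta_star = rng.standard_normal(20)                          # stand-in "trained" weights
    toy_loss = lambda th: float(np.sum((th - theta_star) ** 2))   # placeholder loss
    print(loss_slice(toy_loss, theta_star, np.linspace(-1.0, 1.0, 11), rng).round(2))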