1. Topology of Loss Landscape
Consider the loss function \[\mathcal{L}(\theta; X, Y)=n^{-1}\sum_{\mu=1}^n l(\theta; \mathbf{x}_\mu, \mathbf{y}_\mu)\] and its associated level set \[\Omega_\mathcal{L}(\lambda)=\{\theta: \mathcal{L}(\theta; X, Y) \leq \lambda\}.\]
Let \(N_\lambda\) denote the number of connected components of \(\Omega_\mathcal{L}(\lambda)\). If \(N_\lambda=1\) for all \(\lambda\), then \(\mathcal{L}(\theta; X, Y)\) has no isolated local minima and any descent method can reach a global minimum. If \(N_\lambda>1\) for some \(\lambda\), there may be spurious valleys: connected components whose minimum value does not achieve the global minimum.
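As a minimal one-dimensional illustration of a spurious valley (a hand-made example, not taken from the works cited below), consider \[\mathcal{L}(\theta)=\min\bigl((\theta-1)^2,\ (\theta+1)^2+\tfrac{1}{2}\bigr),\] which has a global minimum \(\mathcal{L}=0\) at \(\theta=1\) and a local minimum \(\mathcal{L}=\tfrac{1}{2}\) at \(\theta=-1\), separated by a barrier of height \((9/8)^2=81/64\) at \(\theta=-1/8\). For \(\tfrac{1}{2}\leq\lambda<81/64\) the level set \(\Omega_\mathcal{L}(\lambda)\) consists of two disjoint intervals, so \(N_\lambda=2\), and a descent method started near \(\theta=-1\) stalls at loss \(\tfrac{1}{2}\) rather than reaching the global minimum.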
1.1. Linear Network: Single Component
Theorem (Freeman et al., 2016) Let \(H(\mathbf{x}; \theta)\) be an \(L\)-layer linear net given by \(\mathbf{h}^{(\mathscr{l})}=W^{(\mathscr{l})}\mathbf{h}^{(\mathscr{l}-1)}\) with \(W^{(\mathscr{l})} \in \mathbb{R}^{n_\mathscr{l} \times n_{\mathscr{l}-1}}\). If \(n_\mathscr{l}>\min(n_0, n_L)\) for all \(0<\mathscr{l}<L\), then every level set \(\Omega_\mathcal{L}(\lambda)\) of the sum-of-squares loss has a single connected component.
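A minimal example of the role of the width condition (an illustration, not the proof of the theorem): take \(n_0=n_L=1\), so the network computes \(H(x;\theta)=\mathbf{w}^{(2)\top}\mathbf{w}^{(1)}x\) with \(\mathbf{w}^{(1)}, \mathbf{w}^{(2)}\in\mathbb{R}^{n_1}\). If \(n_1=1\), the fiber \(\{(w^{(1)}, w^{(2)}): w^{(2)}w^{(1)}=a\}\) of parameters realizing a fixed nonzero slope \(a\) is a hyperbola with two connected components, and for small \(\lambda\) the level set \(\Omega_\mathcal{L}(\lambda)\) inherits this disconnection. If \(n_1\geq 2\) (so \(n_1>\min(n_0,n_L)\)), the quadric \(\{(\mathbf{w}^{(1)}, \mathbf{w}^{(2)}): \mathbf{w}^{(2)\top}\mathbf{w}^{(1)}=a\}\) is connected, which is the mechanism allowing paths between equivalent parameterizations to stay inside the level set.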
1.2. ReLU Network: Multiple Components
Theorem (Freeman et al., 2016) Let \(H(\mathbf{x}; \theta)\) be an \(L\)-layer net given by \(\mathbf{h}^{(\mathscr{l})}=\phi(W^{(\mathscr{l})}\mathbf{h}^{(\mathscr{l}-1)})\) with \(W^{(\mathscr{l})} \in \mathbb{R}^{n_\mathscr{l} \times n_{\mathscr{l}-1}}\) and \(\phi(\cdot)=\max(0, \cdot)\). Then for any choice of widths \(n_\mathscr{l}\) there is a data distribution \((X, Y)\) such that some level set \(\Omega_\mathcal{L}(\lambda)\) has more than one connected component.
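The obstruction can already be seen with a single hidden unit (a hand-constructed illustration in the spirit of the theorem, not the construction used in the paper): let \(H(x;\theta)=v\,\phi(wx)\) with \(w, v\in\mathbb{R}\) and data \((x_1, y_1)=(1, 1)\), \((x_2, y_2)=(-1, 1)\), so that \[\mathcal{L}(w,v)=\tfrac{1}{2}\bigl[(v\,\phi(w)-1)^2+(v\,\phi(-w)-1)^2\bigr].\] For \(w>0\) the loss is \(\tfrac{1}{2}[(vw-1)^2+1]\geq\tfrac{1}{2}\), for \(w<0\) it is \(\tfrac{1}{2}[1+(v|w|-1)^2]\geq\tfrac{1}{2}\), and at \(w=0\) it equals \(1\). Any path between a near-optimal point with \(w>0\) and one with \(w<0\) must cross \(w=0\), so \(\Omega_\mathcal{L}(\lambda)\) has at least two connected components for \(\tfrac{1}{2}\leq\lambda<1\).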
1.3. ReLU Activation Network: Nearly Connected
Theorem (Freeman et al., 2016) Consider a two-layer ReLU network \(H(\mathbf{x}; \theta)=W^{(2)}\phi(W^{(1)}\mathbf{x})\) with \(W^{(1)} \in \mathbb{R}^{m \times n}\) and \(W^{(2)} \in \mathbb{R}^m\). For any two parameters \(\theta_1\) and \(\theta_2\) with \(\mathcal{L}(\theta_i) \leq \lambda\) for \(i=1, 2\), there is a path \(\gamma(t)\) from \(\theta_1\) to \(\theta_2\) such that \(\mathcal{L}(\gamma(t)) \leq \max(\lambda, m^{-1/n})\).
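To make the rate concrete: the second term in the bound falls below a tolerance \(\epsilon\) only once \(m^{-1/n}\leq\epsilon\), i.e. \(m\geq\epsilon^{-n}\). For input dimension \(n=10\) and \(\epsilon=0.1\) this already asks for \(m\geq 10^{10}\) hidden units, so the "nearly connected" guarantee, while qualitatively reassuring, degrades rapidly with the input dimension (a curse of dimensionality in the amount of over-parameterization required).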
1.4. Quadratic Activation Network: Single Component
Theorem (Venturi et al., 2018) Let \(H(\mathbf{x}; \theta)\) be an \(L\)-layer net given by \(\mathbf{h}^{(\mathscr{l})}=\phi(W^{(\mathscr{l})}\mathbf{h}^{(\mathscr{l}-1)})\) with \(W^{(\mathscr{l})} \in \mathbb{R}^{n_\mathscr{l} \times n_{\mathscr{l}-1}}\) and quadratic activation \(\phi(z)=z^2\). If the layer widths satisfy \(n_\mathscr{l} \geq 3N^{2^\mathscr{l}}\), where \(N\) is the number of data points, then every level set of the sum-of-squares loss has a single connected component. For a two-layer net with a single quadratic activation layer this condition simplifies to \(n_1>2N\).
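The reason the quadratic activation is so much better behaved can be seen in the two-layer case (a standard rewriting, included here for intuition): \[H(\mathbf{x};\theta)=\sum_{j=1}^{n_1} w^{(2)}_j\bigl(\mathbf{w}^{(1)\top}_j\mathbf{x}\bigr)^2=\mathbf{x}^\top\Bigl(\sum_{j=1}^{n_1} w^{(2)}_j\,\mathbf{w}^{(1)}_j\mathbf{w}^{(1)\top}_j\Bigr)\mathbf{x}=\mathbf{x}^\top M(\theta)\,\mathbf{x},\] where \(\mathbf{w}^{(1)\top}_j\) is the \(j\)-th row of \(W^{(1)}\). The network therefore depends on \(\theta\) only through the symmetric matrix \(M(\theta)\), in which the sum-of-squares loss is convex; once the width is large enough that this reparameterization has no "holes" (surjectivity onto the relevant linear space, or the data-dependent counts above), spurious valleys can be ruled out.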
2. Manifold of Global Minimizers
Theorem (Cooper, 2021) Let \(H(\mathbf{x}; \theta)\) be a DNN from \(\mathbb{R}^{n_0}\) to \(\mathbb{R}^r\) with smooth nonlinear activation \(\phi(\cdot)\) and \(d\) trainable weight and bias parameters \(\theta \in \mathbb{R}^d\), let the loss function over \(n\) distinct data elements be defined as \[\mathcal{L}=(2n)^{-1}\sum_{\mu=1}^n \|H(\mathbf{x}_\mu; \theta)-\mathbf{y}_\mu\|^2_2,\] and let \(\Omega_{\mathcal{L}}^*(0)=\{\theta: \mathcal{L}(\theta; X, Y)=0\}\) be the set of parameters for which the DNN exactly fits the \(n\) data elements. Then, subject to a possibly arbitrarily small perturbation, the set \(\Omega_\mathcal{L}^*(0)\) is a smooth \((d-rn)\)-dimensional sub-manifold (possibly empty) of \(\mathbb{R}^d\).
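The dimension count can be read off from a regular-value argument (a heuristic sketch; the content of the theorem is that it holds after an arbitrarily small perturbation): define \(F(\theta)=\bigl(H(\mathbf{x}_1;\theta)-\mathbf{y}_1,\dots,H(\mathbf{x}_n;\theta)-\mathbf{y}_n\bigr)\in\mathbb{R}^{rn}\), so that \(\Omega^*_\mathcal{L}(0)=F^{-1}(0)\). When \(0\) is a regular value of \(F\), the preimage theorem gives a smooth submanifold of dimension \(d-rn\). For example, a network with \(d=10{,}000\) parameters and scalar output \(r=1\) that exactly fits \(n=1{,}000\) data points generically has a \(9{,}000\)-dimensional manifold of global minimizers.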
3. Summary
The loss landscape of a DNN can be non-convex and hence difficult to optimize.
The number of connected components of a level set of the loss can be analyzed, and in some settings there is a single component, which greatly aids optimization.
Increasing the width of a DNN can improve the loss landscape.
The local shape of a random net can be analyzed, showing that near a minimum the Hessian has only non-negative eigenvalues.
When the number of trainable parameters exceeds the product of the number of data points and the output dimension, DNNs with smooth nonlinear activations that exactly fit the data have a smooth manifold of global minimizers of known dimension.
4. Improvements
A larger training batch size narrows (sharpens) the loss landscape around the minima it finds, while weight decay (adding a penalty on \(\|\theta\|\) to the loss) broadens it; standard forms of weight decay and of a residual block are written out at the end of this section.
Adding skip connections, as in residual networks, can greatly smooth the loss landscape.
Batch normalization can help train parameters in bulk and thereby improve the training rate (though it is superfluous for expressivity).
CNNs can even be convexified, which may limit overall accuracy but ensures ease of training regardless of initialization.
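For reference, the first two of these modifications have standard forms (textbook definitions rather than results from the works summarized above; \(\alpha>0\) is a generic penalty coefficient): weight decay optimizes \[\mathcal{L}_{\mathrm{wd}}(\theta)=\mathcal{L}(\theta; X, Y)+\frac{\alpha}{2}\|\theta\|_2^2,\] while a residual (skip-connection) block replaces \(\mathbf{h}^{(\mathscr{l})}=\phi(W^{(\mathscr{l})}\mathbf{h}^{(\mathscr{l}-1)})\) with \[\mathbf{h}^{(\mathscr{l})}=\mathbf{h}^{(\mathscr{l}-1)}+\phi\bigl(W^{(\mathscr{l})}\mathbf{h}^{(\mathscr{l}-1)}\bigr)\] (when the layer dimensions match), so that the identity path keeps gradients well scaled and, empirically, flattens the landscape.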