
DNN Initialization

1. Xavier Weight Initialization

Xavier Glorot and Yoshua Bengio (2010) considered randomly initialized networks and the associated variance of the pre-activation values and of the gradients as they pass from layer to layer. Glorot and Bengio noted that for symmetric \(\phi(\cdot)\) such as \(\tanh(\cdot)\), the hidden layer values are approximately Gaussian with a variance depending on the variance of the weight matrices \(W^{(i)}\).

In particular, if \(\sigma_W\) is selected appropriately (the normalized initialization \(\sigma_W^2=2/(n_{in}+n_{out})\), i.e. approximately \(1/n\) for equal layer widths) then the variance of the hidden layer values is approximately constant through the layers (bottom plot of Figure 1); if \(\sigma_W\) is too small (e.g. the standard initialization \(\sigma_W^2=1/(3n)\)) then the variance of the hidden layer values converges towards zero with depth (top plot of Figure 1).

Figure 1: Activation values normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization (Glorot and Bengio, 2010).

Similarly, the gradient of the loss function used for training shows similar Gaussian behavior depending on \(\sigma_W\).

Figure 2: Back-propagated gradients normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization (Glorot and Bengio, 2010).

The suggested Xavier (normalized) initialization \(\sigma_W^2=2/(n_{in}+n_{out})\) follows from balancing the variance of the hidden layer values and of the back-propagated gradients so that both remain approximately constant through depth.
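
To make the effect concrete, the following minimal numpy sketch (not from the paper; the equal layer widths, depth, and \(\tanh\) activation are illustrative choices) propagates a random input through a deep network and prints the per-layer activation variance under the normalized and standard initializations:

```python
import numpy as np

def layer_variances(n=400, depth=10, init="normalized", seed=0):
    """Propagate a random input through a deep tanh network and
    record the empirical variance of the activations at each layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    variances = []
    for _ in range(depth):
        if init == "normalized":    # Xavier/normalized init: Var(W_ij) = 2/(n_in + n_out)
            std = np.sqrt(2.0 / (n + n))
        else:                       # "standard" init: Var(W_ij) = 1/(3 n)
            std = np.sqrt(1.0 / (3.0 * n))
        W = rng.normal(0.0, std, size=(n, n))
        x = np.tanh(W @ x)
        variances.append(x.var())
    return variances

print("normalized:", np.round(layer_variances(init="normalized"), 3))
print("standard  :", np.round(layer_variances(init="standard"), 3))
```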

2. Random DNN Recursion Correlation Map

Let \(f_{NN}(\mathbf{x})\) denote a random Gaussian DNN: \[\mathbf{h}^{(\mathscr{l})}=W^{(\mathscr{l})}\mathbf{z}^{(\mathscr{l})}+\mathbf{b}^{(\mathscr{l})}, \quad \mathbf{z}^{(\mathscr{l}+1)}=\phi(\mathbf{h}^{(\mathscr{l})}), \quad \mathscr{l}=0, \ldots, L-1,\] which takes as input the vector \(\mathbf{x}=\mathbf{z}^{(0)}\), and is parameterized by random weight matrices \(W^{(\mathscr{l})}\) and bias vectors \(\mathbf{b}^{(\mathscr{l})}\) with entries sampled i.i.d. from the Gaussian distributions \(\mathcal{N}(0, \sigma_W^2/n_{\mathscr{l}-1})\) and \(\mathcal{N}(0, \sigma_\mathbf{b}^2)\) respectively, where \(n_{\mathscr{l}-1}\) is the dimension of the layer input \(\mathbf{z}^{(\mathscr{l})}=\phi(\mathbf{h}^{(\mathscr{l}-1)})\) (the \(1/n_{\mathscr{l}-1}\) scaling keeps \(\sigma_W\) width-independent). Define the (normalized squared) \(\ell^2\) length of the pre-activation output of hidden layer \(\mathscr{l}\), \(\mathbf{h}^{(\mathscr{l})} \in \mathbb{R}^{n_\mathscr{l}}\), as \[q^{(\mathscr{l})}=n_\mathscr{l}^{-1}\|\mathbf{h}^{(\mathscr{l})}\|^2_{\ell^2}:=\frac{1}{n_\mathscr{l}}\sum_{i=1}^{n_\mathscr{l}} (\mathbf{h}^{(\mathscr{l})}(i))^2,\] which is the sample mean of the random entries \((\mathbf{h}^{(\mathscr{l})}(i))^2\), each of which is identically distributed.

The norm of the hidden layer output \(q^{(\mathscr{l})}\) has an expected value over the random draws of \(W^{(\mathscr{l})}\) and \(\mathbf{b}^{(\mathscr{l})}\) which satisfies \(\mathbb{E}[q^{(\mathscr{l})}]=\mathbb{E}[(\mathbf{h}^{(\mathscr{l})}(i))^2]\), which, since \(\mathbf{h}^{(\mathscr{l})}=W^{(\mathscr{l})}\phi(\mathbf{h}^{(\mathscr{l}-1)})+\mathbf{b}^{(\mathscr{l})}\), is \[\begin{aligned} \mathbb{E}[q^{(\mathscr{l})}]&=\mathbb{E}[(W_i^{(\mathscr{l})}\phi(\mathbf{h}^{(\mathscr{l}-1)}))^2]+\mathbb{E}[(b_i^{(\mathscr{l})})^2] \\&=\sigma_W^2n_{\mathscr{l}-1}^{-1}\sum_{j=1}^{n_{\mathscr{l}-1}} \phi(h_j^{(\mathscr{l}-1)})^2+\sigma_\mathbf{b}^2 \end{aligned}\] where \(W_i^{(\mathscr{l})}\) denotes the \(i\)th row of \(W^{(\mathscr{l})}\) and the expectation is conditional on \(\mathbf{h}^{(\mathscr{l}-1)}\).

Approximating \(h_j^{(\mathscr{l}-1)}\) as Gaussian with variance \(q^{(\mathscr{l}-1)}\), the sample mean is replaced by an expectation: \[n_{\mathscr{l}-1}^{-1}\sum_{j=1}^{n_{\mathscr{l}-1}} \phi(h_j^{(\mathscr{l}-1)})^2 \approx \mathbb{E}[\phi(h^{(\mathscr{l}-1)})^2]=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty \phi\left(\sqrt{q^{(\mathscr{l}-1)}}z\right)^2e^{-z^2/2}\text{d}z,\] which gives a recursive map of \(q^{(\mathscr{l})}\) between layers \[q^{(\mathscr{l})}=\sigma_\mathbf{b}^2+\sigma_W^2\frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty \phi\left(\sqrt{q^{(\mathscr{l}-1)}}z\right)^2e^{-z^2/2}\text{d}z=:\mathcal{V}(q^{(\mathscr{l}-1)} \mid \sigma_W, \sigma_\mathbf{b}, \phi(\cdot)).\] Note that the integral is larger for \(\phi(x)=|x|\) than for ReLU, which in turn is larger than for \(\phi(x)=\tanh(x)\), indicating that smaller \(\sigma_W\) and \(\sigma_\mathbf{b}\) are needed to ensure \(q^{(\mathscr{l})}\) has a finite nonzero limit \(q^*\). As shown in Figure 3, the fixed points are stable.
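
As an illustration of the length map, the sketch below (assumptions: \(\phi=\tanh\), Gauss-Hermite quadrature for the Gaussian integral, and illustrative values of \(\sigma_W\) and \(\sigma_\mathbf{b}\)) iterates \(\mathcal{V}\) to its fixed point \(q^*\):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite quadrature nodes/weights for the weight function e^{-z^2/2}
Z, WGT = hermegauss(101)
WGT = WGT / np.sqrt(2 * np.pi)   # normalize so the weights integrate against the N(0,1) density

def length_map(q, sigma_w, sigma_b, phi=np.tanh):
    """V(q | sigma_w, sigma_b, phi) = sigma_b^2 + sigma_w^2 * E[phi(sqrt(q) Z)^2]."""
    return sigma_b**2 + sigma_w**2 * np.sum(WGT * phi(np.sqrt(q) * Z) ** 2)

# iterate the map from a large initial length; it converges to the fixed point q*
q = 10.0
for _ in range(100):
    q = length_map(q, sigma_w=1.3, sigma_b=0.1)
print("q* ~", q)
```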

Figure 3: Dynamics of the squared length \(q^\mathscr{l}\) for a sigmoidal network (\(\phi(h)=\tanh(h)\)) with \(1000\) hidden units (Poole et al., 2016).

A single input \(\mathbf{x}\) therefore has a hidden pre-activation output whose expected length converges to a fixed value. Consider next the map governing the angle between the hidden layer pre-activations of two distinct inputs \(\mathbf{x}^{(0, a)}\) and \(\mathbf{x}^{(0, b)}\): \[q_{ab}^{(\mathscr{l})}=n_\mathscr{l}^{-1}\sum_{i=1}^{n_\mathscr{l}} h_i^{(\mathscr{l})}(\mathbf{x}^{(0, a)})h_i^{(\mathscr{l})}(\mathbf{x}^{(0, b)}).\] Replacing the average in the sum with the expected value gives \[q_{ab}^{(\mathscr{l})}=\sigma_\mathbf{b}^2+\sigma_W^2\mathbb{E}[\phi(\mathbf{h}^{(\mathscr{l}-1)}(\mathbf{x}^{(0, a)}))\phi(\mathbf{h}^{(\mathscr{l}-1)}(\mathbf{x}^{(0, b)}))]\] where, as before, \(h_i^{(\mathscr{l}-1)}\) is well modeled as Gaussian with expected length \(q^{(\mathscr{l}-1)}\), which converges to the fixed point \(q^*\).

Define the correlation (cosine of the angle) between the hidden layers of the two inputs as \(c^{(\mathscr{l})}=q_{ab}^{(\mathscr{l})}/q^*\). Writing the expectation as an integral, we have \[q_{ab}^{(\mathscr{l})}=\mathcal{C}(c^{(\mathscr{l}-1)} \mid \sigma_W, \sigma_\mathbf{b}, \phi(\cdot)):=\sigma_\mathbf{b}^2+\sigma_W^2\iint \text{D}z_1\text{D}z_2\,\phi(u_1)\phi(u_2)\] where \(\text{D}z=(2\pi)^{-1/2}e^{-z^2/2}\text{d}z\), \(u_1=\sqrt{q^*}z_1\) and \(u_2=\sqrt{q^*}\left(c^{(\mathscr{l}-1)}z_1+\sqrt{1-(c^{(\mathscr{l}-1)})^2}z_2\right)\). Normalizing by \(q^*\), we obtain the correlation map \[\rho^{(\mathscr{l})}:=c^{(\mathscr{l})}=R(\rho^{(\mathscr{l}-1)}; \sigma_W, \sigma_\mathbf{b}, \phi(\cdot)),\] where \(R(\rho)\) measures how the correlation between two inputs evolves through the layers. By construction \(R(1)=1\), since \(q_{ab}^{(\mathscr{l})} \to q^*\) as the two inputs converge to one another.
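
A sketch of the correlation map, under the same assumptions as the length-map sketch above (\(\phi=\tanh\), two-dimensional Gauss-Hermite quadrature, illustrative \(\sigma_W\) and \(\sigma_\mathbf{b}\)); it first iterates \(\mathcal{V}\) to obtain \(q^*\) and then tracks the correlation of two inputs through depth:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

Z, WGT = hermegauss(61)
WGT = WGT / np.sqrt(2 * np.pi)
Z1, Z2 = np.meshgrid(Z, Z)          # tensor-product grid for the double Gaussian integral
W2D = np.outer(WGT, WGT)

sigma_w, sigma_b, phi = 1.3, 0.1, np.tanh

# fixed point q* of the length map V
qstar = 1.0
for _ in range(200):
    qstar = sigma_b**2 + sigma_w**2 * np.sum(WGT * phi(np.sqrt(qstar) * Z) ** 2)

def corr_map(rho):
    """R(rho): correlation of two inputs after one layer, evaluated at the length fixed point q*."""
    u1 = np.sqrt(qstar) * Z1
    u2 = np.sqrt(qstar) * (rho * Z1 + np.sqrt(1 - rho**2) * Z2)
    q_ab = sigma_b**2 + sigma_w**2 * np.sum(W2D * phi(u1) * phi(u2))
    return q_ab / qstar

# track the correlation of two initially weakly correlated inputs through depth
rho = 0.3
for _ in range(20):
    rho = corr_map(rho)
print("correlation after 20 layers ~", rho)
```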

For \(\sigma_\mathbf{b}^2>0\), the correlation map satisfies \(R(0)>0\), so orthogonal inputs \(\mathbf{x}^{(0, a)}\) and \(\mathbf{x}^{(0, b)}\) become increasingly correlated with depth.

The stability of the fixed point \(\rho=1\) under perturbation is determined by the slope of \(R(\cdot)\) at \(\rho=1\): \[\chi:=\frac{\partial R(\rho)}{\partial \rho}\bigg|_{\rho=1}=\frac{\partial \rho^{(\mathscr{l})}}{\partial \rho^{(\mathscr{l}-1)}}\bigg|_{\rho=1}=\sigma_W^2\int \text{D}z\left[\phi'\left(\sqrt{q^*}z\right)^2\right].\] If \(\chi<1\), then \(\rho=1\) is locally stable and points which are sufficiently correlated all converge, with depth, to the same point; if \(\chi>1\), then \(\rho=1\) is unstable and nearby points become uncorrelated with depth. It is therefore preferable to choose \((\sigma_W, \sigma_\mathbf{b}, \phi(\cdot))\) such that \(\chi=1\), if possible.
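
The slope \(\chi\) can be evaluated with the same quadrature; a minimal sketch, with \(\phi=\tanh\) and illustrative values of \(q^*\) and \(\sigma_W\):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

Z, WGT = hermegauss(101)
WGT = WGT / np.sqrt(2 * np.pi)

def chi(qstar, sigma_w, dphi=lambda z: 1.0 - np.tanh(z) ** 2):
    """chi = sigma_w^2 * E[phi'(sqrt(q*) Z)^2], the slope of R at rho = 1."""
    return sigma_w**2 * np.sum(WGT * dphi(np.sqrt(qstar) * Z) ** 2)

# chi < 1: ordered phase (correlations converge); chi > 1: chaotic phase (nearby inputs decorrelate)
print(chi(qstar=0.5, sigma_w=1.0), chi(qstar=0.5, sigma_w=1.5))
```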

As shown in Figure 4, initializing on this edge-of-chaos curve (\(\chi=1\)) allows very deep networks to be trained.

Figure 4: Edge of chaos curves for nonlinear activations (Murray, Abrol, and Tanner, 2021).

3. Spectrum of the Jacobian

3.1. DNN Jacobian

The Jacobian of the network output with respect to its input is given by \[J=\frac{\partial \mathbf{z}^{(L)}}{\partial \mathbf{x}^{(0)}}=\prod_{\mathscr{l}=0}^{L-1} D^{(\mathscr{l})}W^{(\mathscr{l})}\] where \(D^{(\mathscr{l})}\) is diagonal with entries \(D_{ii}^{(\mathscr{l})}=\phi'(h_i^{(\mathscr{l})})\).
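
A quick numerical check of this product formula (a numpy sketch; the width, depth, weight scale \(1/\sqrt{n}\), and \(\phi=\tanh\) are illustrative assumptions): build \(J\) as the accumulated product of \(D^{(\mathscr{l})}W^{(\mathscr{l})}\) and compare one column against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 50, 5
Ws = [rng.normal(0, 1 / np.sqrt(n), (n, n)) for _ in range(L)]
bs = [rng.normal(0, 0.1, n) for _ in range(L)]

def forward(x):
    """Return the network output z^(L) and the diagonals phi'(h^(l)) of D^(l) for each layer."""
    z, Ds = x, []
    for W, b in zip(Ws, bs):
        h = W @ z + b
        Ds.append(1.0 - np.tanh(h) ** 2)   # phi'(h) for phi = tanh
        z = np.tanh(h)
    return z, Ds

x = rng.standard_normal(n)
z, Ds = forward(x)

# J = D^(L-1) W^(L-1) ... D^(0) W^(0), accumulated layer by layer
J = np.eye(n)
for W, D in zip(Ws, Ds):
    J = np.diag(D) @ W @ J

# finite-difference check of one column of J
i, eps = 3, 1e-6
e = np.zeros(n); e[i] = eps
print(np.allclose(J[:, i], (forward(x + e)[0] - z) / eps, atol=1e-4))
```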

The Jacobian bounds the local sensitivity of the DNN to input perturbations: \[\|f_{NN}(\mathbf{x}+\varepsilon)-f_{NN}(\mathbf{x})\|=\|J\varepsilon+\mathcal{O}(\|\varepsilon\|^2)\| \leq \|J\|_2\,\|\varepsilon\|+\mathcal{O}(\|\varepsilon\|^2),\] where \(\|J\|_2\) denotes the spectral norm (largest singular value) of \(J\).

For the sum of squares loss \(\mathcal{L}=\frac{1}{2}\|\mathbf{z}^{(L)}-\mathbf{y}\|^2\), the gradient is computed by back-propagating \(\delta_\mathscr{l}:=\nabla_{\mathbf{h}^{(\mathscr{l})}}\mathcal{L}\) through the layers: \[\delta_\mathscr{l}=D^{(\mathscr{l})}(W^{(\mathscr{l}+1)})^\top\delta_{\mathscr{l}+1} \quad \text{and} \quad \delta_{L-1}=D^{(L-1)}\nabla_{\mathbf{z}^{(L)}}\mathcal{L},\] which gives the closed form \[\delta_\mathscr{l}=\left(\prod_{k=\mathscr{l}}^{L-2} D^{(k)}(W^{(k+1)})^\top\right)D^{(L-1)}\nabla_{\mathbf{z}^{(L)}}\mathcal{L},\] and the resulting gradient \(\nabla_\theta \mathcal{L}\) has entries \[\frac{\partial \mathcal{L}}{\partial W^{(\mathscr{l})}}=\delta_{\mathscr{l}} \cdot (\mathbf{z}^{(\mathscr{l})})^\top \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(\mathscr{l})}}=\delta_{\mathscr{l}},\] so the gradients, like the Jacobian \(J\), involve products of the matrices \(D^{(k)}(W^{(k+1)})^\top\) across layers.
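
The recursion can be checked numerically; the sketch below (illustrative sizes, \(\phi=\tanh\), sum of squares loss against a random target \(\mathbf{y}\)) compares the gradient of \(W^{(0)}\) obtained from the \(\delta_\mathscr{l}\) recursion with a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 30, 4
Ws = [rng.normal(0, 1 / np.sqrt(n), (n, n)) for _ in range(L)]
bs = [rng.normal(0, 0.1, n) for _ in range(L)]
x, y = rng.standard_normal(n), rng.standard_normal(n)

def forward():
    """Return the activations z^(0..L) and the diagonals phi'(h^(l)) of D^(l)."""
    zs, Ds = [x], []
    for W, b in zip(Ws, bs):
        h = W @ zs[-1] + b
        Ds.append(1.0 - np.tanh(h) ** 2)
        zs.append(np.tanh(h))
    return zs, Ds

zs, Ds = forward()
loss = 0.5 * np.sum((zs[-1] - y) ** 2)

# backward recursion: delta_{L-1} = D^(L-1) (z^(L) - y), delta_l = D^(l) (W^(l+1))^T delta_{l+1}
deltas = [None] * L
deltas[L - 1] = Ds[L - 1] * (zs[-1] - y)
for l in range(L - 2, -1, -1):
    deltas[l] = Ds[l] * (Ws[l + 1].T @ deltas[l + 1])

grad_W0 = np.outer(deltas[0], zs[0])       # dL/dW^(0) = delta_0 (z^(0))^T

# finite-difference check of one entry of dL/dW^(0)
eps = 1e-6
Ws[0][2, 5] += eps
loss_eps = 0.5 * np.sum((forward()[0][-1] - y) ** 2)
print(np.isclose(grad_W0[2, 5], (loss_eps - loss) / eps, atol=1e-4))
```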

3.2. Stieltjes and \(\mathcal{S}\) Transform

For \(z \in \mathbb{C}-\mathbb{R}\), the Stieltjes transform \(G_\rho(z)\) of a probability distribution and its inverse are given by \[G_\rho(z)=\int_\mathbb{R} \frac{\rho(t)}{z-t}\text{d}t \quad \text{and} \quad \rho(\lambda)=-\pi^{-1}\lim_{\varepsilon \to 0_+} \text{Imag}(G_\rho(\lambda+i\varepsilon)).\] The Stieltjes transform and moment generating function are related by \[M_\rho(z):=zG_\rho(z)-1=\sum_{k=1}^\infty \frac{m_k}{z^k}\] and the \(\mathcal{S}\) transform is defined as \[S_\rho(z)=\frac{1+z}{zM_\rho^{-1}(z)}.\] The \(\mathcal{S}\) transform has the property that if \(\rho_1\) and \(\rho_2\) are freely independent, then \(S_{\rho_1\rho_2}=S_{\rho_1}S_{\rho_2}\).
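
As an illustration (not from the references), the empirical Stieltjes transform of a Wigner matrix, evaluated just above the real axis, recovers the semicircle density via the inversion formula; the matrix size and the offset \(\varepsilon\) below are illustrative choices:

```python
import numpy as np

# eigenvalues of an n x n Wigner matrix; its spectral density is the semicircle law on [-2, 2]
n = 2000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
eigs = np.linalg.eigvalsh((A + A.T) / np.sqrt(2 * n))

def stieltjes(z):
    """Empirical Stieltjes transform G(z) = (1/n) sum_i 1/(z - lambda_i)."""
    return np.mean(1.0 / (z - eigs))

# inversion formula: rho(lam) = -(1/pi) Im G(lam + i*eps) for small eps > 0
lam, eps = 1.0, 0.05                             # eps chosen to smooth over the finite sample
rho_hat = -np.imag(stieltjes(lam + 1j * eps)) / np.pi
rho_true = np.sqrt(4 - lam**2) / (2 * np.pi)     # semicircle density at lam
print(rho_hat, rho_true)
```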

The \(\mathcal{S}\) transform of \(JJ^\top\) is then given by \[\mathcal{S}_{JJ^\top}=\mathcal{S}_{D^2}^L\mathcal{S}_{W^\top W}^L,\] which can be computed through the moments \[M_{JJ^\top}(z)=\sum_{k=1}^\infty \frac{m_k}{z^k} \quad \text{and} \quad M_{D^2}(z)=\sum_{k=1}^\infty \frac{\mu_k}{z^k}\] where \(\displaystyle \mu_k=\int (2\pi)^{-1/2}\phi'\left(\sqrt{q^*}z\right)^{2k}e^{-z^2/2}\text{d}z\).

In particular, \(m_1=(\sigma_W^2\mu_1)^L\) and \(m_2=(\sigma_W^2\mu_1)^{2L}L(\mu_2\mu_1^{-2}+L^{-1}-1-s_1)\), where \(s_1\) depends on the distribution of the weights (see below) and \(\sigma_W^2\mu_1=\chi\) is the growth factor observed at the edge of chaos; as before, \(\chi=1\) is required to avoid exponential growth or decay of \(m_1=\chi^L\) with depth, and rapid convergence of correlations to their fixed point.
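
A sketch of how these quantities can be evaluated numerically (assumptions: \(\phi=\tanh\), Gaussian weights so \(s_1=-1\), and illustrative values of \(q^*\), \(\sigma_W\) and \(L\)):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

Z, WGT = hermegauss(201)
WGT = WGT / np.sqrt(2 * np.pi)

def mu(k, qstar, dphi):
    """mu_k = E[phi'(sqrt(q*) Z)^(2k)] for Z ~ N(0, 1)."""
    return np.sum(WGT * dphi(np.sqrt(qstar) * Z) ** (2 * k))

# illustrative values: tanh, an assumed fixed point q*, Gaussian weights (s_1 = -1)
dphi = lambda z: 1.0 - np.tanh(z) ** 2
qstar, sigma_w, L, s1 = 0.5, 1.05, 32, -1.0
mu1, mu2 = mu(1, qstar, dphi), mu(2, qstar, dphi)
chi = sigma_w**2 * mu1
m1 = chi**L
m2 = chi ** (2 * L) * L * (mu2 / mu1**2 + 1 / L - 1 - s1)
print("chi =", chi, " m1 =", m1, " var =", m2 - m1**2)
```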

3.3. Nonlinear Activation Stability

Recall that \(m_1=\chi^L\) is the mean of the spectrum of \(JJ^\top\), while at criticality (\(\chi=1\)) the variance of the spectrum of \(JJ^\top\) is given by \[\sigma_{JJ^\top}^2=m_2-m_1^2=L(\mu_2\mu_1^{-2}-1-s_1)\] where \(s_1=-1\) for \(W\) Gaussian and \(s_1=0\) for \(W\) orthogonal.
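
The Gaussian versus orthogonal distinction is easy to see empirically in the linear case \(\phi(h)=h\), where \(D^{(\mathscr{l})}=I\) and \(\mu_1=\mu_2=1\), so the predicted variance is \(L\) for Gaussian weights and \(0\) for orthogonal weights. A minimal numpy sketch (the width and depth are illustrative; orthogonal matrices are sampled via a sign-corrected QR decomposition):

```python
import numpy as np

n, L = 200, 10
rng = np.random.default_rng(0)

def random_orthogonal():
    """Haar-distributed orthogonal matrix via QR of a Gaussian matrix (sign-corrected)."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

def spectrum_stats(weights):
    J = np.eye(n)
    for W in weights:          # linear network: D^(l) = I, so J is just the product of the W^(l)
        J = W @ J
    eigs = np.linalg.eigvalsh(J @ J.T)
    return eigs.mean(), eigs.var()

gaussian = [rng.normal(0, 1 / np.sqrt(n), (n, n)) for _ in range(L)]   # sigma_W = 1
orthogonal = [random_orthogonal() for _ in range(L)]

print("Gaussian   (mean, var):", spectrum_stats(gaussian))    # var ~ L = 10 as n grows
print("orthogonal (mean, var):", spectrum_stats(orthogonal))  # J J^T = I exactly, so var = 0
```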

Pennington et al. (2018) found that for linear networks the fixed point equation is \(q^*=\sigma_W^2q^*+\sigma^2_\mathbf{b}\) and \((\sigma_W, \sigma_\mathbf{b})=(1, 0)\) is the only critical point; for ReLU networks the fixed point equation is \(q^*=\frac{1}{2}\sigma_W^2q^*+\sigma_\mathbf{b}^2\) and \((\sigma_W, \sigma_\mathbf{b})=(\sqrt{2}, 0)\) is the only critical point; Hard Tanh and Erf instead admit a curve of critical points \((\sigma_W, \sigma_\mathbf{b})\) along which \(\chi=1\).
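
As a quick check of the ReLU case: for \(\phi(x)=\max(0,x)\) we have \(\mathbb{E}[\phi(\sqrt{q}Z)^2]=\tfrac{1}{2}q\) and \(\mathbb{E}[\phi'(\sqrt{q}Z)^2]=\tfrac{1}{2}\), so \[q^*=\tfrac{1}{2}\sigma_W^2q^*+\sigma_\mathbf{b}^2 \quad \text{and} \quad \chi=\tfrac{1}{2}\sigma_W^2.\] Imposing \(\chi=1\) forces \(\sigma_W=\sqrt{2}\), and the fixed point equation then reduces to \(q^*=q^*+\sigma_\mathbf{b}^2\), which requires \(\sigma_\mathbf{b}=0\); hence the critical point \((\sqrt{2}, 0)\) is isolated.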

4. Controlling the Variance of the Jacobian Spectra

Definition    An activation function \(\phi: \mathbb{R} \to \mathbb{R}\) satisfying the following properties is called a scaled-bounded activation:

    (1) continuous;

    (2) odd, i.e., \(\phi(z)=-\phi(-z)\) for all \(z \in \mathbb{R}\);

    (3) linear around the origin and bounded, i.e., there exist \(a, k \in \mathbb{R}_{>0}\) s.t. \(\phi(z)=kz\) for all \(z \in [-a, a]\) and \(|\phi(z)| \leq ak\) for all \(z \in \mathbb{R}\);

    (4) twice differentiable at all points \(z \in \mathbb{R}-\mathcal{D}\), where \(\mathcal{D} \subset \mathbb{R}\) is a finite set; furthermore \(|\phi'(z)| \leq k\) for all \(z \in \mathbb{R}-\mathcal{D}\).

Theorem (Murray et al., 2021)     Let \(\phi\) be a scaled-bounded activation, let \(\chi_1:=\sigma_W^2\mathbb{E}[\phi'(\sqrt{q^*}Z)^2]=1\) with \(Z \sim \mathcal{N}(0,1)\), where \(q^*>0\) is a fixed point of \(\mathcal{V}_\phi\), and let \(\sigma_\mathbf{b}^2>0\). Let the input \(\mathbf{x}\) satisfy \(\|\mathbf{x}\|_2^2=q^*\). Then as \(y:=\sigma_\mathbf{b}^2/a^2 \to 0\), \[\max_{\rho \in [0, 1]} |R_{\phi, q^*}(\rho)-\rho| \to 0 \quad \text{and} \quad |\mu_2/\mu_1^2-1| \to 0.\]

This result is independent of the details of \(\phi(\cdot)\) outside its linear region \([-a, a]\). In practice, the best performance is observed with \(a \approx 3\), or preferably with \(a\) decreasing from about \(5\) to \(2\) during training.
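
A minimal sketch of such an activation (the function names are hypothetical; \(k=1\) gives a clipped-linear, hard-tanh-like nonlinearity), together with a check that \(\mu_2/\mu_1^2 \to 1\) as the linear region \([-a, a]\) captures most of the Gaussian mass at the fixed point:

```python
import numpy as np
from math import erf

def scaled_bounded(z, a=3.0, k=1.0):
    """Odd, continuous, linear with slope k on [-a, a], constant (+/- a*k) outside."""
    return k * np.clip(z, -a, a)

# For this activation phi'(z)^2 = k^2 on |z| < a and 0 outside, so with p = P(|sqrt(q*) Z| < a):
#   mu_1 = k^2 p,  mu_2 = k^4 p,  and  mu_2 / mu_1^2 = 1 / p  ->  1  as a / sqrt(q*) grows.
qstar, k = 1.0, 1.0
for a in (1.0, 2.0, 3.0, 5.0):
    p = erf(a / np.sqrt(2 * qstar))          # P(|Z| < a / sqrt(q*)) for Z ~ N(0, 1)
    mu1, mu2 = k**2 * p, k**4 * p
    print("a =", a, " mu2/mu1^2 =", mu2 / mu1**2)
```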

5. DNN Random Initialization Summary

  • The hidden layer pre-activations converge to a fixed expected length \(q^*\).

  • Poole et al. (2016) showed that the pre-activation output is well modeled as Gaussian with variance \(q^*\) determined by \((\sigma_W, \sigma_\mathbf{b}, \phi(\cdot))\). Moreover, the correlation between two inputs follows a similar map, with correlations converging to a fixed point; the behavior is determined in part by \(\chi\), and choosing \(\chi=1\) avoids both correlated inputs collapsing to the same point and nearby points diverging.

  • Very deep networks can be especially hard to train for activations with unfavorable initializations; e.g., ReLU with \(\chi=1\) requires \((\sigma_W, \sigma_\mathbf{b})=(\sqrt{2}, 0)\).

  • Pennington et al. (2018) showed more generally how to compute the moments of the Jacobian spectrum, where \(\chi=1\) is needed to avoid exponential growth or shrinkage of gradients with depth.

6. References

Xavier Glorot and Yoshua Bengio. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
 
Michael Murray, Vinayak Abrol, and Jared Tanner. Activation Function Design for Deep Networks: Linearity and Effective Initialisation. arXiv Preprint arXiv:2105.07741, 2021.
 
Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. The Emergence of Spectral Universality in Deep Networks. In 21st International Conference on Artificial Intelligence and Statistics, 2018.
 
Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential Expressivity in Deep Neural Networks through Transient Chaos. In 30th Conference on Neural Information Processing Systems, 2016.