Minimization

The training (also known as the learning or optimisation phase) of neural networks is in most cases carried out using some variant of gradient descent, such as stochastic gradient descent, with the gradients computed via back-propagation. In these methods, the determination of the fit parameters (namely the weights and thresholds of the NN) requires the evaluation of the gradients of \chi^2, that is,

(1)   \begin{equation*} \frac{\partial \chi^2}{\partial w_{ij}^{(l)}} \,\mbox{,} \quad \frac{\partial \chi^2}{\partial \theta_{i}^{(l)}} \,\mbox{.} \end{equation*}
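To fix ideas, here is a minimal Python sketch of a gradient-descent update driven by such \chi^2 gradients, for a toy two-parameter linear model. The pseudo-data, the model, the analytic gradient and the learning rate are all placeholders chosen for illustration, and bear no relation to the actual NNPDF fitting code.

import numpy as np

# Toy pseudo-data with uncorrelated uncertainties (illustration only).
rng = np.random.default_rng(0)
x_data = np.linspace(0.1, 1.0, 20)
y_data = 2.0 * x_data + 0.5 + rng.normal(0.0, 0.1, x_data.size)
sigma = np.full_like(x_data, 0.1)

def model(params, x):
    """Stand-in for the NN prediction; here simply w*x + b."""
    w, b = params
    return w * x + b

def chi2(params):
    residuals = (model(params, x_data) - y_data) / sigma
    return np.sum(residuals**2)

def chi2_grad(params):
    """Analytic gradient of chi^2 with respect to the fit parameters (w, b)."""
    r = (model(params, x_data) - y_data) / sigma**2
    return np.array([2.0 * np.sum(r * x_data), 2.0 * np.sum(r)])

params = np.array([0.0, 0.0])
learning_rate = 2e-4
for step in range(2000):
    params -= learning_rate * chi2_grad(params)  # plain gradient-descent update

print(params, chi2(params))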

Computing these gradients in the NNPDF case would be quite involved, due to the non-linear relation between the fitted experimental data and the input PDFs, which proceeds through convolutions with both the DGLAP evolution kernels and the hard-scattering partonic cross-sections, as encoded in the optimised APFELgrid fast interpolation strategy.

The theory prediction for a collider cross-section in terms of the NN parameters reads

(2)   \begin{equation*} \sigma^{\rm \small (th)}\lp \{ \omega,\theta\}\rp = \widehat{\sigma}_{ij}(Q^2)\otimes \Gamma_{ij,kl} (Q^2,Q_0^2) \otimes q_k\lp Q_0,\{ \omega,\theta\} \rp \otimes q_l \lp Q_0 ,\{ \omega,\theta\}\rp \end{equation*}

where \otimes indicates a convolution over x, \widehat{\sigma}_{ij} and \Gamma_{ij,kl} stand for the hard-scattering cross-sections and the DGLAP evolution kernels, respectively, and a sum over repeated flavour indices is understood.
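To see schematically how this expression collapses into the compact form below, one can expand the PDFs over a set of interpolating functions I_a(x) defined on the x-grid. The notation here is deliberately simplified (flavour bookkeeping is suppressed) and is meant only as a sketch of the FastKernel idea, not as the precise APFELgrid expressions:

\begin{equation*} q\lp x, Q_0\rp \,\simeq\, \sum_{a=1}^{n_x} I_a(x)\, q\lp x_a, Q_0\rp \,\mbox{,} \qquad {\tt FK}_{k,ij,ab} \,\sim\, \Big[\, \widehat{\sigma}\otimes \Gamma \otimes \Gamma \otimes I_a\, I_b \,\Big]_{k,ij} \,\mbox{,} \end{equation*}

so that all the x-integrals involve only quantities that do not depend on the fitted PDF values at the grid points, and can therefore be pre-computed once and for all.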

In the APFELgrid approach, this cross-section can then be expressed in a much more compact way as

(3)   \begin{equation*} \sigma^{\rm \small (th)}\lp \{ \omega,\theta\}\rp = \sum_{i,j=1}^{n_f}\sum_{a,b=1}^{n_x}{\tt FK}_{k,ij,ab} \cdot q_i\lp x_a,Q_0, \{ \omega,\theta\}\rp \cdot q_j\lp x_b,Q_0, \{ \omega,\theta\}\rp \,, \end{equation*}

where now all the perturbative information is pre-computed and stored in the {\tt FK}_{k,ij,ab} interpolation tables, the index k labels the data point, and a, b run over a grid in x.
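In practice, the double sum in Eq.~(3) is just a tensor contraction. The following NumPy sketch illustrates it with random placeholder arrays and made-up shapes (n_dat data points, n_f flavours, n_x grid points); it is not the actual NNPDF/APFELgrid code.

import numpy as np

# Placeholder shapes and random arrays, for illustration only.
n_dat, n_f, n_x = 5, 7, 30
rng = np.random.default_rng(1)
fk_table = rng.normal(size=(n_dat, n_f, n_f, n_x, n_x))   # FK_{k,ij,ab}
pdf_grid = rng.normal(size=(n_f, n_x))                     # q_i(x_a, Q0)

# sigma_k = sum_{i,j,a,b} FK_{k,ij,ab} * q_i(x_a) * q_j(x_b)
sigma_th = np.einsum("kijab,ia,jb->k", fk_table, pdf_grid, pdf_grid)
print(sigma_th.shape)   # one prediction per data point: (n_dat,)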

The convoluted relation between \sigma^{(\rm th)} and the NN parameters in Eq.~(3) is what makes the implementation of gradient descent methods challenging.

In the proton NNPDF global analyses, both in the polarised and the unpolarised case, the NN training is instead carried out by means of Genetic Algorithms (GAs). GAs are based on a combination of deterministic and stochastic ingredients, which makes them particularly well suited to exploring complex parameter spaces without getting stuck in local minima, and they do not require knowledge of the \chi^2 gradients in Eq.~(1), but only of its local values.
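As a rough illustration of this gradient-free strategy, the snippet below implements a deliberately simplified mutation-and-selection loop in Python (no crossover, a single surviving candidate per generation). The objective, the population size and the mutation schedule are toy choices and do not correspond to the GA settings used in the NNPDF fits; the point is only that the loop requires \chi^2 values, never its gradients.

import numpy as np

rng = np.random.default_rng(2)

def chi2(params):
    """Placeholder objective; in a real fit this would be the chi^2 of the
    theory predictions of Eq. (3) against the experimental data."""
    return np.sum((params - 1.0) ** 2)

n_params, n_mutants, n_generations = 10, 80, 200
best = rng.normal(size=n_params)    # initial candidate solution
step = 0.5                          # mutation size

for generation in range(n_generations):
    # Propose a population of mutants around the current best solution.
    mutants = best + step * rng.normal(size=(n_mutants, n_params))
    scores = np.array([chi2(m) for m in mutants])
    # Selection: keep the best mutant if it improves on the current solution.
    if scores.min() < chi2(best):
        best = mutants[scores.argmin()]
    step *= 0.99                    # slowly shrink the mutation size

print(chi2(best))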

In the figure above we display a schematic representation of how the CMA-ES (Covariance Matrix Adaptation Evolution Strategy) algorithm, an evolutionary strategy closely related to GAs, behaves in a toy scenario, showing how it manages to approach the global minimum while at the same time stochastically sampling the region around it. Starting from a random population of solutions far from the minimum (white region), the spread (variance) of the population first increases, while at the same time the average (centre) solution moves closer to the minimum. As the number of generations increases, the average solution remains close to the minimum while the variance is reduced significantly, indicating that the algorithm has converged.
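The same qualitative behaviour can be reproduced on a toy problem with the third-party cma Python package (an assumption made here purely for illustration: it is not part of the NNPDF code base, and the two-dimensional objective below is an arbitrary test function with its minimum at (1, 1)).

import cma  # third-party package, installable with `pip install cma`

def objective(x):
    """Toy Rosenbrock-like objective with a single global minimum at (1, 1)."""
    return (1.0 - x[0]) ** 2 + 100.0 * (x[1] - x[0] ** 2) ** 2

# Start far from the minimum, with a broad initial search distribution.
es = cma.CMAEvolutionStrategy(x0=[-2.0, 2.0], sigma0=1.0)

while not es.stop():
    candidates = es.ask()                        # sample a population
    fitnesses = [objective(x) for x in candidates]
    es.tell(candidates, fitnesses)               # adapt the mean and covariance

print(es.result.xbest)  # should end up close to (1, 1)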