Yesterday, Yann was talking at the Challenge in Optimization for Data Science workshop. One of his slides featured the distribution of losses across different neural network sizes (each color representing a specific network size; for each size, several randomly initialized networks were trained). The figure above was extracted from [1], but it is the one that was shown in the slide. The crux of the argument was that, by becoming larger, neural networks trained from initially random weights see a concentration of their final losses: once the networks become large enough, the non-convexity is not that bad, because even the local minima are in fact not far from the global minimum. Roughly speaking, enlarging or lifting the space of hyperparameters allows the loss to converge toward zero while, at the same time, these losses become more and more concentrated. Understanding how we can train these networks is important, because we currently have no idea how to really streamline the process, nor any particular understanding of the sampling bounds. It turns out that I had a very similar discussion with the folks at Kalray, who were presenting their architectures to the NeuroSTIC people, who happened to be holding their annual meeting just a few meters away.
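The concentration claim is easy to probe numerically. Here is a minimal sketch (plain NumPy on toy data; all names and hyperparameters are mine, not anything from the slide or the paper) that trains a small two-layer ReLU network at two widths from several random initializations and compares the mean and spread of the final training losses:

```python
import numpy as np

def train_relu_net(width, seed, X, y, lr=0.01, steps=400):
    """Full-batch gradient descent on a two-layer ReLU network, squared loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, width))
    w2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=width)
    for _ in range(steps):
        h = np.maximum(X @ W1, 0.0)           # hidden ReLU activations
        err = h @ w2 - y                      # residuals
        w2_grad = h.T @ err / n
        h_grad = np.outer(err, w2) * (h > 0)  # backprop through the ReLU
        W1_grad = X.T @ h_grad / n
        w2 -= lr * w2_grad
        W1 -= lr * W1_grad
    return 0.5 * np.mean((np.maximum(X @ W1, 0.0) @ w2 - y) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sin(X @ rng.normal(size=10))           # a fixed toy target

spread = {}
for width in (4, 64):
    losses = [train_relu_net(width, s, X, y) for s in range(8)]
    spread[width] = (float(np.mean(losses)), float(np.std(losses)))
```

On this kind of toy problem one can then compare `spread[4]` and `spread[64]` across many seeds; the interesting question raised by the slide is how the standard deviation of final losses behaves as the width grows.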

Coming back to this lifting argument, here are two papers that mention that paper ([1] The Loss Surfaces of Multilayer Networks by Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun), and both take the path of tensorization as a way to do the lifting. First there is [3] Generalization Bounds for Neural Networks through Tensor Factorization by Majid Janzamin, Hanie Sedghi, and Anima Anandkumar, and then there is [2] Global Optimality in Tensor Factorization, Deep Learning, and Beyond by Benjamin D. Haeffele and Rene Vidal.

From [3] one can read:

*Recently, Choromanska et al. (2015) analyze the loss surface of a multi-layer ReLU network by relating it to a spin glass system. They make several assumptions such as variable independence for the input, redundancy in network parameterization and so on. They show that under these assumptions, the lowest critical values of the random loss function form a layered structure, and the number of local minima outside that band diminishes exponentially with the network size.*

Aside from the sampling bound on training that they give, their approach is embarrassingly parallel, which has a big impact on the eventual hardware it will run on. From [2] one can read:
*Additionally, we have also shown that if the size of the network is allowed to be large enough then for any initialization a global minimizer can be found from purely local descent, and thus local minima are all equivalent. This is a similar conclusion to the work of Choromanska et al. (2014), who analyzed the problem from a statistical standpoint and showed that with sufficiently large networks and appropriate assumptions about the distributions of the data and network weights, then with high probability any family of networks learned with different initializations will have similar objective values, but we note that our results allow for a well defined set of conditions which will be sufficient to guarantee the property. Finally, many modern large scale networks do not use traditional regularization on the network weight parameters such as an l1 or l2 norm during training and instead rely on alternative forms of regularization such as dropout as it tends to achieve better performance in practice (Srivastava et al., 2014; Krizhevsky et al., 2012; Wan et al., 2013). Given our commentary above regarding the critical importance of balancing the degree of homogeneity between the mapping and the regularizer, an immediate prediction of our analysis is that simply ensuring that the degrees of homogeneity are balanced could be a significant factor in improving the performance of deep networks.*
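The homogeneity balance they mention has a familiar special case in matrix factorization, where the product map $(U,V) \mapsto UV^{\top}$ is 2-homogeneous and the matching 2-homogeneous regularizer $\|U\|_F^{2} + \|V\|_F^{2}$ recovers the variational form of the nuclear norm (a standard fact from the convex relaxation literature):

$$\|X\|_{*} \;=\; \min_{UV^{\top} = X} \; \tfrac{1}{2}\left(\|U\|_F^{2} + \|V\|_F^{2}\right)$$

This is the balance that makes the nonconvex factorized problem agree with a convex one, and it is that phenomenon which [2] generalizes.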

I can't wait to see how this discussion on mapping and regularizer will turn out. Without further ado, here are the abstracts:

[1] The Loss Surfaces of Multilayer Networks
Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun
We study the connection between the highly non-convex loss function of a
simple model of the fully-connected feed-forward neural network and the
Hamiltonian of the spherical spin-glass model under the assumptions of: i)
variable independence, ii) redundancy in network parametrization, and iii)
uniformity. These assumptions enable us to explain the complexity of the fully
decoupled neural network through the prism of the results from random matrix
theory. We show that for large-size decoupled networks the lowest critical
values of the random loss function form a layered structure and they are
located in a well-defined band lower-bounded by the global minimum. The number
of local minima outside that band diminishes exponentially with the size of the
network. We empirically verify that the mathematical model exhibits similar
behavior as the computer simulations, despite the presence of high dependencies
in real networks. We conjecture that both simulated annealing and SGD converge
to the band of low critical points, and that all critical points found there
are local minima of high quality measured by the test error. This emphasizes a
major difference between large- and small-size networks where for the latter
poor quality local minima have non-zero probability of being recovered.
Finally, we prove that recovering the global minimum becomes harder as the
network size increases and that it is in practice irrelevant as global minimum
often leads to overfitting.
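As a toy illustration of that "band of low critical points" (this is only a caricature in NumPy, not the paper's construction or scaling), one can run projected gradient descent on a random 3-spin Hamiltonian over the sphere from many random initializations and look at where the final energies land:

```python
import numpy as np

def descend_spin_glass(n=20, n_inits=8, steps=500, lr=0.01, seed=0):
    """Projected gradient descent on a random 3-spin Hamiltonian
    H(w) = sum_{ijk} J_ijk w_i w_j w_k, constrained to ||w||_2 = sqrt(n)."""
    rng = np.random.default_rng(seed)
    J = rng.normal(size=(n, n, n)) / n       # Gaussian couplings, scaled down
    finals = []
    for _ in range(n_inits):
        w = rng.normal(size=n)
        w *= np.sqrt(n) / np.linalg.norm(w)  # random start on the sphere
        for _ in range(steps):
            # ambient gradient of H: three terms from the product rule
            g = (np.einsum('ijk,j,k->i', J, w, w)
                 + np.einsum('ijk,i,k->j', J, w, w)
                 + np.einsum('ijk,i,j->k', J, w, w))
            w = w - lr * g
            w *= np.sqrt(n) / np.linalg.norm(w)  # retract back to the sphere
        finals.append(float(np.einsum('ijk,i,j,k->', J, w, w, w)))
    return np.array(finals)

energies = descend_spin_glass()
```

Comparing the mean and spread of `energies` across initializations is the crude analogue of the layered structure the abstract describes: independent descents end up at low critical values rather than scattered arbitrarily.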

[2] Global Optimality in Tensor Factorization, Deep Learning, and Beyond
Benjamin D. Haeffele, Rene Vidal
Techniques involving factorization are found in a wide range of applications
and have enjoyed significant empirical success in many fields. However, common
to a vast majority of these problems is the significant disadvantage that the
associated optimization problems are typically non-convex due to a multilinear
form or other convexity destroying transformation. Here we build on ideas from
convex relaxations of matrix factorizations and present a very general
framework which allows for the analysis of a wide range of non-convex
factorization problems - including matrix factorization, tensor factorization,
and deep neural network training formulations. We derive sufficient conditions
to guarantee that a local minimum of the non-convex optimization problem is a
global minimum and show that if the size of the factorized variables is large
enough then from any initialization it is possible to find a global minimizer
using a purely local descent algorithm. Our framework also provides a partial
theoretical justification for the increasingly common use of Rectified Linear
Units (ReLUs) in deep neural networks and offers guidance on deep network
architectures and regularization strategies to facilitate efficient
optimization.

[3] Generalization Bounds for Neural Networks through Tensor Factorization
Majid Janzamin, Hanie Sedghi, Anima Anandkumar
Training neural networks is a challenging non-convex optimization problem,
and backpropagation or gradient descent can get stuck in spurious local optima.
We propose a novel algorithm based on tensor decomposition for training a
two-layer neural network. We prove efficient generalization bounds for our
proposed method, with a polynomial sample complexity in the relevant
parameters, such as input dimension and number of neurons. While learning
arbitrary target functions is NP-hard, we provide transparent conditions on the
function and the input for generalizability. Our training method is based on
tensor decomposition, which provably converges to the global optimum, under a
set of mild non-degeneracy conditions. It consists of simple embarrassingly
parallel linear and multi-linear operations, and is competitive with standard
stochastic gradient descent (SGD), in terms of computational complexity. Thus,
for the first time, we have a computationally efficient method with guaranteed
generalization bounds for training neural networks.
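Since both follow-ups lean on tensor factorization, a generic example may help fix ideas: a rank-R CP decomposition of a 3-way tensor computed by alternating least squares. This is the textbook method, not the moment-based algorithm of [3]; all function names here are mine:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: rows indexed by `mode`, columns by the other axes (C order)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Kronecker product; row index runs over (i, j) with i major."""
    r = A.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, r)

def cp_als(T, rank, iters=200, seed=0):
    """Rank-`rank` CP decomposition of a 3-way tensor by alternating least squares."""
    rng = np.random.default_rng(seed)
    factors = [rng.normal(size=(s, rank)) for s in T.shape]
    for _ in range(iters):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])  # ordering matches C-order unfolding
            factors[mode] = np.linalg.lstsq(kr, unfold(T, mode).T, rcond=None)[0].T
    return factors

# build an exactly rank-3 tensor and try to recover it
rng = np.random.default_rng(1)
A, B, C = (rng.normal(size=(s, 3)) for s in (6, 7, 8))
T = np.einsum('ir,jr,kr->ijk', A, B, C)
Ah, Bh, Ch = cp_als(T, rank=3)
T_hat = np.einsum('ir,jr,kr->ijk', Ah, Bh, Ch)
rel_err = np.linalg.norm(T - T_hat) / np.linalg.norm(T)
```

The appeal for the papers above is that, unlike generic non-convex training, this kind of decomposition comes with conditions under which the global optimum is provably reachable, and each least-squares subproblem parallelizes naturally.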

**Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !**

Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.