Friday, August 29, 2014

Universal Streaming


Given a stream of data, a typical approach in streaming algorithms is to design a sophisticated algorithm with small memory that computes a specific statistic over the streaming data. Usually, if one wants to compute a different statistic after the stream is gone, it is impossible. But what if we want to compute a different statistic after the fact? In this paper, we consider the following fascinating possibility: can we collect some small amount of specific data during the stream that is "universal," i.e., where we do not know anything about the statistics we will want to later compute, other than the guarantee that had we known the statistic ahead of time, it would have been possible to do so with small memory? In other words, is it possible to collect some data in small space during the stream, such that any other statistic that can be computed with comparable space can be computed after the fact? This is indeed what we introduce (and show) in this paper with matching upper and lower bounds: we show that it is possible to collect universal statistics of polylogarithmic size, and prove that these universal statistics allow us after the fact to compute all other statistics that are computable with similar amounts of memory. We show that this is indeed possible, both for the standard unbounded streaming model and the sliding window streaming model.

Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Thursday, August 28, 2014

Compressive Sampling of Polynomial Chaos Expansions: Convergence Analysis and Sampling Strategies

The sensing matrices in Uncertainty Quantification are very structured and so it is sometimes difficult to use solvers relying on gaussian measurement matrices and one wonders if solvers different from L1 could be contemplated in light of the recent SwAMP thing. Anyway, today's paper investigate in a deep fashion this subject. Let us note the use of phase transition as a comparison tool, great !



Sampling orthogonal polynomial bases via Monte Carlo is of interest for uncertainty quantification of models with high-dimensional random inputs, using Polynomial Chaos (PC) expansions. It is known that bounding a probabilistic parameter, referred to as {\it coherence}, yields a bound on the number of samples necessary to identify coefficients in a sparse PC expansion via solution to an ℓ1-minimization problem. Utilizing asymptotic results for orthogonal polynomials, we bound the coherence parameter for polynomials of Hermite and Legendre type under the respective natural sampling distribution. In both polynomial bases we identify an importance sampling distribution which yields a bound with weaker dependence on the order of the approximation. For more general orthonormal bases, we propose the {\it coherence-optimal} sampling: a Markov Chain Monte Carlo sampling, which directly uses the basis functions under consideration to achieve a statistical optimality among all sampling schemes with identical support. We demonstrate these different sampling strategies numerically in both high-order and high-dimensional, manufactured PC expansions. In addition, the quality of each sampling method is compared in the identification of solutions to two differential equations, one with a high-dimensional random input and the other with a high-order PC expansion. In both cases the coherence-optimal sampling scheme leads to similar or considerably improved accuracy.
Relevant previous entry:



Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Wednesday, August 27, 2014

Parallel Paths for Deep Learning and Signal Processing ?

This is a follow-up to this thread.

Deep learning in Machine Learning stands for neural networks with many layers for the purpose of producing features that can eventually be used for classification. In image/signal processing, the generic idea is to acquire and decompose a signal along a family of (generally) known signals (bases in known/identified spaces). Simple machine learning techniques can then rely on the elements of that decomposition (the features in ML) to achieve the last step of a classification task. Is there a convergence between the two approaches ?

As Stephane Mallat [1] pointed out in this panel, an FFT/IFFT is already deep, in fact a process that is decomposed through several matrix factorizations is also deep as it requires several iterations of different factorizations ( see Sparse Matrix Factorization: Simple rules for growing neural nets and Provable Bounds for Learning Some Deep Representations ) with each factored matrix representing a layer.

But if we explore recent developments in Compressive Sensing, we know that that most reconstruction solvers used to take a long time to converge and that any convergence was measured in hundreds if not thousands of iterations. If any iteration could be construed as a layer (as is the case in autoencoders) , a very deep network composed of a thousand layers would be clearly a non publishable offence.  Aside from the long "depth", some of these solvers rely on linear operations whereas current neural networks implicitely use nonlinear functionals.  

Recently, many things changed in Compressive Sensing with the appearance of Approximate Message Passing algorithms. They are theoretically sound and require only a few iterations (5 to 20) to obtain convergence. Each of these iterations can be expressed as a nonlinear functional of a matrix-vector multiply akin to each layer's computation in neural networks:




From: The SwAMP Thing! Sparse Estimation with the Swept Approximated Message-Passing Algorithm -implementation -

One could argue that the first layer in Compressive Sensing does not parallel that in Neural Networks. It turns out that a few people [6] are working on what is called  one bit sensing [5] which is very close in spirit to neural networks. 

In all, each of these approaches are building deep networks in their own way...using different vocabularies.





Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

#itwist14 and CfP: Biomedical and Astronomical Signal Processing (BASP) Frontiers 2015

The twitter handle you currently want to follow is that of iTWIST ( international Traveling Workshop on Interactions between Sparse models and Technology ) that is taking place right now, it is:


Also Yves Wiaux sent me the following:

Hi Igor
Could I ask you to post this conference announcement on Nuit Blanche?
Thanks a lot in advance
Cheers
Yves
Here is the announcement:

***
Re: Biomedical and Astronomical Signal Processing (BASP) Frontiers 2015 at baspfrontiers.org

Dear Colleagues

Abstract submission is now open until September 26 for deluxe posters contributions by young and senior researchers: http://www.baspfrontiers.org/submission.php?page=Author

See also
Committed invited participants: http://www.baspfrontiers.org/participants.php

Best contribution awards: http://www.baspfrontiers.org/award.php

Do not hesitate to disseminate this announcement.

On behalf of the chairs

Regards

Yves
***
___________________________________
Dr Yves Wiaux, Assoc. Prof., BASP Director
Institute of Sensors, Signals & Systems
School of Engineering & Physical Sciences
Heriot-Watt University, Edinburgh


Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Yoshua Bengio's view on Deep Learning


Following this entryYoshua Bengio just wrote the following on his Google+ stream (I have a added a few links and reshaped the text for clearer reading):

There was an interesting debate on deep learning at Technion a couple of days ago: http://nuit-blanche.blogspot.fr/2014/08/is-deep-learning-final-frontier-and-end.html 
I wasn't there but I wished I had been to clarify a number of points. So I wrote this message to the members of the debate panel (sorry, it's pretty long):
-------- 
I just watched your recent debate on deep learning that took place at Technion. It was very interesting to hear you but made me wish very much that I had been there to inform the debate. So I would like to throw in a few answers to some of the questions and comments about some statements that were made. 
(1) Lack of theory and understanding of why deep learning works
I would be much more nuanced then what was said: there is already a lot of theory, but a lot more remains mysterious and needs to be worked out.  
1.1 Regarding depth, there are several theoretical results showing that deep circuits (including deep feedforward nets with rectifiers) can represent some functions exponentially (wrt depth) more efficiently than shallower ones (or ones of depth 2, which is the minimum to get universal approximation). The latest in this series (with references to previous ones) is this one: http://arxiv.org/abs/1402.1869 
1.2 Regarding the effect of learning a distributed representation, the same paper (following on many papers where I discuss this non-mathematically, including my 2009 book, and also following Montufar's similar result on RBMs,http://arxiv.org/pdf/1206.0387v4) shows that even a single layer distributed representation can represent an exponentially richer function (or distribution, for RBMs) than a "non-distributed" learner, e.g., kernel SVM, clustering, k-nearest-neighbors, decision trees, and other such non-parametric machine learning methods. This follows on previous negative results on such non-distributed local learners in my NIPS'2005 paper with Delalleau "The curse of highly variable functions for local kernel machines" and a related paper on the limitations of decision trees ("Decision trees do not generalize to new variations"). 
1.3 Insight behind the theory. The basic reason we get these potentially exponential gains is that we have compositionality of the parameters, i.e., the same parameters can be re-used in many contexts, so O(N) parameters can allow to distinguish O(2^N) regions in input space, whereas with nearest-neighbor-like things, you need O(N) parameters (i.e. O(N) examples) to characterize a function that can distinguish betwen O(N) regions.
Of course, this "gain" is only for some target functions, or more generally, we can think of it like a prior. If the prior is applicable to our target distribution, we can gain a lot. As usual in ML, there is no real free lunch. The good news is that this prior is very broad and seems to cover most of the things that humans learn about. What it basically says is that the data we observe are explained by a bunch of underlying factors, and that you can learn about each of these factors WITHOUT requiring to see all of the configurations of the other factors. This is how you are able to generalize to new configurations and get this exponential statistical gain. 
(2) Generative & unsupervised models. 
At the end there was a question about unsupervised and generative deep learning, and I would have really liked to be there to say that there is a lot of it, and in fact it is one of the main differences between the neural net research of the previous wave and the current wave, i.e., that we have made a lot of progress in designing unsupervised learning algorithms, including the generative type, and including deep ones. I am sure you have already heard about RBMs? and maybe about denoising auto-encoders? They are shallow but you can use them to pre-train deep generative models such as DBMs and DBNs (although these days we are moving in a territory where you don't need pre-training anymore). To see how powerful these are I invite you to look at some of the images generated in the papers of the last few years, e.g., http://www.icml-2011.org/papers/591_icmlpaper.pdf, or or pretty much all the papers from Ruslan Salakhutdinov or http://arxiv.org/pdf/1312.6114v10. Some that literature is reviewed in our 2013 TPAMI review paper
(3) How deep is deep enough?  
The above theoretical results say that the answer is data-dependent. For some tasks you may need a deeper (more non-linear, more complex) function. Also the actual "number" of layers is somewhat immaterial because it depends on the richness of the non-linearities that we put in each layer (i.e. you may simulate a depth-d net with a depth 2d depending on what operations each level is allowed to perform). However, with the usual suspects (neuron-like things and RBF-like things and gate-like things and sums and products), two levels give you universal approximation, so what we usually call "shallow" is depth 1 or 2. Depth 1 corresponds to linear systems, which are not even universal approximators, but are still very often used because they are so convenient. Regarding FFTs and such, they are indeed deep but that is not deep learning. Deep learning is when you learn multiple levels of representation, and the number of levels is something you can learn as well, so you have a kind of generic recipe throughout. 
(4) Successes of deep learning, beyond object recognition in images. 
Of course, besides object recognition in images, the big success has been in speech recognition. What most people don't know is that in the language modeling part of speech recognition, neural networks have been the SOTA for a number of years (starting with the work of Holger Schwenk), and neural nets (especially learning neural word embeddings, which I started with my NIPS'2000 paper) are quickly becoming a standard and a secret sauce in modern NLP. In particular, we are seeing this year a number of groups reaching and passing the SOTA in machine translation using these ideas (Google, BBN, Oxford, my lab, others). What is interesting is that we have moved beyond "object classification" into "structured output" tasks where the "output" is a high-dimensional object (e.g. a sentence, or a parse tree) which we are representing typically by its joint distribution. What is also interesting is that a lot of that work relies on UNSUPERVISED learning from pure text. See below for more on the supervised vs unsupervised divide. 
(5) Unsupervised learning is not bullshit and we can generalize with very few labeled examples thanks to transfer learning. Unsupervised learning is crucial to approach AI for a number of fundamental reasons, including the abundance of unlabeled data and the observed fact that representation learning (whether supervised or unsupervised) allows transfer learning, allowing to learn from VERY FEW LABELED EXAMPLES some new categories. Of course this is only possible if the learner has previously learned good representations from related categories, but with the AlexNet, it has clearly been shown that from 1000 object categories you can generalize to new categories with just a few examples. This has been demonstrated in many papers, starting with the two transfer learning competitions we won in 2011 (ICML 2011 and NIPS 2011), using UNSUPERVISED transfer learning. More recently, Socher showed that you can even get some decent generalization from ZERO examples simply because you know things from multiple modalities (e.g., that 'dog' and 'cat' are semantically similar in sentences, so that you can guess that something in an image could be a dog even if you have only seen images of cats). We had shown a similar result earlier (AAAI-2008, Larochelle et al) in the case of learning a new task for which you have some kind of representation (and these representations are learned across tasks).
So you can't use deep learning on a new field for which there is very little data if there is no relationship with what the learner has learned previously, but that is also true of humans. 
(6) Deep learning is not magic. 
I am horrified at how people react to deep learning as if 
  • (a) it was something magical, or 
  • (b) blame it for not being magical and solving every problem. 
Come on, let's not feed the blind hype and stay in the realm of science... which concentrates on the question of UNDERSTANDING. However, the kind of understanding I and others are seeking is not about the physical world directly but about the LEARNING mechanisms. That is very different. That is the thing on which I seek insight, and that is what my papers seek to provide. 
(7) About performance bounds: 
Because we are doing machine learning, you won't be able to achieve performance bounds without making assumptions on your data generating distribution. No magic. The BIG difference between deep learning and classical non-parametric statistical machine learning is that we go beyond the SMOOTHNESS assumption and add other priors such as
  • the existence of these underlying generative factors (= distributed representations)
  • assuming that they are organized hierarchically by composition (= depth)
  • assuming that they are causes of the observed data (allows semi-supervised learning to work)
  • assuming that different tasks share different subsets of factors (allows multi-task learning and transfer learning to tasks with very few labeled examples)
  • assuming that the top-level factors are related in simple ways to each other (makes it possible to stick a simple classifier on top of unsupervised learning and get decent results, for example)
See more of a discussion of that in my 2013 PAMI review paper on representation learning.
I am also posting this answer on Nuit Blanche (where I saw the video), google+ and facebook. The new book (called 'Deep Learning') that I am writing should help make these answers even clearer. 
Thanks again for helping me clarify things that need to be made clearer.
-- Yoshua




Image Credit: NASA/JPL-Caltech 
This image was taken by Navcam: Right B (NAV_RIGHT_B) onboard NASA's Mars rover Curiosity on Sol 729 (2014-08-25 05:32:20 UTC). 
Full Resolution

Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Tuesday, August 26, 2014

Advice to a Freshly Graduated Graduate Student

Yesterday, I asked Junjie where his code would eventually be available. Initially Junjie wanted me to post it on my google site since he did not have an internet presence. During our back and forth conversation, it became clear that letting others be the owner of one's intellectual work and accomplishments is not a good signal fresh graduates want to send to the rest of the world. 

If you are a graduate student who has put the work, your research production cannot be in the hand of others by:

  • having only a personal website located at your current university (next department head or IT support could switch off your visibility from the web in one single blink)
  • not sharing your preprint or published article because you think that signing off a copyright statement to a journal means that you have to hide your work production from your peers forever (check what you can do here: http://www.sherpa.ac.uk/romeo/ )
  • not sharing your code implementation: once the implementation is out, others will use it, you will become a giant, see Nobody Cares About You and Your Algorithm. At the very least, you will be able to go back to your graduate work easily six months from now.
  • not pitching your work on Nuit Blanche (provided it is relevant - please tell me how it is relevant -). I specifically try to avoid deeplinking to implementations so that people can land on the researcher's or his/her publication page.
Junjie just produced an independent web presence and has released his code for yesterday's implemention here:


I'll update yesterday's entry to link back to that page.

Monday, August 25, 2014

Turbo Compressed Sensing with Partial DFT Sensing Matrix - implementation -

Junjie just let me know of the following:



Dear Igor,
I'm a reader of your Nuit Blanche blog, and I benefit a lot from it.
Recently we wrote a paper called "Turbo compressed sensing with partial DFT sensing matrix", available at http://arxiv.org/abs/1408.3904​ .
In that paper, we developed a simple iterative signal recovery algorithm for partial DFT sensing matrix (or more generally sub-sampled row orthogonal matrix). As we know, the state evolution of AMP (approximate message passing) is developed for i.i.d matrix and may not be accurate for a partial DFT sensing matrix, especially when the sparsity level is high. In our paper, we developed a signal recovery algorithm and analyzed its state evolution. Our numerical experiments show that the state evolution matches well with simulation. What is exciting is that the state evolution of our approach is consistent with the theoretical prediction for partial DFT sensing matrix in the following paper:
A. Tulino, G. Caire, S. Verdu, and S. Shamai, “Support recovery with sparsely sampled free random matrices,” IEEE Trans. Inf. Theory, vol. 59, no. 7, pp. 4243–4271, July 2013.
I therefore view our method as an extension of AMP from i.i.d matrix to partial DFT matrix.

I hope you find our paper interesting. Any comment is welcome.
Best regards,

Junjie

In this letter, we propose a turbo compressed sensing algorithm with partial discrete Fourier transform (DFT) sensing matrices. Interestingly, the state evolution of the proposed algorithm is shown to be consistent with that derived using the replica method. Numerical results demonstrate that the proposed algorithm outperforms the well-known approximate message passing (AMP) algorithm when a partial DFT sensing matrix is involved.
The implementation is on Junjie's page:



Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Scaling up Deep Learning - Yoshua Bengio

Following up on yesterday's summary of the panel discussion asking the question: Is Deep Learning the Final Frontier and the End of Signal Processing ? here is yesterday's tutorial on Scaling up Deep Learning by Yoshua Bengio

Here are some of the recent preprints by Yoshua Bengio and his team. Let us note that the generative adversarial work, at the very least serves; the purpose of answering a question that was asked repeatedly in the panel discussion on the actual functional space that is well approximated by the networks.




We propose to exploit {\em reconstruction} as a layer-local training signal for deep learning. Reconstructions can be propagated in a form of target propagation playing a role similar to back-propagation but helping to reduce the reliance on derivatives in order to perform credit assignment across many levels of possibly strong non-linearities (which is difficult for back-propagation). A regularized auto-encoder tends produce a reconstruction that is a more likely version of its input, i.e., a small move in the direction of higher likelihood. By generalizing gradients, target propagation may also allow to train deep networks with discrete hidden units. If the auto-encoder takes both a representation of input and target (or of any side information) in input, then its reconstruction of input representation provides a target towards a representation that is more likely, conditioned on all the side information. A deep auto-encoder decoding path generalizes gradient propagation in a learned way that can could thus handle not just infinitesimal changes but larger, discrete changes, hopefully allowing credit assignment through a long chain of non-linear operations. In addition to each layer being a good auto-encoder, the encoder also learns to please the upper layers by transforming the data into a space where it is easier to model by them, flattening manifolds and disentangling factors. The motivations and theoretical justifications for this approach are laid down in this paper, along with conjectures that will have to be verified either mathematically or experimentally, including a hypothesis stating that such auto-encoder mediated target propagation could play in brains the role of credit assignment through many non-linear, noisy and discrete transformations.




Many state-of-the-art results obtained with deep networks are achieved with the largest models that could be trained, and if more computation power was available, we might be able to exploit much larger datasets in order to improve generalization ability. Whereas in learning algorithms such as decision trees the ratio of capacity (e.g., the number of parameters) to computation is very favorable (up to exponentially more parameters than computation), the ratio is essentially 1 for deep neural networks. Conditional computation has been proposed as a way to increase the capacity of a deep neural network without increasing the amount of computation required, by activating some parameters and computation "on-demand", on a per-example basis. In this note, we propose a novel parametrization of weight matrices in neural networks which has the potential to increase up to exponentially the ratio of the number of parameters to computation. The proposed approach is based on turning on some parameters (weight matrices) when specific bit patterns of hidden unit activations are obtained. In order to better control for the overfitting that might result, we propose a parametrization that is tree-structured, where each node of the tree corresponds to a prefix of a sequence of sign bits, or gating units, associated with hidden units.


In this paper, we propose a novel neural network model called RNN Encoder--Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder--Decoder as an additional feature in the existing linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.


We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.



We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.




Training deep directed graphical models with many hidden variables and performing inference remains a major challenge. Helmholtz machines and deep belief networks are such models, and the wake-sleep algorithm has been proposed to train them. The wake-sleep algorithm relies on training not just the directed generative model but also a conditional generative model (the inference network) that runs backward from visible to latent, estimating the posterior distribution of latent given visible. We propose a novel interpretation of the wake-sleep algorithm which suggests that better estimators of the gradient can be obtained by sampling latent variables multiple times from the inference network. This view is based on importance sampling as an estimator of the likelihood, with the approximate inference network as a proposal distribution. This interpretation is confirmed experimentally, showing that better likelihood can be achieved with this reweighted wake-sleep procedure, which also provides a natural way to estimate the likelihood itself. Based on this interpretation, we propose that a sigmoid belief network is not sufficiently powerful for the layers of the inference network, in order to recover a good estimator of the posterior distribution of latent variables. Our experiments show that using a more powerful layer model, such as NADE, yields substantially better generative models.



We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.




Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Sunday, August 24, 2014

Is Deep Learning the Final Frontier and the End of Signal Processing ?

[Update: In response to this panel discussion, Yoshua Bengio provided his view on Deep Learning ]

There was a panel discussion at the Technion in June asking the provocative question: "Is Deep Learning the Final Frontier and the End of Signal Processing"



The panel was made up of the following smarts: Guillermo Sapiro (Moderator), Raphy Coifman, Nati Srebo, Stephane Mallat, Peyman Milanfar and Amnon Shashua,

I have attempted to summarize some of what was said (please let me know if I stated something inaccurate). I have also added my understanding of some of these issues at the bottom of this summary.

Raphy Coifman talked about how randomization has been successful in Compressive Sensing and how we eventually devised ways to deconstruct it. We should be doing the same for deep learning.

Stephane Mallat stated that there was something deep about memory (about its growth). A long time ago, much of the emphasis was about logic not about memory. Systems with large memory changes that paradigm,  there is something deep conceptually underneath.

Peyman Milanfar, who is interested in performance bounds made the point that one need to understand about these deep networks better (how much data is needed, etc ). That currently there no deep insight was gained from deep learning examples: more data does not translate into better results...or insight.

Amnon Shashua, stated that CNNs are fascinating because it is like going back to the stone age (convolution, etc...) but that it now works. This is shocking. Someone without domain expertise can get good results. We just replaced the nonlinearity and the regularization but everything is the same as in the 1980s. 

Guillermo wondered about fields where image is generally not abundant (example of the costly brain image acquisition by NIH), does deep network allow for smart interpolation ?

Raphy asked about learning physical models from examples, can deep learning allow one to learn more than a specific situation.

Peyman: How do we go from shallow to deep learning ?

Stephane: The reason these networks are deep is because of the factorization of an operator into smaller ones just like FFT or FWT

Mihail: Is there an equivalent in generative methods for deep learning ? deep generative methods ?

Raphy: Instead of being generative, it is probably a good way to build a memory of what we know.

Stephane: it is not just a storage of what we know however we also don't know the functional space that is approximated by these deep networks.

Other questions asked by people in the room yielded the following remarks:

Ammon: the unsupervised work (autoencoder) is no longer what makes the deep network successful since 2011. 

Guillermo: Another even more successful algorithm is Random Forest. Both algorithms ( random forest and deep learning ) are vying for displacing the other (instead of aiming for signal processing ).

Closing remarks

Peyman: what are the performance bounds for small amount of data, one of the interesting question is "can deep network do with one example ?" 

Nati: It is probably transformative

Raphy: can it help with one exemplar learning ?

Stephane: beautiful example of Engineering being at the forefront of mathematics, beautiful opportunities ahead.


The discussion was a very insightful summary of the lingering questions. Here are some remarks:

With regards to understanding the black box that is deep network, let us mention that there is some ongoing work in that respect (see Sparse Matrix Factorization: Simple rules for growing neural nets and Provable Bounds for Learning Some Deep Representations ). 

With linear encoding systems used in compressive sensing, we have gotten a deeper understanding of them though sharp phase transitions. Can a similar path be taken for understanding deep networks (see Sunday Morning Insight: Sharp Phase Transitions in Machine Learning ?Sunday Morning Insight: Randomization is not a dirty word).

When Stephane Mallat mentioned that FFT was already a deep network, maybe we should view of a whole slew of family of algorithms as deeper architectures and yes, that include iterative solvers (  see Sunday Morning Insight: Faster Than a Blink of an Eye ). The generative model question made me think of what we recently called Physics Driven Sensor Design. Obviously Matrix factorization could be part of that mix (see Sunday Morning Insight: Matrix Factorizations and the Grammar of Life ) but I would not be surprised if Random Forests did not make progress in that area since they tend to provide some insight and some explaining power.

While the panel featured some of the most important people in the field of signal processing and machine learning, I really felt that a hardware person was missing. Signal processing has always been on the tail end of the hardware pipeline and the issue of too few data in some fields is directly related to the cost of hardware. However, because there is a somewhat continuous spectrum between Machine Learning and traditional signal processing (see A Quick Panorama of Sensing from Direct Imaging to Machine Learning ), I think it is becoming clearer that we are in fact entering the era of Zero Knowledge Sensor Design / Data Driven Sensor Design ( see the recent Learning to be a Depth Camera for Close-Range Human Capture and Interaction ) especially since memory is no longer an issue (Being a child of Moore's Law). Time will tell.

Sunday Morning Insight: Relax, no need to round: integrality of clustering formulations

Awesome ! As we all know clustering is at the heart of Machine Learning and especially in unsupervised learning setting. But given the diverse set of algorithms, it is generally difficult to figure out which one should be applied and when. The authors of the following paper look at different instances or relaxation of the k-means/k-median algorithm and eventually get a sense of their respective strength through a vizualisation of their respective phase transitions. The map makers story continues:



In this paper, we initiate the study of exact recovery conditions for convex relaxations of point cloud clustering problems. We focus on two of the most common optimization problems for unsupervised clustering: k-means and k-medians clustering. Motivations for focusing on convex relaxations are that (a) they come with a certificate of optimality, and (b) they are generic tools not tailored to the specific recovery-guarantee conditions. More precisely, consider the distributional setting where there are k clusters and data from each cluster consists of n points sampled from a symmetric distribution within a ball of unit radius of dimension m. We ask: what is the minimal separation distance between cluster centers needed for various convex relaxations to exactly recover these k clusters as its optimal integral solution? For the k-median linear programming relaxation we show a tight bound: exact recovery is obtained given arbitrarily small cluster separation Δ>2+ϵ, for any ϵ>0. Under the same distributional model, the k-means LP relaxation fails to recover such clusters at separation as large as 2+2√. Yet, if we enforce PSD constraints on the k-means LP, we get exact cluster recovery at separation as low as Δ>2+2k/m−−−−−√+ϵ. In contrast, common heuristics such as Lloyd's algorithm (a.k.a. the k-means algorithm) can fail to recover clusters in this setting, even just three clusters and arbitrarily large cluster separation distance. To complement the theoretical analysis, we provide an experimental study of the recovery guarantees for these various methods. Our work provides new insights into the power of these relaxations and we believe that this line of research will lead to a deeper understanding of such methods in the future.

I just created a new section in Advanced Matrix Factorization Jungle Page that features a phase transition for "K-Means / K-MedianA = DX  with unknown D and X, solve for XX^T = I and X_i = 0 or 1"

Relevant previous discussion on a related subject:



Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.