## Sunday, December 08, 2013

### Sunday Morning Insight: Sharp Phase Transitions in Machine Learning ?

In the Sunday Morning Insight on a Quick Panorama of Sensing from Direct Imaging to Machine Learning, I pointed out to the continuous spectrum of ideas that went from 'pure' sensing all the way to the current machine learning techniques. In fact some of the indirect imaging (compressive sensing) could fit well into machine learning approaches with various instances of nonlinear compressive sensing, in particular quantization and 1bit CS [7]. Now the question that immediately arises after this parallel is: are there sharp phase transitions in Machine Learning as those found in compressive sensing and advanced matrix factorization. In these fields, these sharp phase transitions have been found to be pretty universal and they really divide what is computable with a polynomail algorithm (P) and what is not currently known to have a polynomial algorithm (NP) ( see Sunday Morning Insight's (The Map Makers)). It turns out that any advances in compressive sensing can really only be gauged with how one can push these boundaries ( see "...This result is quite striking...." ). Do we see these phase transitions in machine learning algorithms ?

While reading some of the NIPS2013 papers, I just stumbled upon this preprint [0] (Knowledge Matters: Importance of Prior Information for Optimization by Çağlar GülçehreYoshua Bengio) that asks the question as to whether any of the current approaches in Machine Learning have this ability to do well on simple benchmarks. The idea being that unlike most datasets against which most machine learning algorithms are comparing themselves, eventually, there is a sense that a simpler 'dataset' might provide some insight. From the abstract , the challenge is

In our experiments, a two-tiered MLP architecture is trained on a dataset with 64x64 binary inputs images, each image with three sprites. The final task is to decide whether all the sprites are the same or one of them is different. Sprites are pentomino tetris shapes and they are placed in an image with different locations using scaling and rotation transformations.
This is indeed a simple benchmark that has the added property of being very sparse even for image processing tasks. We also all realize that we are comparing two beasts. In ML parlance, the CS approach has a linear layer and a nonlinear one, while a traditional neural network, for instance, has only nonlinear layers. From the paper, one can read the following interesting bits:
Interestingly, the task we used in our experiments is not only hard for deep neural networks but also for non-parametric machine learning algorithms such as SVM's, boosting and decision trees.
....the task in the first part is to recognize and locate each Pentomino object class6 in the image.  The second part/fi nal binary classi cation task is to fi gure out if all the Pentominos in the image are of the same shape class or not. If a neural network learned to detect the categories of each object at each location in an image, the remaining task becomes an XOR-like operation between the detected object categories....
....With 40,000 examples, this gives 32,000 examples for training and 8,000 examples for testing. For neural network algorithms, stochastic gradient descent (SGD) is used for training. The following standard learning algorithms were rst evaluated: decision trees, SVMs with Gaussian kernel, ordinary fully-connected Multi-Layer Perceptrons, Random Forests, k-Nearest Neighbors, Convolutional Neural Networks, and Stacked Denoising Auto-Encoders with supervised ne-tuning....
and from the conclusion

The task has the particularity that it is de ned by the composition of two non linear sub-tasks (object detection on one hand, and a non-linear logical operation similar to XOR on the other hand).
What is interesting is that in the case of the neural network, we can compare two networks with exactly the same architecture but a different pre-training, one of which uses the known intermediate concepts to teach an intermediate representation to the network. With enough capacity and training time they can overfit but did not not capture the essence of the task, as seen by test set performance.
We know that a structured deep network can learn the task, if it is initialized in the right place, and do it from very few training examples. Furthermore we have shown that if one pre-trains SMLP with hints for only one epoch, it can nail the task. But the exactly same architecture which started training from random initialization, failed to generalize.
But here is what caught my attention, figure 12.. and many others actually:

Most ML practitionners will recognize that past a certain number of training elements, the algorithm has learned the identity. But we ought to look at it from the point of view of what we know in linear systems, i.e compressive sensing and related matrix factorization techniques.

In essence in this training items are fed in order to build a model and it hits a region where the function being modeled reproduces well the identity. A similar phase transition [2,6] can be found in the equivalent problem found in blind compressive sensing [1,4,5] (some of this work is currently being presented at NIPS2013) or blind deconvolution where one feeds a linear model with sparse objects and finds the ability to reconstruct the whole linear operator after a certain number of 'training' elements.

From [1]

From [5]

There ought to be a similar phase transition for 1-bit sensing as witnessed in  [8] that has an even more direct connection to this simple benchmark in Machine Learning.
From  [8]

[0] Knowledge Matters: Importance of Prior Information for Optimization by Çağlar Gülçehre, Yoshua Bengio

We explore the effect of introducing prior information into the intermediate level of neural networks for a learning task on which all the state-of-the-art machine learning algorithms tested failed to learn. We motivate our work from the hypothesis that humans learn such intermediate concepts from other individuals via a form of supervision or guidance using a curriculum. The experiments we have conducted provide positive evidence in favor of this hypothesis. In our experiments, a two-tiered MLP architecture is trained on a dataset with 64x64 binary inputs images, each image with three sprites. The final task is to decide whether all the sprites are the same or one of them is different. Sprites are pentomino tetris shapes and they are placed in an image with different locations using scaling and rotation transformations. The first part of the two-tiered MLP is pre-trained with intermediate-level targets being the presence of sprites at each location, while the second part takes the output of the first part as input and predicts the final task's target binary event. The two-tiered MLP architecture, with a few tens of thousand examples, was able to learn the task perfectly, whereas all other algorithms (include unsupervised pre-training, but also traditional algorithms like SVMs, decision trees and boosting) all perform no better than chance. We hypothesize that the optimization difficulty involved when the intermediate pre-training is not performed is due to the {\em composition} of two highly non-linear tasks. Our findings are also consistent with hypotheses on cultural learning inspired by the observations of optimization problems with deep learning, presumably because of effective local minima.
The code to reproduce the figures in this paper can be found here:

## Saturday, December 07, 2013

### Saturday Morning Videos: Unifying Theory and Experiment for Large-Scale Networks

The Simons Institute for the Theory of Computing had a workshop last month on the Unifying Theory and Experiment for Large-Scale Networks. The videos are in (I am highlighting some videos in this post, for the others, click-through the title of the talk):

Monday, November 18th, 20138:30 am – 8:50 am
Coffee and Check-In

8:50 am – 9:00 am
Opening Remarks

9:00 am – 9:10 am
Overview: Graph Models
Michael Kearns, University of Pennsylvania
9:10 am – 9:35 am
Limiting Behavior in Large Networks
Patrick Wolfe, University College London

How do we simplify and understand a rigid combinatorial object (a finite, discrete graph) in terms of an infinite-dimensional limiting object (a function)? What types of limiting behaviors should we expect empirically? What types do we need mathematically, to ensure consistent inference (leading to algorithms with provable properties, and hence to defensible data-analytic conclusions)?

9:35 am – 10:00 am
Practice is Better than Theory: Using Approximate Counting to Mine Large Graphs
Sebastiano Vigna, University of Milan

Algorithms developed in the last decade to analyze large networks (centrality, neighbourhood functions, distance distributions) use approximate set representations. The ways in which these approximate set representation are used give no theoretical guarantee, yet the computations are extremely precise (when compared with a ground truth) and even outperform the theoretical precision of the set representations themselves. It would be interesting to understand why.

10:00 am – 10:30 am
Break

10:30 am – 10:55 am
Bayesian Models for Graph Data and the Open Problem of Invariance in Networks
Peter Orbanz, Columbia University
10:55 am – 11:20 am
Matt Salganik, Princeton University
11:20 am – 12:00 pm
Discussion

12:00 pm – 1:30 pm
Lunch Break

1:30 pm – 1:40 pm
Overview: Clustering
Jennifer Neville, Purdue University
1:40 pm – 2:05 pm
Asymptotic Analysis of Unlabelled Networks
Peter Bickel, UC Berkeley
2:05 pm – 2:30 pm
Finding Small Structures in Large Datasets
Andrea Montanari, Stanford University

The term 'relational dataset' refers generically to graphs, matrices, networks. I will review a number of examples coming from different communities (machine learning, statistics, computer science) in which it is desirable to find a small structure in such datasets. Some fundamental computational obstructions appear to emerge in each of these domains, and I will discuss their connections.

2:30 pm – 3:00 pm
Break

3:00 pm – 3:25 pm
Local Clustering and the Blessing of Transitivity
3:25 pm – 3:50 pm
Large-scale Graph Clustering and Message-passing-based Distributed Framework
3:50 pm – 4:30 pm
Discussion

Tuesday, November 19th, 20138:30 am – 9:00 am
Coffee and Check-In

9:00 am – 9:10 am
Overview: Network Formation
Ashish Goel, Stanford University
9:10 am – 9:35 am
Challenges of Modeling Network Formation in Tractable Ways
Arun Chandrasekhar, Stanford University
9:35 am – 10:00 am
Relating Developmental Transcription Factors (TFs) Based on Fruitfly Embryoic Images
Bin Yu, UC Berkeley
10:00 am – 10:30 am
Break

10:30 am – 10:55 am
Issues in Developing Estimable Models of Strategic Network Formation
Matt Jackson, Stanford University
10:55 am – 11:20 am
Behavioral Experiments in a Network Formation Game
Michael Kearns, University of Pennsylvania
11:20 am – 12:00 pm
Discussion

12:00 pm – 2:00 pm
Lunch Break

2:00 pm – 2:10 pm
Overview: Causality/Experiments
Matt Jackson, Stanford University
2:10 pm – 2:35 pm
Are Observational Studies of Social Contagion Doomed?
Cosma Shalizi, Carnegie Mellon University
2:35 pm – 3:00 pm
On Sampling in Dynamic Networks for Inference, Experiments, and Interventions
Steve Thompson, Simon Fraser University
3:00 pm – 3:30 pm
Break

3:30 pm – 3:55 pm
The Effect of Social Media on News Consumption
Markus Mobius, Microsoft Research New England
3:55 pm – 4:30 pm
Discussion

4:30 pm – 5:30 pm
Reception

Wednesday, November 20th, 20138:30 am – 9:00 am
Coffee and Check-In

9:00 am – 9:10 am
Overview: Label Prediction/Diffusion
Michael Kearns, University of Pennsylvania
9:10 am – 9:35 am
Semi-supervised Learning on Graphs, Using Observed Correlation Structure
Art Owen, Stanford University
9:35 am – 10:00 am
10:00 am – 10:30 am
Break

10:30 am – 10:55 am
The Impact of Network Structure on Relational Learning and Inference
Jennifer Neville, Purdue University
10:55 am – 11:20 am
The Emergence of Convention: An Experimental Study of Social Order
Damon Centola, Massachusetts Institute of Technology
11:20 am – 12:00 pm
Discussion

12:00 pm – 1:30 pm
Lunch Break

1:30 pm – 1:40 pm
Overview: Networks Over Time
Patrick Wolfe, University College London
1:40 pm – 2:05 pm
Estimating Time-Varying Networks
2:05 pm – 2:30 pm
Pre-Processing of Dynamic Networks: Its Impact on Analytics
2:30 pm – 3:00 pm
Break

3:00 pm – 3:25 pm
Graph Streams and Sketches: What Can and Can't Be Done
Andrew McGregor, University of Massachusetts Amherst

3:25 pm – 3:50 pm
The Trials and Tribulations of Tractably Counting Triangles
3:50 pm – 4:30 pm
Discussion

Thursday, November 21st, 20138:30 am – 9:00 am
Coffee and Check-In

9:00 am – 9:10 am
Overview: Scalability/Practical Challenges

I will summarize some architectures that are currently used in practice (distributed streaming, single large memory or flash machines, Hadoop, sharded key-value stores, etc). I will outline some graph algorithms that run well on these architectures, and describe some problems for which no good combination of algorithm and architecture is currently known.

9:35 am – 10:00 am
On Unifying Asymptotic Complexity with Real-world Performance in Matrix-based Network Computations
David Gleich, Purdue University

Many network analysis questions can be posed as solving a system of linear equations with a stochastic random-walk matrix, e.g. PageRank. Monte Carlo methods often have the best asymptotic complexity for these problems. Yet, those same methods are notoriously slow to find high precision answers. In this talk, we pose some questions on unifying asymptotic complexity with real-world performance.

10:00 am – 10:30 am
Break

10:30 am – 10:55 am
Evolving Graphs

We consider the following computational model on graphs that slowly change over time.  At each time step, the graph incurs a unit update and an algorithm has to periodically probe the graph to learn about the update and modify its solution accordingly.  We discuss algorithms in this model for basic graph questions including path connectivity, minimum spanning trees, and PageRank.

Classical Recommender Systems provide serving schemes that display items to users to optimize some user satisfaction metrics. But what happens when such users are connected to each other as in social media platforms like Facebook, LinkedIn and Twitter? Content in such a network is produced by nodes (users) and flows through edges in the graph. This gives rise to challenging and brand new issues when dealing with classical problems like recommending content to users. I will talk about such issues, based on my experience working with feed recommendation problems at LinkedIn.

11:20 am – 12:00 pm Discussion

Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

## Thursday, December 05, 2013

### #NIPS2013 papers, workshops ...

So #NIPS2013 is starting today with a set of tutorials, and a set of workshops listed below. Two words first, if you are prude or at work don't go watch the photos on Twitter (on your desktop) for the #NIPS2013 hashtag just yet! Second, for those of you in Paris next week, we'll have our 6th ML meetup. Third Andrej Karpathy has a nicer way of viewing the NIPS proceedings. It is here.

Without further due:

Here are a few papers I found interesting but the whole electronic proceeding is here (the whole pdf is here):

Several posters from the workshops are listed below:

### CONTRIBUTED TALKS

• Mu Li, Li Zhou, Zichao Yang, Aaron Li, Fei Xia, David Andersen and Alexander Smola.
Parameter Server for Distributed Machine Learning
We propose a parameter server framework to solve distributed machine learning problems. Both data and workload are distributed into client nodes, while server nodes maintain globally shared parameters, which are represented as sparse vectors and matrices. The framework manages asynchronous data communications between clients and servers. Flexible consistency models, elastic scalability and fault tolerance are supported by this framework. We present algorithms and theoretical analysis for challenging nonconvex and nonsmooth problems. To demonstrate the scalability of the proposed framework, we show experimental results on real data with billions of parameters.
PDF
• Yarin Gal and Zoubin Ghahramani.
Pitfalls in the use of Parallel Inference for the Dirichlet Process
Recent work done by Lovell, Adams, and Mansingka [2012] and Williamson, Dubey, and Xing [2013] has suggested an alternative parametrisation for the Dirichlet process in order to derive non-approximate parallel MCMC inference for it. This approach to parallelisation has been picked-up and implemented in several different fields [Chahuneau et al., 2013, Pan et al., 2013]. In this paper we show that the approach suggested is impractical due to an extremely unbalanced distribution of the data. We characterise the requirements of efficient parallel inference for the Dirichlet process and show that the proposed inference fails most of these conditions (while approximate approaches often satisfy most of them). We present both theoretical and experimental evidence of this, analysing the load balance for the inference showing that it is independent of the size of the dataset and the number of nodes available in the parallel implementation, and end with preliminary suggestions of alternative paths of research for efficient non-approximate parallel inference for the Dirichlet process.
PDF
• Yingyu Liang, Maria-Florina Balcan and Vandana Kanchanapally.
Distributed PCA and k-Means Clustering
This paper proposes a distributed PCA algorithm, with the theoretical guarantee that any good approximation solution on the projected data for k-means clustering is also a good approximation on the original data, while the projected dimension required is independent of the original dimension. When combined with the distributed coreset-based clustering approach in [3], this leads to an algorithm in which the number of vectors communicated is independent of the size and the dimension of the original data. Our experiment results demonstrate the effectiveness of the algorithm.
PDF

### POSTERS

• Julien-Charles Lévesque, Christian Gagné and Robert Sabourin.
Ensembles of Budgeted Kernel Support Vector Machines for Parallel Large Scale Learning
In this work, we propose to combine multiple budgeted kernel support vector machines (SVMs) trained with stochastic gradient descent (SGD) in order to exploit large databases and parallel computing resources. The variance induced by budget restrictions of the kernel SVMs is reduced through the averaging of predictions, resulting in greater generalization performance. The variance of the trainings results in a diversity of predictions, which can help explain the better performance. Finally, the proposed method is intrinsically parallel, which means that parallel computing resources can be exploited in a straightforward manner.
PDF
• Zhen Qin, Vaclav Petricek, Nikos Karampatziakis, Lihong Li and John Langford.
Efficient Online Bootstrapping for Large Scale Learning
Bootstrapping is a useful technique for estimating the uncertainty of a predictor, for example, confidence intervals for prediction. It is typically used on small to moderate sized datasets, due to its high computation cost. This work describes a highly scalable online bootstrapping strategy, implemented inside Vowpal Wabbit, that is several times faster than traditional strategies. Our experiments indicate that, in addition to providing a black box-like method for estimating uncertainty, our implementation of online bootstrapping may also help to train models with better prediction performance due to model averaging.
PDF
• Arun Kumar, Nikos Karampatziakis, Paul Mineiro, Markus Weimer and Vijay Narayanan.
Distributed and Scalable PCA in the Cloud
Principal Component Analysis (CA) is a popular technique with many applications. Recent randomized PCA algorithms scale to large datasets but face a bottleneck when the number of features is also large. We propose to mitigate this issue using a composition of structured and unstructured randomness within a randomized PCA algorithm. Initial experiments using a large graph dataset from Twitter show promising results. We demonstrate the scalability of our algorithm by implementing it both on Hadoop, and a more flexible platform named REEF.
PDF
• Nedim Lipka.
Towards Distributed Reinforcement Learning for Digital Marketing with Spark
A variety of problems in digital marketing can be modeled as Markov decision processes and solved by dynamic programming with the goal of calculating the policy that maximizes the expected discounted reward. Algorithms, such as policy iteration, require a state transition and a reward model, which can be estimated based on a given data set. In this paper, we compare the execution times for estimating the transition function in a map-reduce fashion if the data set becomes large in terms of the number of records and features. Therefore, we create different-sized Spark and Hadoop clusters in the Amazon cloud computing environment. The in-memory clustering system Spark is outperforming Hadoop and runs up to 71% faster. Furthermore, we study the execution times of policy iteration running on Spark clusters and show the execution time reduction gained by increasing the number of instances in the cluster.
PDF
• Tuukka Ruotsalo, Jaakko Peltonen, Manuel J. A. Eugster, Dorota Glowacka, Giulio Jacucci, Aki Reijonen and Samuel Kaski.
Lost in Publications? How to Find Your Way in 50 Million Scientific Documents
Researchers must navigate big data. Current scientific knowledge includes 50 million published articles. How can a system help a researcher find relevant documents in her field? We introduce IntentRadar, an interactive search user interface and search engine that anticipates userâ™s search intents by estimating them form userâ™s interaction with the interface. The estimated intents are visualized on a radial layout that organizes potential intents as directions in the information space. The intent radar assists users to direct their search by allowing feedback to be targeted on keywords that represent the potential intents. Users can provide feedback by manipulating the position of the keywords on the radar. The system then learns and visualizes improved estimates and corresponding documents. IntentRadar has been shown to significantly improve usersâ™ task performance and the quality of retrieved information without compromising task execution time.
PDF
• Michael Kane and Bryan Lewis.
cnidaria: A Generative Communication Approach to Scalable, Distributed Learning
This paper presents a scalable, software framework that facilitates large-scale learning and numerical computing. Unlike existing MapReduce frameworks our design is not limited to embarrassingly parallel computing challenges. The framework sits on top of existing storage infrastructures and results of a computation may left out on the cluster (a reduce step is not required). Unlike existing distributed numerical frameworks the proposed framework is elastic and works with both dense and sparse data representations. This generality is achieved through a generative communication scheme whose expressions are either consumed by the distributed computing environment or used to move data, in a peer-to-peer (P2P) fashion, between nodes in a cluster/cloud. This approach integrates advances in the both cloud computing and the distributed numerical computing community and can be applied to a general class of learning challenges.
PDF
• Anshumali Shrivastava and Ping Li.
Beyond Pairwise: Provably Fast Algorithms for Approximate k-Way Similarity Search
We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity. We show that approximate R3way similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of R3way search is shown on the Google Sets application as well as in an application for improving retrieval quality.
PDF
• Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho and Eric Xing.
Petuum: A System for Iterative-Convergent Distributed ML
A major bottleneck to applying advanced ML programs at industrial scales is the migration of an academic implementation, often specialized for a small, wellcontrolled computer platform such as desktop PCs and small lab-clusters, to a big, less predicable platform such as a corporate cluster or the cloud. This poses enormous challenges: how does one train huge models with billions of parameters on massive data, especially when substantial expertise is required to handle many low-level systems issues? We propose a new architecture of systems components that systematically addresses these challenges, thus providing a generalpurpose distributed platform for Big Machine Learning. Our architecture specifically exploits the fact that many ML programs are fundamentally loss function minimization problems, and that their iterative-convergent nature presents many unique opportunities to minimize loss, such as via dynamic variable scheduling and error-bounded consistency models for synchronization. Thus, we treat data, parameter and variable blocks as computing units to be dynamically scheduled and updated in an error-bounded manner, with the goal of minimizing the loss function as quickly as possible.
PDF
• Haiqin Yang, Junjie Hu, Michael Lyu and Irwin King.
Online Imbalanced Learning with Kernels
Imbalanced learning, or learning from imbalanced data, is a challenging problem in both academy and industry. Nowadays, the streaming imbalanced data become popular and trigger the volume, velocity, and variety issues of learning from these data. To tackle these issues, online learning algorithms are proposed to learn a linear classifier via maximizing the AUC score. However, the developed linear classifiers ignore the learning power of kernels. In this paper, we therefore propose online imbalanced learning with kernels (OILK) to exploit the non-linearity and heterogeneity embedded in the imbalanced data. Different from previously proposed work, we optimize the AUC score to learn a non-linear representation via the kernel trick. To relieve the computational and storing cost, we also investigate different buffer update policies, including first-in-first-out (FIFO) and reservoir sampling (RS), to maintain a fixed budgeted buffer on the number of support vectors. We demonstrate the properties of our proposed OILK through detailed experiments.
PDF
• Alex Beutel, Abhimanu Kumar, Evangelos Papalexakis, Partha Pratim Talukdar, Christos Faloutsos and Eric Xing.
FLEXIFACT: Scalable Flexible Factorization of Coupled Tensors on Hadoop
Given multiple data sets of relational data that share a number of dimensions, how can we efficiently decompose our data into the latent factors? Factorization of a single matrix or tensor has attracted much attention, as, e.g., in the Netflix challenge, with users rating movies. However, we often have additional, side, information, like, e.g., demographic data about the users, in the Netflix example above. Incorporating the additional information leads to the coupled factorization problem. So far, it has been solved for relatively small datasets. We provide a distributed, scalable method for decomposing matrices, tensors, and coupled data sets through stochastic gradient descent on a variety of objective functions. We offer the following contributions: (1) Versatility: Our algorithm can perform matrix, tensor, and coupled factorization, with flexible objective functions including the Frobenius norm, Frobenius norm with an l1 induced sparsity, and non-negative factorization. (2) Scalability: FLEXIFACT scales to unprecedented sizes in both the data and model, with up to billions of parameters. FLEXIFACT runs on standard Hadoop. (3) Convergence proofs showing that FLEXIFACT converges on the variety of objective functions, even with projections.
PDF
• Faraz Makari Manshadi and Rainer Gemulla.
A Distributed Approximation Algorithm for Mixed Packing-Covering Linear Programs
Mixed packing-covering linear programs capture a simple but expressive subclass of linear programs. They commonly arise as linear programming relaxations of a number important combinatorial problems, including various network design and generalized matching problems. In this paper, we propose an efficient distributed approximation algorithm for solving mixed packing-covering problems which requires a poly-logarithmic number of passes over the input. Our algorithm is well-suited for parallel processing on GPUs, in shared-memory architectures, or on small clusters of commodity nodes. We report results of a case study for generalized bipartite matching problems.
PDF
• Artem Sokolov and Stefan Riezler.
Task-driven Greedy Learning of Feature Hashing Functions
Randomly hashing multiple features into one aggregated feature is routinely used in largescale machine learning tasks to both increase speed and decrease memory requirements, with little or no sacrifice in performance. In this paper we investigate whether using a learned (instead of a random) hashing function improves performance. We show experimentally that with increasing difference between the dimensionalities of the input space and the hashed space, learning hashes is increasingly useful compared to random hashing.
PDF
• Ahmed Elgohary, Ahmed Farahat, Mohamed Kamel and Fakhri Karray.
Approximate Nearest Centroid Embedding for Kernel $k$-Means
This paper proposes an efficient embedding method for scaling kernel k-means on cloud infrastructures. The embedding method allows for approximating the computation of the nearest centroid to each data instance and, accordingly, it eliminates the quadratic space and time complexities of the cluster assignment step in the kernel k-means algorithm. We show that the proposed embedding method is effective under memory and computing power constraints, and that it achieves better clustering performance compared to other approximations of the kernel kmeans algorithm.
PDF
• Yisheng Liao, Alex Rubinsteyn, Russell Power and Jinyang Li.
Learning Random Forests on the GPU
Random Forests are a popular and powerful machine learning technique, with several fast multi-core CPU implementations. Since many other machine learning methods have seen impressive speedups from GPU implementations, applying GPU acceleration to random forests seems like a natural fit. Previous attempts to use GPUs have relied on coarse-grained task parallelism and have yielded inconclusive or unsatisfying results. We introduce CudaTree, a GPU Random Forest implementation which adaptively switches between data and task parallelism. We show that, for larger datasets, this algorithm is faster than highly tuned multi-core CPU implementations.
PDF
• Shravan Narayanamurthy, Markus Weimer, Dhruv Mahajan, Tyson Condie, Sundararajan Sellamanickam and S. Sathiya Keerthi.
Towards Resource-Elastic Machine Learning

PDF
• Ignacio Arnaldo, Kalyan Veeramachaneni and Una-May O'Reilly.
Building Multiclass Nonlinear Classifiers with GPUs
The adoption of multiclass classification strategies that train independent binary classifiers becomes challenging when the goal is to retrieve nonlinear models from large datasets and the process requires several passes through the data. In such scenario, the combined use of a search and score algorithm and GPUs allows to obtain binary classifiers in a reduced time. We demonstrate our approach by training a ten class classifier over more than 400K exemplars following the exhaustive Error Correcting Output Code strategy that decomposes into 511 binary problems.
PDF
• John Canny and Huasha Zhao.
BIDMach: Large-scale Learning with Zero Memory Allocation
This paper describes recent work on the BIDMach toolkit for large-scale machine learning. BIDMach has demonstrated single-node performance that exceeds that of published cluster systems for many common machine-learning task. BIDMach makes full use of both CPU and GPU acceleration (through a sister library BIDMat), and requires only modest hardware (commodity GPUs). One of the challenges of reaching this level of performance is the allocation barrier. While it is simple and expedient to allocate and recycle matrix (or graph) objects in expressions, this approach is too slow to match the arithmetic throughput possible on either GPUs or CPUs. In this paper we describe a caching approach that allows code with complex matrix (graph) expressions to run at massive scale, i.e. multi-terabyte data, with zero memory allocation after initial start-up. We present a number of new benchmarks that leverage this approach.
PDF
• Shohei Hido, Satoshi Oda and Seiya Tokui.
Jubatus: An Open Source Platform for Distributed Online Machine Learning
Distributed computing is essential for handling very large datasets. Online learning is also promising for learning from rapid data streams. However, it is still an unresolved problem how to combine them for scalable learning and prediction on big data streams. We propose a general computational framework called loose model sharing for online and distributed machine learning. The key is to share only models rather than data between distributed servers. We also introduce Jubatus, an open source software platform based on the framework. Finally, we describe the details of implementing classifier and nearest neighbor algorithms, and discuss our experimental evaluations.
PDF

## Accepted Papers

Poster presentations

## Accepted Papers

• Elham Azizi
• Stephen H. Bach, Bert Huang and Lise Getoor
• Brian Baingana, Gonzalo Mateos and Georgios B. Giannakis
• Evolving Groups of Political Interest in Social News
Roja Bandari, Hazhir Rahmandad and Vwani Roychowdhury
• Eric Bax, James Li, Abdullah Sonmez and Zehra Cataltepe
• William M. Campbell, Elisabeth Baseman and Kara Greenfield
• Bayesian Nonparametric Random Graphs
François Caron and Emily B. Fox
• Ahmed Hefny, Geoffrey Gordon and Katia Sycara
• Tue Herlau, Mikkel N. Schmidt and Morten Morup
• Giuseppe Jurman, Roberto Visintainer, Michele Filosi, Samantha Riccadonna and Cesare Furlanello
• Elias Khalil, Bistra Dilkina and Le Song
• Sequential Monte Carlo Inference of MMSB for Dynamic Social Networks
Tomoki Kobayashi and Koji Eguchi
• Variational Bayesian Inference Algorithms for Network Infinite Relational Model
Takuya Konishi, Takatomi Kubo, Kazuho Watanabe and Kazushi Ikeda
• James Robert Lloyd, Peter Orbanz, Zoubin Ghahramani and Daniel M. Roy
• Kien Duy Nguyen, Tuan Pham Minh, Quang Nhat Nguyen and Thanh Trung Nguyen
• Amirali Salehi-Abari and Craig Boutilier
• Aaron Schein, Juston Moore and Hanna Wallach
• Anshumali Shrivastava and Ping Li
• Serafeim Tsironis, Mauro Sozio and Michalis Vazirgiannis
• James D. Wilson, Shankar Bhamidi and Andrew B. Nobel
• Justin J. Yang and Edoardo M. Airoldi
• Justin J. Yang, Christina Q. Han and Edoardo M. Airoldi
• Yuan Zhang, Elizaveta Levina and Ji Zhu
Linear Bandits, Matrix Completion, and Recommendation Systems [pdf]
Efficient coordinate-descent for orthogonal matrices through Givens rotations [pdf][supplementary]
Improved Greedy Algorithms for Sparse Approximation of a Matrix in terms of Another Matrix [pdf]
Preconditioned Krylov solvers for kernel regression [pdf]
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms [pdf][supplementary]
Dimension Independent Matrix Square using MapReduce [pdf]

• Active Learning of Intuitive Sound Qualities (Huang, Duvenaud, Arnold, Partridge, and Oberholtzer) [pdf]
There is often a mismatch between the high-level goals an artist wants to express and what the parameters of a synthesizer allow them to control. To enable composers to directly adjust personalized high-level qualities during sound synthesis, our system actively learns functions that map from the space of synthesizer control parameters to perceived levels of high-level qualities.
• Automatic Construction and Natural-Language Summarization of Additive Nonparametric Models (Lloyd, Duvenaud, Grosse, Tenenbaum, and Ghahramani) [pdf][supplement1][supplement2]
To complement recently introduced automatic model-construction and search methods, we demonstrate an automatic model-summarization procedure. After building an additive nonparametric regression model, our method constructs a report which visualizes and explains in words the meaning and relevance of each component. These reports enable human model-checking and the understanding of complex modeling assumptions and structure. We demonstrate this procedure on two time-series, showing that the automatically constructed models identify clearly interpretable structures that can be automatically described in simple natural language.
• Designing Constructive Machine Learning Models based on Generalied Linear Learning Techniques (Kordjamshidi and Moens) [pdf]
We propose a general framework for designing machine learning models that deal with constructing complex structures in the output space. The goal is to provide an abstraction layer to easily represent and design constructive learning models. The learning approach is based on generalized linear training techniques, and exploits techniques from combinatorial optimization to deal with the complexity of the underlying inference required in this type of models. This approach also allows to consider global structural characteristics and constraints over the output elements in an efficient training and prediction setting. The use case focuses on building spatial meaning representations from text to instantiate a virtual world.
• Learning Graphical Concepts (Ellis, Dechter, Adams, and Tenenbaum) [pdf]
How can machine learning techniques be used to solve problems whose solutions are best represented as computer programs? For example, suppose a researcher wants to design a probabilistic graphical model for a novel domain. Searching the space of probabilistic models automatically is notoriously difficult, especially difficult when latent variables are involved. However, researchers seem able to easily adapt commonly used modeling motifs to new domains. In doing so, they draw on abstractions such as trees, chains, grids and plates to constrain and direct the kinds of models they produce. This suggests that before we ask machine learning algorithms to discover parsimonious models of new domains, we should develop techniques that enable our algorithms to automatically learn these ?graphical concepts? in much the same way that researchers themselves do, by seeing examples in the literature. One natural way to think of these graphical concepts is as programs that take sets of random variables and produce graphical models that relate them. In this work, we describe the CEC algorithm, which attempts to learn a distribution over programs by incrementally finding program components that commonly help to solve problems in a given domain, and we show preliminary results indicating that CEC is able to discover the graphical concepts that underlie many of the common graphical model structures.
• The Constructive Learning Problem: An Efficient Approach for Hypergraphs (Costa and Sorescu) [pdf]
Discriminative systems that can deal with input graphs are known, however, generative/constructive approaches that can output (hyper)graphs belonging with high probability to a desired class, are less studied. Here we propose an approach that, differently from common graph grammars inference systems, is computationally efficient and robust to the presence of outliers in the training sample. We report experimental results in a de-novo molecular synthesis problem. We show that we can construct compounds that, once added to the original training set can improve the performance of a binary classification predictor.
• Analyzing Probabilistic Models Generated by EDAs for Simplified Protein Folding Problems (Santana, Mendiburu, and Lozano) [pdf]
Estimation of distribution algorithms (EDAs) are optimization methods that construct at each step a probabilistic graphical model (PGM) of the best evaluated solutions. The model serves as a concise representation of the regularities shared by the good solutions and can serve to unveil structural characteristics of the problem domain. In this paper we use the PGMs learned by EDAs in the optimization of 15, 575 instances of the hydrophobic-polar (HP) functional protein folding model to analyze the relationship between the information contained in the PGMs? structures and the quality of the EDA?s solutions.
• Anticipating the Future By Constructing Human Activities using Object Affordances (Koppula and Saxena) [pdf]
An important aspect of human perception is anticipation and anticipating which activities will a human do next (and how to do them) in useful for many applications, for example, anticipation enables an assistive robot to plan ahead for reactive responses in the human environments. In this work, we present a constructive approach for generating various possible future human activities by reasoning about the rich spatial-temporal relations through object affordances. We represent each possible future using an anticipatory temporal conditional random field (ATCRF) where we sample the nodes and edges corresponding to future object trajectories and human poses from a generative model. We then represent the distribution over the potential futures using a set of constructed ATCRF particles. In extensive evaluation on CAD-120 human activity RGB-D dataset, for new subjects (not seen in the training set), we obtain an activity anticipation accuracy (defined as whether one of top three predictions actually happened) of 75.4%, 69.2% and 58.1% for an anticipation time of 1, 3 and 10 seconds respectively. 1
• Learning Global-to-Local Discrete Components with Nonparametric Bayesian Feature Construction (Heo, Lee, and Zhang) [pdf]
Finding common latent components from data is an important step in many data mining applications. These latent variables are typically categorical and there are many sources of categorical variables, including dichotomous, nominal, ordinal, and cardinal values. Thus it is important to be able to represent the discrete components (categories) in a flexible way. Here we propose a nonparametric Bayesian approach to learning "plastic" discrete components by considering the uncertainty of the number of components with the Indian buffet processes (IBP). As observation models, we use the product of experts (PoE) to utilize sharper representation power and sparse over-completeness. We apply the proposed method to optical hand-written digit datasets and demonstrate its capability of finding flexible global-to-local components that can be used to describe and generate the observed digit images faithfully.
• Racing Tracks Improvisation (Wang and Missura) [pdf][supplement]
Procedural content generation is a popular technique in the game development. One of its typical applications is generation of game levels. This paper presents a method to generate tracks for racing games, by viewing racing track generation as a discrete sequence prediction problem. To solve it we combine two techniques from music improvisation. We show that this method is capable of generating new racing tracks which appear to be interesting enough.
• STONES: Stochastic Technique for Generating Songs (Kamp and Manea) [pdf]
We propose a novel approach for automatically constructing new songs from a set of given compositions that involves sampling a melody line as well as the corresponding harmonies given by chords. The song is sampled from a hierarchical Markov model that captures the implicit properties of good composed songs from a set of existing ones. We empirically show that songs generated by our approach are closer to music composed by humans than those of existing methods.
• Constructing Cocktails from a Cocktail Map (Paurat, Garnett, and Gärtner) [pdf]
Consider a dataset that describes cocktails by the amount of ingredients used and a lower dimensional embedding of it that can be considered a map of cocktails. The problem we tackle is to query an arbitrary point of interest in this lower dimensional embedding and retrieve a newly constructed cocktail which embeds to that queried location. To do so, we formulate the task as a constrained optimization problem and consider the resulting ingredient mix as a 'hot' candidate. Starting off with a very basic formulation that merely demands the necessities of our problem to be fulfilled, we incorporate additional desired conditions into the problem formulation and compare the resulting cocktail recipes.
• Supervised graph summarization for structuring academic search results (Mirylenka and Passerini) [pdf]
In this paper we address the problem of visualizing the query results of the academic search services. We suggest representing the search results as concise topic hierarchies, and propose a method of building such hierarchies through summarization of the intermediate large topic graphs. We describe a supervised learning technique for summarizing the topic graphs in the most informative way using sequential structured prediction, and discuss our ongoing work on the interactive acquisition of the training examples.
• Hybrid SRL with Optimization Modulo Theories (Teso, Sebastiani, and Passerini) [pdf]
Generally speaking, the goal of constructive learning could be seen as, given an example set of structured objects, to generate novel objects with similar properties. From a statistical-relational learning (SRL) viewpoint, the task can be interpreted as a constraint satisfaction problem, i.e. the generated objects must obey a set of soft constraints, whose weights are estimated from the data. Traditional SRL approaches rely on (finite) First-Order Logic (FOL) as a description language, and on MAX-SAT solvers to perform inference. Alas, FOL is unsuited for constructive problems where the objects contain a mixture of Boolean and numerical variables. It is in fact difficult to implement, e.g. linear arithmetic constraints within the language of FOL. In this paper we propose a novel class of hybrid SRL methods that rely on Satisfiability Modulo Theories, an alternative class of formal languages that allow to describe, and reason over, mixed Boolean-numerical objects and constraints. The resulting methods, which we call Learning Modulo Theories, are formulated within the structured output SVM framework, and employ a weighted SMT solver as an optimization oracle to perform efficient inference and discriminative max margin weight learning. We also present a few examples of constructive learning applications enabled by our method.
1. Varun Aggarwal, Shashank Srikant, and Vinay Shashidhar
Principles for using Machine Learning in the Assessment of Open Response Items: Programming Assessment as a Case Study
2. Sumit Basu, Chuck Jacobs and Lucy Vanderwende
3. Franck Dernoncourt, Choung Do, Sherif Halawa, Una-May O’Reilly, Colin Taylor, Kalyan Veeramachaneni and Sherwin Wu
MOOCVIZ: A Large Scale, Open Access,Collaborative, Data Analytics Platform for MOOCs
4. Jorge Diez, Oscar Luaces, Amparo Alonso-Betanzos, Alicia Troncoso and Antonio Bahamonde
Peer Assessment in MOOCs Using Preference Learning via Matrix Factorization
5. Stephen E. Fancsali
Data-driven causal modeling of “gaming the system” and off-task behavior in Cognitive Tutor Algebra
6. Damien Follet
A three-steps classification algorithm to assist criteria grid assessment
7. Peter W. Foltz and Mark Rosenstein
Tracking Student Learning in a State-Wide Implementation of Automated Writing Scoring
8. Jose P. Gonzalez-Brenes, Yun Huang and Peter Brusilovsky
FAST: Feature-Aware Student Knowledge Tracing
9. Fang Han, Kalyan Veeramachaneni and Una-May O’Reilly
Analyzing student behavior during problem solving in MOOCs
10. Mohammad Khajah, Rowan M. Wing, Robert V. Lindsey and Michael C. Mozer
Incorporating Latent Factors Into Knowledge Tracing To Predict Individual Differences In Learning
11. Robert V. Lindsey, Jeff D. Shroyer, Harold Pashler and Michael C. Mozer
Improving students’ long-term knowledge retention through personalized review
12. Yun-En Liu, Travis Mandel, Zoran Popovic and Emma Brunskill
Towards Automatic Experimentation of Educational Knowledge
13. Andras Lorincz, Gyongyver Molnar, Laszlo A. Jeni, Zoltan Toser, Attila Rausch and Jeffrey F. Cohn
Towards entertaining and efficient educational games
14. Travis Mandel, Yun-En Liu, Zoran Popovic, Sergey Levin and Emma Brunskill
Unbiased Offline Evaluation of Policy Representations for Educational Games
15. Sergiy Nesterko, Svetlana Dotsenko, Qiuyi Hu, Daniel Seaton, Justin Reich, Isaac Chuang, and Andrew Ho
Evaluating Geographic Data in MOOCs
16. Andy Nguyen, Christopher Piech, Jonathan Huang and Leonidas Guibas
Codewebs: Scalable Code Search for MOOCs
17. Zachary A. Pardos
Simulation study of a HMM based automatic resource recommendation system
18. Arti Ramesh, Dan Goldwasser, Bert Huang, Snigdha Chaturvedi, Hal Daume III and Lise Getoor
Modeling Learner Engagement in MOOCs using Probabilistic Soft Logic
19. Nihar B. Shah, Joseph K. Bradley, Abhay Parekh, Martin Wainwright and Kannan Ramchandran
A Case for Ordinal Peer Evaluation in MOOCs
20. Adish Singla, Ilija Bogunovic, Gabor Bartok, Amin Karbasi and Andreas Krause
On Actively Teaching the Crowd to Classify
21. Glenda S. Stump, Jennifer DeBoer, Jonathan Whittinghill and Lori Breslow
Development of a Framework to Classify MOOC Discussion Forum Posts: Methodology and Challenges
22. Weiyi Sun, Siwei Lyu, Hui Jin and Jianwei Zhang
Analyzing Online Learning Discourse using Probabilistic Topic Models
23. Joseph Jay Williams
Applying Cognitive Science to Online Learning
24. Joseph Jay Williams and Betsy Williams
Using Interventions to Improve Online Learning
25. Diyii Yang, Tanmay Sinha, David Adamson and Carolyn Penstein Rose
“Turn on, Tune in, Drop out”: Anticipating student dropouts in Massive Open Online Courses

### Poster Session I

Yuxin Chen, Hiroaki Shioi, Cesar Antonio Fuentes Montesinos, Lian Pin Koh, Serge Wich, Andreas Krause.
Active Detection for Biodiversity Monitoring via Adaptive Submodularity.

Christopher R. Dance, Stephane Clinchant, Onno R. Zoeter.
Approximate Inference for a Non-Homogeneous Poisson Model of On-Street Parking. [pdf]

George Mathews, John Vial, Sanjeev Jha, Gregoire Mariethoz, Nickens Okello, Suhinthan Maheswararajah, Dom De Re, Michael Smith.
Bayesian Inference of the Hydraulic Properties of Deep Geological Formations.

Simon O’Callaghan, Alistair Reid, Lachlan McCalman, Edwin V. Bonilla, Fabio Ramos
Bayesian Joint Inversions for the Exploration and Characterization of Geothermal Targets. [pdf]

Jun Yu, Weng-Keen Wong, Steve Kelling.
Clustering Species Accumulation Curves to Identify Groups of Citizen Scientists with Similar Skill Levels. [pdf]

Kalyan Veeramachaneni, Teasha Feldman-Fitzthum, Una-May O’Reilly, Alfredo Cuesta-Infante.
Copula-Based Wind Resource Assessment. [pdf]

Danny Panknin, Tammo Krueger, Mikio Braun, Klaus-Robert Muller, Siegmund Duell.
Detecting changes in Wind Turbine Sensory Data. [pdf]

Shan Xue, Alan Fern, Daniel Sheldon.
Dynamic Resource Allocation for Optimizing Population Diffusion.

Nidhi Singh.
Green-Aware Workload Prediction for Non-stationary Environments.

Mingjun Zhong, Nigel Goddard, Charles Sutton.
Interleaved Factorial Non-Homogeneous Hidden Markov Models for Energy Disaggregation. [pdf]

### Poster Session II

Tao Sun, Daniel Sheldon, Akshat Kumar.
Message Passing for Collective Graphical Models. [pdf]

Jun Yu, Rebecca A. Hutchinson, Weng-Keen Wong.
Modeling Misidentification of Bird Species by Citizen Scientists. [pdf]

Anna Ogawa, Akiko Takeda, Toru Namerikawa.
Photovoltaic Output Prediction Using Auto-regression with Support Vector Machine. [pdf]

Rebecca A. Hutchinson, Thomas G. Dietterich.
Posterior Regularization for Occupancy Models.

Xiaojian Wu , Daniel Sheldon, Shlomo Zilberstein.
Stochastic Network Design for River Networks. [pdf]

Daniel Urieli, Peter Stone.

Bingsheng Wang, Haili Dong, Chang-Tien Lu.
Using Step Variant Convolutional Neural Networks for Energy Disaggregation. [pdf]

Angela Fernandez, Carlos M. Alaiz, Ana M. Gonzalez, Julia Diaz, Jose R. Dorronsoro
Local Anisotropic Diffusion Detection of Wind Ramps. [pdf]

Climate Prediction via Matrix Completion. [pdf]

• 8:20-8:40 Factorie
• 8:40-9:00 pySPACE
• 9:00 - 9:30 Coffee break
• 9:30 - 10:30 Demos (15 min for highlights)

#### AFTERNOON SESSION (3:30-6:30)

Domain Adaptation as Learning with Auxiliary Information
Shai Ben-David, Ruth Urner

Sample Complexity of Sequential Multi-task Reinforcement Learning
Emma Brunskill, Lihong Li

Sequential Transfer in Multi-armed Bandit with Logarithmic Transfer Regret
Mohammad Gheshlaghi Azar, Alessandro Lazaric, Emma Brunskill

Class-wise Density-ratios for Covariate Shift
Yun-Qian Miao, Ahmed K. Farahat, Mohamed S. Kamel

Domain adaptation for sequence labeling using hidden Markov models

Edouard Grave, Guillaume Obozinski,  Francis Bach

Retrieval of Experiments: Sequential Dirichlet Process Mixtures in Model Space
Ritabrata Dutta, Sohan Seth, Samuel Kaski

Meenakshi Mishra, Jun Huan

Restricted Transfer Learning for Text Categorization
Rajhans Samdani, Gideon Mann

Transform-based Domain Adaptation for Big Data
Erik Rodner, Judy Hoffman, Trevor Darrell, Jeff Donahue, Kate Saenko

A PAC-Bayesian bound for Lifelong Learning
Anastasia Pentina, Christoph H. Lampert

Jiaolong Xu, Sebastian Ramos, Xu Hu, David Vazquez, Antonio M. Lopez

Tree-Based Ensemble Multi-Task Learning Method for Classification and Regression
Jaak Simm, Ildefons Magrans de Abril, Masashi Sugiyama

Emilie Morvant

Multilinear Spectral Regularization for Kernel-based Multitask Learning
Marco Signoretto, Johan A.K. Suykens
Reinforcement Learning with Multi-Fidelity Simulators

Sameer Singh, Sebastian Riedel, and Andrew McCallum. Anytime belief propagation using sparse domains.

W00085459.jpg was taken on December 02, 2013 and received on Earth December 04, 2013. The camera was pointing toward SATURN at approximately 710,353 miles (1,143,202 kilometers) away, and the image was taken using the MT2 and CL2 filters. This image has not been validated or calibrated.

Image Credit: NASA/JPL/Space Science Institute

Join the CompressiveSensing subreddit or the Google+ Community and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.