Tuesday, March 08, 2016

Exact Weighted Minwise Hashing in Constant Time

As I was featuring one of his works, Anshu sent me the following:
Dear Igor,

I am a regular follower of your blog.

I think the audience may be interested in two of our recent results. Take a look:

1) Exact and Constant Time Weighted Minwise Hashing (Can be 60,000 times faster than Consistent Weighted Sampling) http://arxiv.org/abs/1602.08393


2) Training Deep Networks using LSH. Generally saves around 95% of computation and ideal for asynchronous training. http://arxiv.org/pdf/1602.08194v1.pdf

Thanks
Anshu
Thanks Anshu! We featured the second work just yesterday; here is the first:



Weighted minwise hashing (WMH) is one of the fundamental subroutines required by many celebrated approximation algorithms, commonly adopted in industrial practice for large-scale search and learning. The resource bottleneck of these algorithms is the computation of multiple (typically a few hundred to a few thousand) independent hashes of the data. The fastest hashing algorithm is by Ioffe (ICDM 2010), which requires one pass over the entire data vector, O(d) (d is the number of non-zeros), for computing one hash. However, the requirement of multiple hashes demands hundreds or thousands of passes over the data. This is very costly for modern massive datasets.
In this work, we break this expensive barrier and show an expected constant amortized time algorithm which computes k independent and unbiased WMH in time O(k), instead of the O(dk) required by Ioffe's method. Moreover, our proposal only needs a few bits (5-9 bits) of storage per hash value, compared to around 64 bits required by state-of-the-art methodologies. Experimental evaluations on real datasets show that, for computing 500 WMH, our proposal can be 60,000x faster than Ioffe's method without losing any accuracy. Our method is also around 100x faster than approximate heuristics capitalizing on the efficient "densified" one permutation hashing schemes (ICML 2014). Given the simplicity of our approach and its significant advantages, we hope that it will replace existing implementations in practice.
Related earlier:
Improved Consistent Sampling, Weighted Minhash and L1 Sketching, Sergey Ioffe 
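For readers unfamiliar with the baseline, here is a minimal Python sketch of Ioffe-style consistent weighted sampling (CWS), the method the new paper speeds up. The function names, the use of NumPy, and the seeding scheme are my own illustrative choices, and non-negative weight vectors are assumed; this is not the authors' constant-time algorithm, only a way to see the O(d)-per-hash cost it removes.

```python
import numpy as np

def cws_hash(weights, seed):
    """One consistent weighted sampling (CWS) hash in the style of
    Ioffe (ICDM 2010). Cost is linear in the vector length per hash,
    which is the bottleneck the paper above removes. Assumes a
    non-negative weight vector; illustrative sketch only."""
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)             # same seed => same draws, so hashes are "consistent" across vectors
    d = len(weights)
    r = rng.gamma(2.0, 1.0, size=d)               # r_i ~ Gamma(2, 1)
    c = rng.gamma(2.0, 1.0, size=d)               # c_i ~ Gamma(2, 1)
    beta = rng.uniform(0.0, 1.0, size=d)          # beta_i ~ Uniform(0, 1)
    idx = np.flatnonzero(weights)                 # only non-zero coordinates compete
    s = weights[idx]
    t = np.floor(np.log(s) / r[idx] + beta[idx])  # quantized log-weight level
    y = np.exp(r[idx] * (t - beta[idx]))
    a = c[idx] / (y * np.exp(r[idx]))
    k = np.argmin(a)                              # winning coordinate
    return (int(idx[k]), int(t[k]))               # the hash value: (feature id, level)

def estimate_weighted_jaccard(x, y, num_hashes=500):
    """Fraction of matching hashes estimates sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    matches = sum(cws_hash(x, s) == cws_hash(y, s) for s in range(num_hashes))
    return matches / num_hashes

# Example: the estimate should be close to the exact weighted Jaccard, 2/5 = 0.4 here.
x = np.array([1.0, 2.0, 0.0, 0.0])
y = np.array([0.0, 2.0, 1.0, 1.0])
print(estimate_weighted_jaccard(x, y))
```

In this sketch the per-coordinate randomness is regenerated from the seed for every call; a real implementation would derive r_i, c_i, beta_i from a hash of (seed, i) so that only the non-zero entries are ever touched. Either way, each of the k hashes still costs a full pass over the non-zeros, which is exactly the O(dk) total cost the paper reduces to amortized O(k).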

 Some of Anshu's earlier publications include:


  • Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS). [pdf]
    Anshumali Shrivastava and Ping Li.
    Conference on Uncertainty in Artificial Intelligence (UAI) 2015.
  • Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment. [pdf][slides]
    Anshumali Shrivastava and Ping Li.
    International World Wide Web Conference (WWW) 2015.
  • Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). [pdf][slides][video]
    Anshumali Shrivastava and Ping Li.
    Neural Information Processing Systems (NIPS) 2014.
    Best Paper Award.
  • A New Space for Comparing Graphs. [pdf] [slides]
    Anshumali Shrivastava and Ping Li.
    IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) 2014.
    Best Paper Award.
  • Improved Densification of One Permutation Hashing. [pdf]
    Anshumali Shrivastava and Ping Li.
    Conference on Uncertainty in Artificial Intelligence (UAI) 2014.
  • In Defense of Minhash over Simhash. [pdf] [slides]
    Anshumali Shrivastava and Ping Li.
    International Conference on Artificial Intelligence and Statistics (AISTATS) 2014.
  • Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search. [pdf][slides][video]
    Anshumali Shrivastava and Ping Li.
    International Conference on Machine Learning (ICML) 2014.
  • Codings for Random Projections. [pdf]
    Ping Li, Michael Mitzenmacher and Anshumali Shrivastava.
    International Conference on Machine Learning (ICML) 2014.
  • Beyond Pairwise: Provably Fast Algorithms for Approximate k-Way Similarity Search. [pdf] [slides]
    Anshumali Shrivastava and Ping Li.
    Neural Information Processing Systems (NIPS) 2013.
  • Fast Near Neighbor Search in High-Dimensional Binary Data. [pdf] [slides]
    Anshumali Shrivastava and Ping Li.
    European Conference on Machine Learning (ECML) 2012.
    Top few papers invited for journal submission
  • Fast multi-task learning for query spelling correction. [pdf]
    Xu Sun, Anshumali Shrivastava and Ping Li.
    ACM International Conference on Information and Knowledge Management (CIKM) 2012.
  • GPU-based minwise hashing. [pdf]
    Ping Li, Anshumali Shrivastava and Christian Konig.
    International World Wide Web Conference (WWW) (Companion Volume) 2012.
  • Query spelling correction using multi-task learning. [pdf]
    Xu Sun, Anshumali Shrivastava and Ping Li.
    International World Wide Web Conference (WWW) (Companion Volume) 2012.
  • Hashing Algorithms for Large Scale Learning [pdf]
    Ping Li, Anshumali Shrivastava, Joshua Moore and Christian Konig.
    Neural Information Processing Systems (NIPS) 2011.

