Abstract
Locality-sensitive hashing (LSH) is an efficient method for approximate similarity search in high-dimensional spaces. With LSH, a high-dimensional similarity join can be processed in the same way as a hash join, making the cost of joining two large datasets linear. By judiciously analyzing the properties of multiple LSH algorithms, we propose a generic method to accelerate the process of joining two large datasets using LSH. The crux of our method lies in the way we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyses show that our proposed method can greatly reduce the number of lookup operations while retaining the same result accuracy as executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), the Jaccard similarity measure (MinHash), and the Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyses and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to perform LSH lookups for at most 15% of the query points.
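As a rough illustration of the baseline the abstract refers to (an LSH join processed like a hash join, with one lookup per query point), the following is a minimal sketch for the Jaccard case using MinHash with banding. It is not the paper's method: the representative-point selection is not described in the abstract and is not reproduced here. The function names (`make_minhash`, `band_keys`, `lsh_join`) and the parameters `num_hashes`, `bands`, and `seed` are assumptions made for this example, and records are assumed to be non-empty sets of integers.

```python
import random
from collections import defaultdict


def make_minhash(num_hashes, seed=0):
    """Return a function mapping a non-empty set of ints to a MinHash signature."""
    rng = random.Random(seed)
    prime = (1 << 61) - 1  # large prime modulus for the linear hash functions
    params = [(rng.randrange(1, prime), rng.randrange(0, prime))
              for _ in range(num_hashes)]

    def signature(items):
        # One minimum per hash function h(x) = (a*x + b) mod prime.
        return tuple(min((a * x + b) % prime for x in items) for a, b in params)

    return signature


def band_keys(sig, bands):
    """Split a signature into `bands` equal slices; each slice is one bucket key."""
    rows = len(sig) // bands
    return [(i, sig[i * rows:(i + 1) * rows]) for i in range(bands)]


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify candidate pairs."""
    return len(a & b) / len(a | b)


def lsh_join(left, right, threshold, num_hashes=64, bands=16, seed=0):
    """Approximate similarity join: pairs (lid, rid) with Jaccard >= threshold.

    Build phase: hash every record of `right` into band buckets.
    Probe phase: look up every record of `left` (one lookup per query point,
    i.e. the baseline, not the representative-point optimization).
    """
    signature = make_minhash(num_hashes, seed)

    table = defaultdict(set)
    for rid, items in right.items():
        for key in band_keys(signature(items), bands):
            table[key].add(rid)

    results = set()
    for lid, items in left.items():
        candidates = set()
        for key in band_keys(signature(items), bands):
            candidates |= table.get(key, set())
        # Verify candidates exactly so false positives are filtered out.
        for rid in candidates:
            if jaccard(items, right[rid]) >= threshold:
                results.add((lid, rid))
    return results
```

Calling `lsh_join(left, right, threshold=0.5)` with two dictionaries that map record ids to integer sets returns the set of verified id pairs. Because every record in `left` triggers a lookup, the probe loop is the cost that a representative-point scheme would aim to reduce.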
Original language | English |
---|---|
Title of host publication | Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017 |
Subtitle of host publication | 19-22 April 2017, San Diego, California, USA |
Editors | Yannis Papakonstantinou, Yanlei Diao |
Place of Publication | Piscataway NJ USA |
Publisher | IEEE, Institute of Electrical and Electronics Engineers |
Pages | 29-30 |
Number of pages | 2 |
ISBN (Electronic) | 9781509065431 |
Publication status | Published - 2017 |
Externally published | Yes |
Event | IEEE International Conference on Data Engineering 2017, Hilton San Diego Resort and Spa in Mission Bay, San Diego, United States of America; Duration: 19 Apr 2017 → 22 Apr 2017; Conference number: 33rd; http://icde2017.sdsc.edu/ (Conference website); https://ieeexplore.ieee.org/xpl/conhome/7929494/proceeding (Proceedings) |
Conference
Conference | IEEE International Conference on Data Engineering 2017 |
---|---|
Abbreviated title | ICDE 2017 |
Country/Territory | United States of America |
City | San Diego |
Period | 19/04/17 → 22/04/17 |