A generic method for accelerating LSH-based similarity join processing

Chenyun Yu, Sarana Nutanong, Hangyu Li, Cong Wang, Xingliang Yuan

Research output: Contribution to journalArticleResearchpeer-review

Abstract

Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyzes show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyzes and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to operate LSH lookups for at most 15 percent of the query points.

LanguageEnglish
Article number7782356
Pages712-726
Number of pages15
JournalIEEE Transactions on Knowledge and Data Engineering
Volume29
Issue number4
DOIs
Publication statusPublished - 1 Apr 2017
Externally publishedYes

Keywords

  • locality sensitive hashing
  • query processing
  • representative selection
  • Similarity join

Cite this

Yu, Chenyun ; Nutanong, Sarana ; Li, Hangyu ; Wang, Cong ; Yuan, Xingliang. / A generic method for accelerating LSH-based similarity join processing. In: IEEE Transactions on Knowledge and Data Engineering. 2017 ; Vol. 29, No. 4. pp. 712-726.
@article{cfb9276097384792b2f8bbf3530d360e,
title = "A generic method for accelerating LSH-based similarity join processing",
abstract = "Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyzes show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyzes and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to operate LSH lookups for at most 15 percent of the query points.",
keywords = "locality sensitive hashing, query processing, representative selection, Similarity join",
author = "Chenyun Yu and Sarana Nutanong and Hangyu Li and Cong Wang and Xingliang Yuan",
year = "2017",
month = "4",
day = "1",
doi = "10.1109/TKDE.2016.2638838",
language = "English",
volume = "29",
pages = "712--726",
journal = "IEEE Transactions on Knowledge and Data Engineering",
issn = "1041-4347",
publisher = "IEEE, Institute of Electrical and Electronics Engineers",
number = "4",

}

A generic method for accelerating LSH-based similarity join processing. / Yu, Chenyun; Nutanong, Sarana; Li, Hangyu; Wang, Cong; Yuan, Xingliang.

In: IEEE Transactions on Knowledge and Data Engineering, Vol. 29, No. 4, 7782356, 01.04.2017, p. 712-726.

Research output: Contribution to journalArticleResearchpeer-review

TY - JOUR

T1 - A generic method for accelerating LSH-based similarity join processing

AU - Yu, Chenyun

AU - Nutanong, Sarana

AU - Li, Hangyu

AU - Wang, Cong

AU - Yuan, Xingliang

PY - 2017/4/1

Y1 - 2017/4/1

N2 - Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyzes show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyzes and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to operate LSH lookups for at most 15 percent of the query points.

AB - Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to speed up the process of joining two large datasets using LSH. The crux of our method lies in the way which we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyzes show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: the Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyzes and show significant improvements of our method over the state-of-the-art LSH method: to achieve over 0.95 recall, we only need to operate LSH lookups for at most 15 percent of the query points.

KW - locality sensitive hashing

KW - query processing

KW - representative selection

KW - Similarity join

UR - http://www.scopus.com/inward/record.url?scp=85015767457&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2016.2638838

DO - 10.1109/TKDE.2016.2638838

M3 - Article

VL - 29

SP - 712

EP - 726

JO - IEEE Transactions on Knowledge and Data Engineering

T2 - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

SN - 1041-4347

IS - 4

M1 - 7782356

ER -