A generic method for accelerating LSH-based similarity join processing (extended abstract)

Chenyun Yu, Sarana Nutanong, Hangyu Li, Cong Wang, Xingliang Yuan

Research output: Chapter in Book/Report/Conference proceedingConference PaperOtherpeer-review

Abstract

Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to accelerate the process of joining two large datasets using LSH. The crux of our method lies in the way we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyses show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: The Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyses and show significant improvements of our method over the state-of-The-Art LSH method: To achieve over 0.95 recall, we only need to operate LSH lookups for at most 15% of the query points.

LanguageEnglish
Title of host publicationProceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017
Subtitle of host publication19-22 April 2017, San Diego, California, USA
EditorsYannis Papakonstantinou, Yanlei Diao
Place of PublicationPiscataway NJ USA
PublisherIEEE, Institute of Electrical and Electronics Engineers
Pages29-30
Number of pages2
ISBN (Electronic)9781509065431
DOIs
Publication statusPublished - 2017
Externally publishedYes
EventIEEE International Conference on Data Engineering 2017 - Hilton San Diego Resort and Spa in Mission Bay, San Diego, United States
Duration: 19 Apr 201722 Apr 2017
Conference number: 33rd
http://icde2017.sdsc.edu/ (Conference website)

Conference

ConferenceIEEE International Conference on Data Engineering 2017
Abbreviated titleICDE 2017
CountryUnited States
CitySan Diego
Period19/04/1722/04/17
Internet address

Cite this

Yu, C., Nutanong, S., Li, H., Wang, C., & Yuan, X. (2017). A generic method for accelerating LSH-based similarity join processing (extended abstract). In Y. Papakonstantinou, & Y. Diao (Eds.), Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017: 19-22 April 2017, San Diego, California, USA (pp. 29-30). [7929917] Piscataway NJ USA: IEEE, Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/ICDE.2017.21
Yu, Chenyun ; Nutanong, Sarana ; Li, Hangyu ; Wang, Cong ; Yuan, Xingliang. / A generic method for accelerating LSH-based similarity join processing (extended abstract). Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017: 19-22 April 2017, San Diego, California, USA. editor / Yannis Papakonstantinou ; Yanlei Diao. Piscataway NJ USA : IEEE, Institute of Electrical and Electronics Engineers, 2017. pp. 29-30
@inproceedings{af1a909d6b594c098cbf4dd23d98d48f,
title = "A generic method for accelerating LSH-based similarity join processing (extended abstract)",
abstract = "Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to accelerate the process of joining two large datasets using LSH. The crux of our method lies in the way we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyses show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: The Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyses and show significant improvements of our method over the state-of-The-Art LSH method: To achieve over 0.95 recall, we only need to operate LSH lookups for at most 15{\%} of the query points.",
author = "Chenyun Yu and Sarana Nutanong and Hangyu Li and Cong Wang and Xingliang Yuan",
year = "2017",
doi = "10.1109/ICDE.2017.21",
language = "English",
pages = "29--30",
editor = "Papakonstantinou, {Yannis } and Diao, {Yanlei }",
booktitle = "Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017",
publisher = "IEEE, Institute of Electrical and Electronics Engineers",
address = "United States",

}

Yu, C, Nutanong, S, Li, H, Wang, C & Yuan, X 2017, A generic method for accelerating LSH-based similarity join processing (extended abstract). in Y Papakonstantinou & Y Diao (eds), Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017: 19-22 April 2017, San Diego, California, USA., 7929917, IEEE, Institute of Electrical and Electronics Engineers, Piscataway NJ USA, pp. 29-30, IEEE International Conference on Data Engineering 2017, San Diego, United States, 19/04/17. https://doi.org/10.1109/ICDE.2017.21

A generic method for accelerating LSH-based similarity join processing (extended abstract). / Yu, Chenyun; Nutanong, Sarana; Li, Hangyu; Wang, Cong; Yuan, Xingliang.

Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017: 19-22 April 2017, San Diego, California, USA. ed. / Yannis Papakonstantinou; Yanlei Diao. Piscataway NJ USA : IEEE, Institute of Electrical and Electronics Engineers, 2017. p. 29-30 7929917.

Research output: Chapter in Book/Report/Conference proceedingConference PaperOtherpeer-review

TY - GEN

T1 - A generic method for accelerating LSH-based similarity join processing (extended abstract)

AU - Yu, Chenyun

AU - Nutanong, Sarana

AU - Li, Hangyu

AU - Wang, Cong

AU - Yuan, Xingliang

PY - 2017

Y1 - 2017

N2 - Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to accelerate the process of joining two large datasets using LSH. The crux of our method lies in the way we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyses show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: The Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyses and show significant improvements of our method over the state-of-The-Art LSH method: To achieve over 0.95 recall, we only need to operate LSH lookups for at most 15% of the query points.

AB - Locality sensitive hashing (LSH) is an efficient method for solving the problem of approximate similarity search in high-dimensional spaces. Through LSH, a high-dimensional similarity join can be processed in the same way as hash join, making the cost of joining two large datasets linear. By judicially analyzing the properties of multiple LSH algorithms, we propose a generic method to accelerate the process of joining two large datasets using LSH. The crux of our method lies in the way we identify a set of representative points to reduce the number of LSH lookups. Theoretical analyses show that our proposed method can greatly reduce the number of lookup operations and retain the same result accuracy compared to executing LSH lookups for every query point. Furthermore, we demonstrate the generality of our method by showing that the same principle can be applied to LSH algorithms for three different metrics: The Euclidean distance (QALSH), Jaccard similarity measure (MinHash), and Hamming distance (sequence hashing). Results from experimental studies using real datasets confirm our error analyses and show significant improvements of our method over the state-of-The-Art LSH method: To achieve over 0.95 recall, we only need to operate LSH lookups for at most 15% of the query points.

UR - http://www.scopus.com/inward/record.url?scp=85021222852&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2017.21

DO - 10.1109/ICDE.2017.21

M3 - Conference Paper

SP - 29

EP - 30

BT - Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017

A2 - Papakonstantinou, Yannis

A2 - Diao, Yanlei

PB - IEEE, Institute of Electrical and Electronics Engineers

CY - Piscataway NJ USA

ER -

Yu C, Nutanong S, Li H, Wang C, Yuan X. A generic method for accelerating LSH-based similarity join processing (extended abstract). In Papakonstantinou Y, Diao Y, editors, Proceedings - 2017 IEEE 33rd International Conference on Data Engineering, ICDE 2017: 19-22 April 2017, San Diego, California, USA. Piscataway NJ USA: IEEE, Institute of Electrical and Electronics Engineers. 2017. p. 29-30. 7929917 https://doi.org/10.1109/ICDE.2017.21