EISA: an efficient information theoretical approach to value segmentation in large databases

Weiqing Wang, Shazia Sadiq, Xiaofang Zhou

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

Abstract

Value disparity is a widely known problem that contributes to poor data quality results and raises many issues in data integration tasks. Value disparity, also known as column heterogeneity, occurs when the same entity is represented by disparate values, often within the same column in a database table. A first step in overcoming value disparity is to identify the distinct segments. This is a highly challenging task due to the high number of features that define a particular segment, as well as the need to undertake value comparisons, which can be exponential in large databases. In this paper, we propose an efficient information theoretical approach to value segmentation, namely EISA. EISA not only reduces the number of relevant features but also compresses the size of the values to be segmented. We have applied our method to three datasets of varying sizes. Our experimental evaluation of the method demonstrates a high level of accuracy with reasonable efficiency.

Original language: English
Title of host publication: Web Technologies and Applications
Subtitle of host publication: 16th Asia-Pacific Web Conference, APWeb 2014, Changsha, China, September 5-7, 2014, Proceedings
Editors: Lei Chen, Yan Jia, Timos Sellis, Guanfeng Liu
Place of Publication: Cham, Switzerland
Publisher: Springer
Pages: 224-235
Number of pages: 12
ISBN (Electronic): 9783319111162
ISBN (Print): 9783319111155
DOIs: https://doi.org/10.1007/978-3-319-11116-2_20
Publication status: Published - 2014
Externally published: Yes
Event: Asia Pacific Web Conference 2014 - Changsha, China
Duration: 5 Sep 2014 to 7 Sep 2014
Conference number: 16th
https://www.cse.ust.hk/apweb2014/

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 8709
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: Asia Pacific Web Conference 2014
Abbreviated title: APWeb 2014
Country: China
City: Changsha
Period: 5/09/14 to 7/09/14
Internet address: https://www.cse.ust.hk/apweb2014/

Keywords

  • attribute
  • data profiling
  • data quality
  • information theory
  • large database
  • segmentation

Cite this

Wang, W., Sadiq, S., & Zhou, X. (2014). EISA: an efficient information theoretical approach to value segmentation in large databases. In L. Chen, Y. Jia, T. Sellis, & G. Liu (Eds.), Web Technologies and Applications: 16th Asia-Pacific Web Conference, APWeb 2014, Changsha, China, September 5-7, 2014, Proceedings (pp. 224-235). (Lecture Notes in Computer Science; Vol. 8709). Springer. https://doi.org/10.1007/978-3-319-11116-2_20