Abstract
Entity matching (EM) is the process of linking records from different data sources. Although EM has been studied extensively, many studies treat it as a schema-specific task and match record pairs at the attribute level. In real-world settings, however, the tables to be matched may not have aligned schemas, and the schema or metadata of the tables and attributes is often not known beforehand. To address this challenge, this paper presents an effective approach for schema-agnostic EM that does not require schema-aligned tables. The proposed method stems from the idea of treating tuple pairs in EM as a sentence pair classification problem in natural language processing (NLP). A pre-trained language model, BERT, is adopted and fine-tuned on a labeled dataset. The method was evaluated on benchmark datasets and compared against two state-of-the-art approaches, namely DeepMatcher and Magellan. The experimental results show that the proposed solution outperforms these approaches by an average of 9% in F1 score. The performance is consistent across different types of datasets, with a significant improvement of 29.6% on one of the dirty datasets. These results show that the proposed solution is versatile for EM.
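
The idea of casting a pair of tuples as a sentence pair can be illustrated with a minimal sketch. The abstract does not specify how records are serialized, so the code below assumes one simple schema-agnostic choice: attribute names are ignored and the non-empty values of each record are joined into a single text string. The function names `serialize_record` and `to_sentence_pair` and the sample records are hypothetical and not taken from the paper.

```python
from typing import Mapping


def serialize_record(record: Mapping[str, object]) -> str:
    """Flatten a record into one string, ignoring attribute names.

    Assumption: values are joined by spaces and empty or None values are
    dropped. The paper's actual serialization may differ.
    """
    values = (str(v).strip() for v in record.values() if v is not None)
    return " ".join(v for v in values if v)


def to_sentence_pair(left: Mapping[str, object], right: Mapping[str, object]) -> str:
    """Lay out two records in the BERT sentence-pair input format."""
    return f"[CLS] {serialize_record(left)} [SEP] {serialize_record(right)} [SEP]"


# Two records whose schemas are not aligned (different attribute names).
left = {"title": "iPhone 12", "brand": "Apple", "price": 799}
right = {"name": "Apple iPhone 12", "cost": None, "desc": "smartphone"}

pair = to_sentence_pair(left, right)
print(pair)
```

In practice, a BERT tokenizer would typically add the `[CLS]` and `[SEP]` markers itself when given the two serialized strings as a text pair, and the resulting input would be passed to a binary match/non-match classifier fine-tuned on labeled pairs.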
Original language | English |
---|---|
Title of host publication | Proceedings of the 29th ACM International Conference on Information & Knowledge Management |
Editors | Claudia Hauff, Edward Curry, Philippe Cudre Mauroux |
Place of Publication | New York NY USA |
Publisher | Association for Computing Machinery (ACM) |
Pages | 2241-2244 |
Number of pages | 4 |
ISBN (Electronic) | 9781450368599 |
Publication status | Published - 2020 |
Event | ACM International Conference on Information and Knowledge Management 2020 - Virtual, Online, Ireland; Duration: 19 Oct 2020 → 23 Oct 2020; Conference number: 29th; Proceedings: https://dl.acm.org/doi/proceedings/10.1145/3340531; Website: https://www.cikm2020.org/ |
Conference
Conference | ACM International Conference on Information and Knowledge Management 2020 |
---|---|
Abbreviated title | CIKM 2020 |
Country/Territory | Ireland |
City | Virtual, Online |
Period | 19/10/20 → 23/10/20 |
Keywords
- language models
- schema agnostic
- entity matching