Schema-agnostic entity matching using pre-trained language models

Kai Sheng Teong, Layki Soon, Tin Tin Su

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

12 Citations (Scopus)


Entity matching (EM) is the process of linking records from different data sources. Although EM has been studied extensively, many studies assume that EM tasks are schema-specific and attempt to match record pairs at the attribute level. In the real world, however, tables that undergo EM may not have aligned schemas, and often the schema or metadata of the tables and attributes is not known beforehand. In view of this challenge, this paper presents an effective approach for schema-agnostic EM, in which schema-aligned tables are not required. The proposed method stems from the idea of treating tuples in tables for EM as a sentence pair classification problem in natural language processing (NLP). A pre-trained language model, BERT, is adopted and fine-tuned on a labeled dataset. The proposed method was evaluated on benchmark datasets and compared against two state-of-the-art approaches, namely DeepMatcher and Magellan. The experimental results show that the proposed solution outperforms them by an average of 9% in F1 score. The performance is consistent across different types of datasets, with a significant improvement of 29.6% on one of the dirty datasets. These results show that the proposed solution is versatile for EM.
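The idea of treating a pair of tuples as a sentence pair can be illustrated with a minimal sketch. The abstract does not specify how records are serialized, so the scheme below (concatenating attribute values and ignoring attribute names) is an assumption, and the function names and sample records are hypothetical.

```python
from typing import Mapping, Tuple


def serialize_record(record: Mapping[str, object]) -> str:
    """Flatten a record into one string, ignoring attribute names.

    Assumption: values are joined with single spaces in their stored order;
    missing (None) or empty values are skipped.
    """
    values = (str(v).strip() for v in record.values() if v is not None)
    return " ".join(v for v in values if v)


def to_sentence_pair(left: Mapping[str, object],
                     right: Mapping[str, object]) -> Tuple[str, str]:
    """Turn two records into a (text_a, text_b) pair for a pair classifier."""
    return serialize_record(left), serialize_record(right)


# Two hypothetical product records whose schemas are not aligned.
left = {"title": "iPhone 12 64GB", "brand": "Apple", "price": 799}
right = {"name": "Apple iPhone 12", "description": "64 GB, black", "cost": None}

text_a, text_b = to_sentence_pair(left, right)
print(text_a)
print(text_b)
```

Because only the values are used, the two records need not share attribute names or the same number of attributes. The resulting pair could then be passed to a BERT tokenizer as the two segments of a sentence pair input and classified as match or non-match by a fine-tuned model; that step is omitted here.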

Original language: English
Title of host publication: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
Editors: Claudia Hauff, Edward Curry, Philippe Cudre Mauroux
Place of Publication: New York NY USA
Publisher: Association for Computing Machinery (ACM)
Number of pages: 4
ISBN (Electronic): 9781450368599
Publication status: Published - 2020
Event: ACM International Conference on Information and Knowledge Management 2020 - Virtual, Online, Ireland
Duration: 19 Oct 2020 to 23 Oct 2020
Conference number: 29th


Conference: ACM International Conference on Information and Knowledge Management 2020
Abbreviated title: CIKM 2020
City: Virtual, Online


  • language models
  • schema agnostic, entity matching
