COMETA: a corpus for medical Entity Linking in the social media

Marco Basaldella, Fangyu Liu, Ehsan Shareghi Nojehdeh, Nigel Collier

Research output: Chapter in Book/Report/Conference proceeding, Conference Paper, Research, peer-review

Abstract

Whilst there has been growing progress in Entity Linking (EL) for general language, existing datasets fail to address the complex nature of health terminology in layman’s language. Meanwhile, there is a growing need for applications that can understand the public’s voice in the health domain. To address this, we introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT, a widely-used medical knowledge graph. Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality, that to the best of our knowledge has not been met by any of the existing resources in the field. Through benchmark experiments on 20 EL baselines from string- to neural-based models we shed light on the ability of these systems to perform complex inference on entities and concepts under 2 challenging evaluation scenarios. Our experimental results on COMETA illustrate that no golden bullet exists and even the best mainstream techniques still have a significant performance gap to fill, while the best solution relies on combining different views of data.
Original language: English
Title of host publication: 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Editors: Trevor Cohn, Yulan He, Yang Liu
Place of Publication: Stroudsburg PA USA
Publisher: Association for Computational Linguistics (ACL)
Pages: 3122-3137
Number of pages: 16
ISBN (Electronic): 9781952148606
Publication status: Published - 2020
Event: Empirical Methods in Natural Language Processing 2020 - Virtual, Punta Cana, Dominican Republic
Duration: 16 Nov 2020 to 20 Nov 2020
https://2020.emnlp.org/ (Website)
http://www.aclweb.org/anthology/volumes/2020.emnlp-main/ (Proceedings)

Conference

Conference: Empirical Methods in Natural Language Processing 2020
Abbreviated title: EMNLP
Country: Dominican Republic
City: Punta Cana
Period: 16/11/20 to 20/11/20