Self-alignment pretraining for biomedical entity representations

Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, Nigel Collier

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review


Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking, where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvement over various domain-specific pretrained MLMs such as BioBERT, SciBERT and PubMedBERT, our pretraining scheme proves to be both effective and robust.
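The self-alignment idea can be illustrated with a toy objective: names that share a UMLS concept identifier (CUI) are synonyms and should end up closer in embedding space than names from different concepts. The sketch below uses a simple hinge-based contrastive penalty over cosine similarities; this is a hypothetical simplification for illustration, not SapBERT's actual loss (the paper uses a multi-similarity loss with online hard pair mining over BERT encodings).

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between row vectors (toy entity embeddings)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def self_alignment_loss(embeddings, cuis, margin=0.2):
    """Toy self-alignment objective: for each anchor name, every synonym
    (same CUI) should be at least `margin` more similar than every
    non-synonym (different CUI). Returns the mean hinge penalty."""
    S = cosine_sim_matrix(embeddings)
    n = len(cuis)
    penalties = []
    for i in range(n):
        for j in range(n):
            if i == j or cuis[i] != cuis[j]:
                continue  # keep only synonym (positive) pairs (i, j)
            for k in range(n):
                if cuis[k] == cuis[i]:
                    continue  # k must belong to a different concept
                # hinge: positive similarity should exceed negative by `margin`
                penalties.append(max(0.0, margin - S[i, j] + S[i, k]))
    return float(np.mean(penalties))

# Hypothetical 2-D embeddings for four surface names of two concepts:
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
cuis = ["C0004238", "C0004238", "C0027051", "C0027051"]
print(self_alignment_loss(emb, cuis))  # well-separated concepts → 0.0
```

In the actual pretraining scheme this penalty would be minimized over minibatches of UMLS name pairs, with gradients flowing back into the MLM encoder so that synonymous biomedical names self-align.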
Original language: English
Title of host publication: The 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Subtitle of host publication: Proceedings of the Conference
Editors: Dilek Hakkani-Tur, Anna Rumshisky, Luke Zettlemoyer
Place of Publication: Stroudsburg PA USA
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 11
ISBN (Electronic): 9781954085466
Publication status: Published - Jun 2021
Event: North American Association for Computational Linguistics 2021: Human Language Technologies - Online, United States of America
Duration: 6 Jun 2021 - 11 Jun 2021


Conference: North American Association for Computational Linguistics 2021
Abbreviated title: NAACL-HLT 2021
Country: United States of America