Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction

Mohan Raj Chanthran, Lay-Ki Soon, Huey Fang Ong, Bhawani Selvaretnam

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

Abstract

Standard English and Malaysian English exhibit notable differences, posing challenges for natural language processing (NLP) tasks on Malaysian English. An experiment applying state-of-the-art Named Entity Recognition (NER) solutions to Malaysian English news articles shows that they cannot handle the morphosyntactic variations in Malaysian English. Most existing datasets are based on Standard English, which is not sufficient to improve NLP tasks in Malaysian English. To the best of our knowledge, there is no annotated dataset that can be used to improve these models. To address this issue, we have constructed a Malaysian English News (MEN) dataset, which contains 200 news articles that are manually annotated with entities and relations. We then fine-tuned the spaCy NER tool and validated that a dataset tailor-made for Malaysian English could significantly improve the performance of NER in Malaysian English. This paper presents our data acquisition efforts, the annotation methodology, and a detailed analysis of the annotated dataset. To ensure the quality of the annotation, we measured the Inter-Annotator Agreement (IAA), and any disagreements were resolved by a subject matter expert through adjudication. After a rigorous quality check, we have developed a dataset with 6,061 entities and 3,268 relation instances. Finally, we discuss the spaCy fine-tuning setup and analyze NER performance. This unique dataset will contribute significantly to the advancement of NLP research in Malaysian English, allowing researchers to accelerate their progress, particularly in NER and relation extraction.

Original language: English
Title of host publication: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) - Main Conference Proceedings
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Place of Publication: Paris, France
Publisher: European Language Resources Association (ELRA)
Pages: 10999-11022
Number of pages: 24
ISBN (Electronic): 9782493814104
Publication status: Published - 2024
Event: Joint International Conference on Computational Linguistics and International Conference on Language Resources and Evaluation 2024 - Hybrid, Torino, Italy
Duration: 20 May 2024 to 25 May 2024
https://aclanthology.org/volumes/2024.lrec-main/ (Proceedings)
https://lrec-coling-2024.org/ (Website)

Conference

Conference: Joint International Conference on Computational Linguistics and International Conference on Language Resources and Evaluation 2024
Abbreviated title: LREC-COLING 2024
Country/Territory: Italy
City: Hybrid, Torino
Period: 20/05/24 to 25/05/24

Keywords

  • Annotated Dataset
  • Low-Resource Language
  • Malaysian English
  • Named Entity Recognition
  • Relation Extraction
