XL-Sum: large-scale multilingual abstractive summarization for 44 languages

Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Islam, Kazi Samin, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, Rifat Shahriyar

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research

Abstract

Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model, with XL-Sum and experiment on multilingual and low-resource summarization tasks. XL-Sum induces competitive results compared to the ones obtained using similar monolingual datasets: we show higher than 11 ROUGE-2 scores on 10 languages we benchmark on, with some of them exceeding 15, as obtained by multilingual training. Additionally, training on low-resource languages individually also provides competitive performance. To the best of our knowledge, XL-Sum is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered. We are releasing our dataset and models to encourage future research on multilingual abstractive summarization. The resources can be found at https://github.com/csebuetnlp/xl-sum.

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics
Subtitle of host publication: ACL-IJCNLP 2021
Editors: Fei Xia, Wenjie Li, Roberto Navigli
Place of Publication: Stroudsburg PA USA
Publisher: Association for Computational Linguistics (ACL)
Pages: 4693–4703
Number of pages: 11
ISBN (Electronic): 9781954085541
Publication status: Published - 2021
Event: Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing 2021 - Online, Bangkok, Thailand
Duration: 1 Aug 2021 – 6 Aug 2021
Conference number: 59th & 11th
https://aclanthology.org/2021.acl-long.0/ (Proceedings)
https://2021.aclweb.org (Website)

Conference

Conference: Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing 2021
Abbreviated title: ACL-IJCNLP 2021
Country: Thailand
City: Bangkok
Period: 1/08/21 – 6/08/21
