CrossSum: Beyond English-centric cross-lingual summarization for 1,500+ language pairs

Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, Yuan Fang Li, Yong-Bin Kang, Rifat Shahriyar

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

32 Citations (Scopus)

Abstract

We present CrossSum, a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs. We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset and perform a controlled human evaluation to validate its quality. We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language. We also introduce LaSE, an embedding-based metric for automatically evaluating model-generated summaries. LaSE is strongly correlated with ROUGE and, unlike ROUGE, can be reliably measured even in the absence of references in the target language. Performance on ROUGE and LaSE indicates that our proposed model consistently outperforms baseline models. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and the first ever that is not centered around English. We are releasing the dataset, training and evaluation scripts, and models to spur future research on cross-lingual summarization. The resources can be found at https://github.com/csebuetnlp/CrossSum.

Original language: English
Title of host publication: The 61st Conference of the Association for Computational Linguistics, Proceedings of the Conference, Volume 1: Long Papers
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Place of Publication: Stroudsburg PA USA
Publisher: Association for Computational Linguistics (ACL)
Pages: 2541-2564
Number of pages: 24
ISBN (Electronic): 9781959429722
Publication status: Published - 2023
Event: Annual Meeting of the Association for Computational Linguistics 2023 - Toronto, Canada
Duration: 9 Jul 2023 → 14 Jul 2023
Conference number: 61st
https://aclanthology.org/volumes/2023.acl-long/ (Proceedings - 1)
https://aclanthology.org/volumes/2023.findings-acl/ (Proceedings - 2)
https://2023.aclweb.org/ (Website)

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
Publisher: Association for Computational Linguistics (ACL)
Volume: 1
ISSN (Print): 0736-587X

Conference

Conference: Annual Meeting of the Association for Computational Linguistics 2023
Abbreviated title: ACL 2023
Country/Territory: Canada
City: Toronto
Period: 9/07/23 → 14/07/23