Generating Faithful and Salient Text from Multimodal Data

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

Abstract

While large multimodal models (LMMs) have achieved strong performance on many multimodal tasks, they may still hallucinate when generating text, and their ability to detect salient features in visual data remains unclear. In this paper, we develop a framework to generate faithful and salient text from mixed-modal data, which includes images and structured data (represented as knowledge graphs or tables). Specifically, we train a vision critic model to identify hallucinated and non-salient features from the image modality. The critic model also generates a list of salient image features. This information is used in the post-editing step to improve the generation quality. Experiments on two datasets show that our framework improves LMMs’ generation quality in terms of both faithfulness and saliency, outperforming recent techniques aimed at reducing hallucination. The dataset and code are available at https://github.com/TahsinaHashem/FaithD2T.
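
The abstract describes a critic-then-post-edit pipeline: a vision critic flags hallucinated and non-salient image features in a draft and lists the salient ones, and that critique drives a post-editing pass. The sketch below is a hypothetical illustration of that flow only; the function names, the toy salience scoring, and the prompt-style post-edit are assumptions, not the released FaithD2T code.

# Hypothetical sketch of a critic-then-post-edit loop, assuming draft text has
# already been reduced to mentioned features and the image to (feature, salience)
# pairs. All names and the toy critic logic are illustrative, not the authors' code.

from dataclasses import dataclass


@dataclass
class CritiqueResult:
    hallucinated: list[str]   # mentioned in the draft but not grounded in the image
    non_salient: list[str]    # grounded in the image but unimportant
    salient: list[str]        # important image features the text should cover


def toy_vision_critic(draft_features: list[str],
                      image_features: dict[str, float],
                      salience_threshold: float = 0.5) -> CritiqueResult:
    """Stand-in for a trained vision critic: compares draft-mentioned features
    against image features with salience scores."""
    grounded = set(image_features)
    hallucinated = [f for f in draft_features if f not in grounded]
    non_salient = [f for f in draft_features
                   if f in grounded and image_features[f] < salience_threshold]
    salient = [f for f, s in image_features.items() if s >= salience_threshold]
    return CritiqueResult(hallucinated, non_salient, salient)


def post_edit_prompt(draft: str, critique: CritiqueResult) -> str:
    """Stand-in for the post-editing step: here we only assemble a revision
    instruction; in a real system this would be sent back to the LMM."""
    return (
        "Revise the draft below.\n"
        f"Remove unsupported claims about: {', '.join(critique.hallucinated) or 'none'}.\n"
        f"Drop non-salient details about: {', '.join(critique.non_salient) or 'none'}.\n"
        f"Cover these salient features: {', '.join(critique.salient)}.\n\n"
        f"Draft: {draft}"
    )


if __name__ == "__main__":
    draft = "A red car is parked next to a fountain while a dog sleeps nearby."
    draft_features = ["red car", "fountain", "dog"]
    image_features = {"red car": 0.9, "fountain": 0.2, "bicycle": 0.8}  # no dog in the image
    print(post_edit_prompt(draft, toy_vision_critic(draft_features, image_features)))

In this toy run the critic would flag "dog" as hallucinated, "fountain" as non-salient, and ask the revision to cover "red car" and "bicycle"; the paper's actual critic is a trained model rather than a threshold rule.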
Original language: English
Title of host publication: INLG 2024, The 17th International Natural Language Generation Conference, Proceedings of the Conference
Editors: Chung-Chi Chen
Place of Publication: Stroudsburg, PA, USA
Publisher: Association for Computational Linguistics (ACL)
Pages: 646–662
Number of pages: 17
ISBN (Electronic): 9798891761223
Publication status: Published - 2024
Event: International Natural Language Generation Conference 2024, Tokyo, Japan
Duration: 23 Sept 2024 – 27 Sept 2024
Conference number: 17th
https://aclanthology.org/volumes/2024.inlg-main/ (Proceedings)
https://inlg2024.github.io/ (Website)
https://aclanthology.org/2024.inlg-main.0/ (Proceedings)

Conference

Conference: International Natural Language Generation Conference 2024
Abbreviated title: INLG 2024
Country/Territory: Japan
City: Tokyo
Period: 23/09/24 – 27/09/24