ViLPAct: A benchmark for compositional generalization on multimodal human activities

Terry Yue Zhuo, Yaqing Liao, Yuecheng Lei, Lizhen Qu, Gerard de Melo, Xiaojun Chang, Yazhou Ren, Zenglin Xu

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

1 Citation (Scopus)

Abstract

We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task where embodied AI agents must reason about and forecast the future actions of humans, given video clips of their initial activities together with their intents expressed in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multiple-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multimodal knowledge base (MKB), while the others are deep generative models adapted from recent state-of-the-art (SOTA) methods.
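The abstract names the ingredients of each test case: a Charades clip showing the initial activity, a crowdsourced intent in text, and a multiple-choice question over possible future action sequences. A minimal, purely illustrative sketch of how such an item might be represented is shown below; the field names, types, number of choices, and example values are assumptions made for illustration and are not taken from the released dataset.

```python
# Hypothetical sketch of a ViLPAct-style multiple-choice test item,
# based only on the components named in the abstract. All field names
# and example values are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class ViLPActItem:
    video_id: str                     # Charades clip of the initial activity
    intent: str                       # crowdsourced intent, in natural language
    candidate_plans: List[List[str]]  # candidate future action sequences
    answer_index: int                 # index of the correct continuation


item = ViLPActItem(
    video_id="CHARADES_CLIP_ID",
    intent="The person wants to get ready to leave the house.",
    candidate_plans=[
        ["put on shoes", "grab keys", "open door"],
        ["sit on couch", "watch television"],
    ],
    answer_index=0,
)
```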

Original language: English
Title of host publication: EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023
Editors: Ryan Cotterell, Carolina Scarton
Place of Publication: Stroudsburg, PA, USA
Publisher: Association for Computational Linguistics (ACL)
Pages: 2192-2207
Number of pages: 16
ISBN (Electronic): 9781959429470
Publication status: Published - 2023
Event: Conference of the European Chapter of the Association for Computational Linguistics 2023 - Dubrovnik, Croatia
Duration: 2 May 2023 - 6 May 2023
Conference number: 17th
https://2023.eacl.org/ (Website)
https://aclanthology.org/volumes/2023.eacl-main/ (Proceedings)

Conference

Conference: Conference of the European Chapter of the Association for Computational Linguistics 2023
Abbreviated title: EACL 2023
Country/Territory: Croatia
City: Dubrovnik
Period: 2/05/23 - 6/05/23
Internet address: https://2023.eacl.org/
