Abstract
We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task in which embodied AI agents must reason about and forecast the future actions of humans, given video clips of their initial activities and their intents expressed in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multiple-choice question test set, and four strong baselines. One baseline implements a neurosymbolic approach based on a multimodal knowledge base (MKB); the others are deep generative models adapted from recent state-of-the-art (SOTA) methods.
Original language | English |
---|---|
Title of host publication | EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023 |
Editors | Ryan Cotterell, Carolina Scarton |
Place of Publication | Stroudsburg PA USA |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 2192-2207 |
Number of pages | 16 |
ISBN (Electronic) | 9781959429470 |
Publication status | Published - 2023 |
Event | European Association of Computational Linguistics Conference 2023, Dubrovnik, Croatia |
Duration | 2 May 2023 → 6 May 2023 |
Conference number | 17th |
Website | https://2023.eacl.org/ |
Proceedings | https://aclanthology.org/volumes/2023.eacl-main/ |
Conference
Conference | European Association of Computational Linguistics Conference 2023 |
---|---|
Abbreviated title | EACL 2023 |
Country/Territory | Croatia |
City | Dubrovnik |
Period | 2/05/23 → 6/05/23 |
Internet address | https://2023.eacl.org/ |