Multimodal transformer with Variable-length Memory for Vision-and-Language Navigation

Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan

Research output: Chapter in Book/Report/Conference proceeding > Conference Paper > Research > peer-review

10 Citations (Scopus)

Abstract

Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interaction with the environment as it moves. Recent Transformer-based VLN methods have made great progress by directly connecting visual observations and language instructions through multimodal cross-attention. However, these methods usually represent temporal context as a fixed-length vector, either by using an LSTM decoder or by using manually designed hidden states to build a recurrent Transformer. Because a single fixed-length vector is often insufficient to capture long-term temporal context, we introduce the Multimodal Transformer with Variable-length Memory (MTVM) for visually grounded natural language navigation, which models temporal context explicitly. Specifically, MTVM enables the agent to keep track of the navigation trajectory by directly storing the activations of the previous time step in a memory bank. To further boost performance, we propose a memory-aware consistency loss that helps learn a better joint representation of temporal context with randomly masked instructions. We evaluate MTVM on the popular R2R and CVDN datasets. Our model improves Success Rate on the R2R test set by 2% and improves Goal Progress by 1.5 m on the CVDN test set. Code is available at: https://github.com/clin1223/MTVM.
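The variable-length memory idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The class name `MemoryBank`, the single-head scaled dot-product read, and the random vectors standing in for per-step activations are all assumptions made for illustration. The point it shows is that the memory grows by one entry per navigation step instead of being squeezed into a single fixed-length vector.

```python
import numpy as np


class MemoryBank:
    """Hypothetical variable-length memory: one stored activation per past step."""

    def __init__(self, dim):
        self.dim = dim
        self.slots = []  # grows by one entry per navigation step

    def __len__(self):
        return len(self.slots)

    def write(self, activation):
        """Append the activation produced at the current time step."""
        activation = np.asarray(activation, dtype=np.float64)
        if activation.shape != (self.dim,):
            raise ValueError(f"expected shape ({self.dim},), got {activation.shape}")
        self.slots.append(activation)

    def read(self, query):
        """Attend over all stored steps with scaled dot-product attention."""
        query = np.asarray(query, dtype=np.float64)
        if not self.slots:
            return np.zeros(self.dim)  # nothing stored yet
        memory = np.stack(self.slots)                # shape (t, dim)
        scores = memory @ query / np.sqrt(self.dim)  # shape (t,)
        scores = scores - scores.max()               # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum()
        return weights @ memory                      # shape (dim,)


# Toy rollout: random vectors stand in for the model's per-step activations.
rng = np.random.default_rng(0)
bank = MemoryBank(dim=8)
for step in range(5):
    activation = rng.standard_normal(8)
    context = bank.read(activation)   # temporal context from all past steps
    bank.write(activation + context)  # store this step for future reads
```

After the rollout, `len(bank)` equals the number of steps taken, so the amount of temporal context available to the agent scales with the trajectory length.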

Original language: English
Title of host publication: Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Place of Publication: Cham, Switzerland
Publisher: Springer
Pages: 380-397
Number of pages: 18
ISBN (Electronic): 9783031200595
ISBN (Print): 9783031200588
Publication status: Published - 2022
Event: European Conference on Computer Vision 2022 - Tel Aviv, Israel
Duration: 23 Oct 2022 to 27 Oct 2022
Conference number: 17th
https://link.springer.com/book/10.1007/978-3-031-19830-4 (Proceedings)
https://eccv2022.ecva.net (Website)

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 13696
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: European Conference on Computer Vision 2022
Abbreviated title: ECCV 2022
Country/Territory: Israel
City: Tel Aviv
Period: 23/10/22 to 27/10/22

Keywords

  • Multimodal transformer
  • Vision-and-language navigation
