Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks

Fengda Zhu, Yi Zhu, Xiaojun Chang, Xiaodan Liang

    Research output: Chapter in Book/Report/Conference proceeding, Conference Paper, Research, peer-review

    39 Citations (Scopus)


    Vision-Language Navigation (VLN) is a task where an agent learns to navigate following a natural language instruction. The key to this task is to perceive both the visual scene and natural language sequentially. Conventional approaches fully exploit vision and language features in cross-modal grounding. However, the VLN task remains challenging, since previous works have implicitly neglected the rich semantic information contained in environments (such as navigation graphs or sub-trajectory semantics). In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks that exploit the additional training signals derived from this semantic information. The auxiliary tasks have four reasoning objectives: explaining the previous actions, evaluating the trajectory consistency, estimating the progress, and predicting the next direction. As a result, these additional training signals help the agent acquire knowledge of semantic representations in order to reason about its activities and build a thorough perception of environments. Our experiments demonstrate that auxiliary reasoning tasks improve both the performance of the main task and the model generalizability by a large margin. We further demonstrate empirically that an agent trained with self-supervised auxiliary reasoning tasks substantially outperforms the previous state-of-the-art method, being the best existing approach on the standard benchmark.
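
    The abstract names four auxiliary objectives but does not say how their training signals are combined with the main navigation loss. As a minimal sketch, assuming a plain weighted sum (a common multi-task choice, not confirmed by this record), the combination might look like the following. The names `AuxLosses` and `total_loss` and the default weights are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AuxLosses:
    """Scalar losses for the four auxiliary objectives named in the abstract.

    Field names are hypothetical; they only mirror the abstract's wording.
    """
    explain_actions: float         # explaining the previous actions
    trajectory_consistency: float  # evaluating the trajectory consistency
    progress: float                # estimating the progress
    next_direction: float          # predicting the next direction


def total_loss(main_loss, aux, weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the main navigation loss with the auxiliary losses.

    Assumption: a weighted sum. The record does not state the actual
    combination rule or any weight values.
    """
    terms = (
        aux.explain_actions,
        aux.trajectory_consistency,
        aux.progress,
        aux.next_direction,
    )
    if len(weights) != len(terms):
        raise ValueError("expected one weight per auxiliary loss")
    return main_loss + sum(w * t for w, t in zip(weights, terms))


if __name__ == "__main__":
    aux = AuxLosses(0.5, 0.25, 0.125, 0.125)
    print(total_loss(2.0, aux))
```

    Setting all four weights to zero recovers the main-task loss alone, which gives a simple baseline for checking whether the auxiliary signals help.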

    Original language: English
    Title of host publication: Proceedings - 33rd IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2020
    Editors: Ce Liu, Greg Mori, Kate Saenko, Silvio Savarese
    Place of Publication: Piscataway NJ USA
    Publisher: IEEE, Institute of Electrical and Electronics Engineers
    Number of pages: 11
    ISBN (Electronic): 9781728171685
    ISBN (Print): 9781728171692
    Publication status: Published - 2020
    Event: IEEE Conference on Computer Vision and Pattern Recognition 2020 - Virtual, China
    Duration: 14 Jun 2020 to 19 Jun 2020

    Publication series

    Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
    Publisher: IEEE, Institute of Electrical and Electronics Engineers
    ISSN (Print): 1063-6919
    ISSN (Electronic): 2575-7075


    Conference: IEEE Conference on Computer Vision and Pattern Recognition 2020
    Abbreviated title: CVPR 2020