HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning

Fucai Ke, Zhixi Cai, Simindokht Jahangard, Weiqing Wang, Pari Delir Haghighi, Hamid Rezatofighi

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

Abstract

Recent advances in visual reasoning (VR), particularly with the aid of Large Vision-Language Models (VLMs), show promise but require access to large-scale datasets and face challenges such as high computational costs and limited generalization capabilities. Compositional visual reasoning approaches have emerged as effective strategies; however, they heavily rely on the commonsense knowledge encoded in Large Language Models (LLMs) to perform planning, reasoning, or both, without considering the effect of their decisions on the visual reasoning process, which can lead to errors or failed procedures. To address these challenges, we introduce HYDRA, a multi-stage dynamic compositional visual reasoning framework designed for reliable and incrementally progressive general reasoning. HYDRA integrates three essential modules: a planner, a Reinforcement Learning (RL) agent serving as a cognitive controller, and a reasoner. The planner and reasoner modules utilize an LLM to generate instruction samples and executable code from the selected instruction, respectively, while the RL agent dynamically interacts with these modules, making high-level decisions on selection of the best instruction sample given information from the historical state stored through a feedback loop. This adaptable design enables HYDRA to adjust its actions based on previous feedback received during the reasoning process, leading to more reliable reasoning outputs and ultimately enhancing its overall effectiveness. Our framework demonstrates state-of-the-art performance in various VR tasks on four different widely-used datasets.
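The control flow described in the abstract (planner proposes instruction samples, a controller selects one, the reasoner acts on it, and the outcome is stored for later iterations) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: every name here (`State`, `planner`, `controller`, `reasoner`, `run`) is hypothetical, the planner and reasoner are plain stubs standing in for LLM calls, and the controller is a fixed hand-written rule standing in for the trained RL agent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class State:
    """Historical state shared across iterations (the feedback loop)."""
    query: str
    history: List[str] = field(default_factory=list)


def planner(state: State, n_samples: int = 3) -> List[str]:
    # Stub for an LLM call that proposes candidate instruction samples.
    step = len(state.history)
    return [f"step{step}-option{i}" for i in range(n_samples)]


def controller(state: State, candidates: List[str]) -> Optional[str]:
    # Stub for the RL agent: a fixed rule, not a learned policy.
    # It skips instructions already tried and picks the first remaining one.
    fresh = [c for c in candidates if c not in state.history]
    return fresh[0] if fresh else None


def reasoner(state: State, instruction: str) -> Optional[str]:
    # Stub for LLM code generation plus execution of that code.
    # Assumption for the demo: an answer exists once two steps have run.
    if len(state.history) >= 2:
        return f"answer after {len(state.history)} steps"
    return None


def run(query: str, max_steps: int = 5) -> Optional[str]:
    state = State(query=query)
    for _ in range(max_steps):
        candidates = planner(state)
        chosen = controller(state, candidates)
        if chosen is None:
            continue  # no acceptable sample: ask the planner again
        state.history.append(chosen)  # feedback kept for later iterations
        answer = reasoner(state, chosen)
        if answer is not None:
            return answer
    return None  # step budget exhausted without an answer
```

Calling `run("What is left of the cup?")` walks through two plan/select/reason iterations before the stub reasoner returns an answer; with `max_steps=1` the loop ends without one and `run` returns `None`.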

Original language: English
Title of host publication: Computer Vision – ECCV 2024, 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XX
Editors: Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, Gül Varol
Place of Publication: Cham, Switzerland
Publisher: Springer
Pages: 132-149
Number of pages: 18
ISBN (Electronic): 9783031726613
ISBN (Print): 9783031726606
Publication status: Published - 2025
Event: European Conference on Computer Vision 2024 - Milan, Italy
Duration: 29 Sept 2024 – 4 Oct 2024
Conference number: 18th
https://eccv2024.ecva.net/Conferences/2024/Dates
https://media.eventhosts.cc/Conferences/ECCV2024/ConferenceProgram.pdf (Proceedings)

Publication series

Name: Lecture Notes in Computer Science
Publisher: Springer
Volume: 15078
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: European Conference on Computer Vision 2024
Abbreviated title: ECCV 2024
Country/Territory: Italy
City: Milan
Period: 29/09/24 – 4/10/24

Keywords

  • Large Language Models (LLMs)
  • Reinforcement learning
  • Visual reasoning
