Unified open-vocabulary dense visual prediction

Hengcan Shi, Munawar Hayat, Jianfei Cai

Research output: Contribution to journal › Article › Research › peer-review

1 Citation (Scopus)

Abstract

In recent years, open-vocabulary (OV) dense visual prediction (such as OV object detection and semantic, instance and panoptic segmentation) has attracted increasing research attention. However, most existing approaches are task-specific, i.e., they tackle each task individually. In this paper, we propose a Unified Open-Vocabulary Network (UOVN) to jointly address four common dense prediction tasks. Compared with separate models, a unified network is more desirable for diverse industrial applications. Moreover, training data for OV dense prediction is relatively scarce. Separate networks can only leverage task-relevant training data, whereas a unified approach can integrate diverse data to boost each individual task. We address two major challenges in unified OV prediction. First, unlike unified methods for fixed-set predictions, OV networks are usually trained with multi-modal data. We therefore propose a multi-modal, multi-scale and multi-task (MMM) decoding mechanism to better exploit multi-modal information for OV recognition. Second, because UOVN uses data from different tasks for training, there are significant domain and task gaps. We present a UOVN training mechanism to reduce such gaps. Experiments on four datasets demonstrate the effectiveness of our UOVN.
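For readers unfamiliar with OV recognition heads, the sketch below illustrates the general idea of scoring predicted regions against text embeddings of arbitrary class names, which is what allows the label set to be changed at inference time. It is a minimal illustration under assumed names and shapes (open_vocab_logits, region_feats, text_feats), not the paper's actual MMM decoder or training mechanism.

```python
# Illustrative sketch: open-vocabulary classification by matching region
# embeddings against text (class-name) embeddings. Function and tensor
# names are assumptions for illustration, not the paper's implementation.
import torch
import torch.nn.functional as F

def open_vocab_logits(region_feats: torch.Tensor,
                      text_feats: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """region_feats: (num_regions, dim) visual embeddings of predicted regions.
    text_feats: (num_classes, dim) embeddings of class-name prompts.
    Returns (num_regions, num_classes) similarity logits; the class prompts
    can be swapped at inference time, making the head open-vocabulary."""
    region_feats = F.normalize(region_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return region_feats @ text_feats.t() / temperature

# Example: 5 region queries scored against 3 novel class prompts.
logits = open_vocab_logits(torch.randn(5, 256), torch.randn(3, 256))
print(logits.shape)  # torch.Size([5, 3])
```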

Original language: English
Pages (from-to): 8704-8716
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 26
DOIs
Publication status: Published - 26 Mar 2024

Keywords

  • Decoding
  • Feature extraction
  • image segmentation
  • Object detection
  • open-vocabulary
  • Semantics
  • Task analysis
  • Training
  • Visualization
