Open-Vocabulary Multi-label Image Classification with Pretrained Vision-Language Model

Son D. Dao, Dat Huynh, He Zhao, Dinh Phung, Jianfei Cai

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

3 Citations (Scopus)

Abstract

We design an open-vocabulary multi-label image classification model that predicts multiple novel concepts in an image, building on a powerful language-image pretrained model, namely CLIP. While CLIP achieves remarkable performance on single-label zero-shot image classification, it uses only a global image feature, which is less suitable for predicting multiple labels. To address this problem, we propose a novel method that contains an Image-Text attention module to extract multiple class-specific image features from CLIP. In addition, we introduce a new training method with a contrastive loss to help the attention module find diverse attention masks for all classes. During testing, the class-specific features are interpolated with CLIP features to boost performance. Extensive experiments show that our proposed method achieves state-of-the-art performance on zero-shot multi-label image classification on two benchmark datasets.
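The inference-time idea in the abstract (one attention mask per class over image regions, a class-specific feature, then interpolation with the global CLIP feature) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The abstract does not specify the attention module's form, so plain dot-product attention with the text embedding as the query is used as a stand-in. The function name `multilabel_scores`, the `temperature` and `alpha` parameters, and the use of plain lists instead of real CLIP features are all assumptions made for illustration. The learned attention module and the contrastive training are not reproduced here.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multilabel_scores(patch_feats, global_feat, text_feats,
                      temperature=0.07, alpha=0.5):
    """Hypothetical sketch: score every class for one image.

    patch_feats: list of P patch feature vectors (assumed to come from CLIP)
    global_feat: the global image feature vector
    text_feats:  list of C class-name text embeddings
    alpha:       assumed interpolation weight between the class-specific
                 feature and the global feature
    """
    patches = [normalize(p) for p in patch_feats]
    glob = normalize(global_feat)
    scores, masks = [], []
    for text in (normalize(t) for t in text_feats):
        # Attention mask of this class over the patches (sums to 1).
        mask = softmax([dot(text, p) / temperature for p in patches])
        # Class-specific feature: attention-weighted average of the patches.
        class_feat = normalize([
            sum(w * p[d] for w, p in zip(mask, patches))
            for d in range(len(text))
        ])
        # Interpolate with the global feature, then compare with the text.
        mixed = normalize([alpha * c + (1 - alpha) * g
                           for c, g in zip(class_feat, glob)])
        scores.append(dot(mixed, text))
        masks.append(mask)
    return scores, masks
```

Each returned score is a cosine similarity, so a multi-label prediction under this sketch would threshold or rank the scores per class rather than take a single argmax.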

Original language: English
Title of host publication: Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023
Editors: Aous Naman
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Pages: 2135-2140
Number of pages: 6
ISBN (Electronic): 9781665468916
ISBN (Print): 9781665468923
Publication status: Published - 2023
Event: IEEE International Conference on Multimedia and Expo 2023 - Brisbane, Australia
Duration: 10 Jul 2023 to 14 Jul 2023
https://ieeexplore.ieee.org/xpl/conhome/10219544/proceeding (Proceedings)
https://www.2023.ieeeicme.org/ (Website)

Conference

Conference: IEEE International Conference on Multimedia and Expo 2023
Abbreviated title: ICME 2023
Country/Territory: Australia
City: Brisbane
Period: 10/07/23 to 14/07/23

Keywords

  • Open-Vocabulary Multi-Label Classification
  • Zero-Shot Learning
