Abstract
We design an open-vocabulary multi-label image classification model that predicts multiple novel concepts in an image, building on a powerful language-image pretrained model, CLIP. While CLIP achieves remarkable performance on single-label zero-shot image classification, it only utilizes a global image feature, which is less suitable for predicting multiple labels. To address this problem, we propose a novel method that contains an Image-Text attention module to extract multiple class-specific image features from CLIP. In addition, we introduce a new training method with a contrastive loss that helps the attention module find diverse attention masks for all classes. During testing, the class-specific features are interpolated with the CLIP features to boost performance. Extensive experiments show that our proposed method achieves state-of-the-art performance on zero-shot multi-label image classification on two benchmark datasets.
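The abstract outlines the core idea: class-specific image features are obtained by attending over CLIP's local image features with per-class text embeddings, and at test time the resulting class-specific scores are interpolated with the global CLIP scores. Below is a minimal sketch of that scoring step, assuming PyTorch and precomputed CLIP features; the function, tensor names, and the softmax-attention formulation are illustrative assumptions, not the authors' implementation, and the training-time contrastive loss over attention masks is omitted.

```python
import torch
import torch.nn.functional as F


def class_specific_scores(patch_feats, global_feat, text_embeds, alpha=0.5, tau=0.07):
    """Score each class by attending over CLIP local features with its text embedding.

    patch_feats: (N, D) CLIP patch/token features for one image
    global_feat: (D,)   CLIP global image feature
    text_embeds: (C, D) CLIP text embeddings, one per class prompt
    alpha:              interpolation weight between class-specific and global scores
    tau:                softmax temperature for the attention (assumed value)
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    global_feat = F.normalize(global_feat, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Image-Text attention: each class embedding queries the local features,
    # producing one class-specific image feature per class.
    attn = torch.softmax(text_embeds @ patch_feats.T / tau, dim=-1)   # (C, N)
    class_feats = F.normalize(attn @ patch_feats, dim=-1)             # (C, D)

    # Class-specific (local) and global cosine-similarity scores per class.
    local_scores = (class_feats * text_embeds).sum(dim=-1)            # (C,)
    global_scores = text_embeds @ global_feat                         # (C,)

    # Test-time interpolation of class-specific scores with the CLIP scores.
    return alpha * local_scores + (1.0 - alpha) * global_scores
```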
Original language | English |
---|---|
Title of host publication | Proceedings - 2023 IEEE International Conference on Multimedia and Expo, ICME 2023 |
Editors | Aous Naman |
Place of Publication | Piscataway NJ USA |
Publisher | IEEE, Institute of Electrical and Electronics Engineers |
Pages | 2135-2140 |
Number of pages | 6 |
ISBN (Electronic) | 9781665468916 |
ISBN (Print) | 9781665468923 |
DOIs | |
Publication status | Published - 2023 |
Event | IEEE International Conference on Multimedia and Expo 2023 - Brisbane, Australia. Duration: 10 Jul 2023 → 14 Jul 2023. https://ieeexplore.ieee.org/xpl/conhome/10219544/proceeding (Proceedings); https://www.2023.ieeeicme.org/ (Website) |
Conference
Conference | IEEE International Conference on Multimedia and Expo 2023 |
---|---|
Abbreviated title | ICME 2023 |
Country/Territory | Australia |
City | Brisbane |
Period | 10/07/23 → 14/07/23 |
Internet address | https://ieeexplore.ieee.org/xpl/conhome/10219544/proceeding (Proceedings); https://www.2023.ieeeicme.org/ (Website) |
Keywords
- Open-Vocabulary Multi-Label Classification
- Zero-Shot Learning