Scalable vision transformers with hierarchical pooling

Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, Jianfei Cai

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review

67 Citations (Scopus)


The recently proposed Visual image Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as image classification. However, the routine of the current ViT model is to maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length and hence reduces the computational cost, analogous to the feature maps downsampling in Convolutional Neural Networks (CNNs). It brings a great benefit that we can increase the model capacity by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity due to the reduced sequence length. Moreover, we empirically find that the average pooled visual tokens contain more discriminative information than the single class token. To demonstrate the improved scalability of our HVT, we conduct extensive experiments on the image classification task. With comparable FLOPs, our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets. Code is available at
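
As a rough illustration of the idea described in the abstract, the sketch below pools a token sequence between stages so that later layers attend over fewer tokens, then averages the remaining tokens into a single image-level feature. It is a minimal sketch, not the authors' implementation: the choice of max pooling, the kernel size, the stride, the number of stages, the 196-token input, and every function name are assumptions made for illustration.

```python
# Illustrative sketch only. Pooling type, kernel, stride, stage count,
# and all names below are assumptions, not taken from the paper.

def max_pool_tokens(tokens, kernel=3, stride=2):
    """1D max pooling along the sequence axis (no padding).

    tokens is a list of equal-length feature vectors.
    """
    pooled = []
    for start in range(0, len(tokens) - kernel + 1, stride):
        window = tokens[start:start + kernel]
        # Element-wise maximum over the tokens in the window.
        pooled.append([max(column) for column in zip(*window)])
    return pooled


def average_tokens(tokens):
    """Average all tokens into one vector, usable as an image-level feature."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def attention_cost(seq_len, dim):
    """Rough multiply count of one self-attention layer (QK^T and AV): 2 * n^2 * d."""
    return 2 * seq_len * seq_len * dim


# Toy input: 196 tokens (a 14x14 patch grid, assumed) with 4 features each.
tokens = [[float(i + j) for j in range(4)] for i in range(196)]

lengths = [len(tokens)]
for _ in range(2):  # two pooling stages, assumed
    tokens = max_pool_tokens(tokens)
    lengths.append(len(tokens))

feature = average_tokens(tokens)

print("sequence length per stage:", lengths)
print("attention cost per stage:", [attention_cost(n, 4) for n in lengths])
```

Because the attention term grows quadratically with sequence length, roughly halving the number of tokens at each stage cuts that term to about a quarter, which is the budget the abstract describes reinvesting in depth, width, resolution, or patch size.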

Original language: English
Title of host publication: Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
Editors: Dima Damen, Tal Hassner, Chris Pal, Yoichi Sato
Place of Publication: Piscataway NJ USA
Publisher: IEEE, Institute of Electrical and Electronics Engineers
Number of pages: 10
ISBN (Electronic): 9781665428125
ISBN (Print): 9781665428132
Publication status: Published - 2021
Event: IEEE International Conference on Computer Vision 2021 - Online, United States of America
Duration: 11 Oct 2021 to 17 Oct 2021

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
Publisher: IEEE, Institute of Electrical and Electronics Engineers
ISSN (Print): 1550-5499
ISSN (Electronic): 2380-7504


Conference: IEEE International Conference on Computer Vision 2021
Abbreviated title: ICCV 2021
Country/Territory: United States of America