Multi-resolution Fine-Tuning of Vision Transformers

Kerr Fitzgerald, Meng Law, Jarrel C.Y. Seah, Jennifer Tang, Bogdan Matuszewski

Research output: Chapter in Book/Report/Conference proceeding › Conference Paper › Research › peer-review


For computer vision systems based on artificial neural networks, increasing the image resolution typically improves network performance. However, ImageNet pre-trained Vision Transformer (ViT) models are typically only openly available for 224² and 384² image resolutions. To determine the impact of using higher resolution images with ViT systems, the performance differences between ViT-B/16 models (designed for 384² and 544² image resolutions) were evaluated. The multi-label classification RANZCR CLiP challenge dataset, which contains over 30,000 high resolution labelled chest X-ray images, was used throughout this investigation. First, the performance of the ViT 384² and ViT 544² models with no ImageNet pre-training (i.e. models trained only on RANZCR data) was compared to see whether using higher resolution images increases performance. A multi-resolution fine-tuning approach was then investigated for transfer learning. In this approach, learned parameters were transferred from ImageNet pre-trained ViT 384² models, which had undergone further training on the 384² RANZCR data, to ViT 544² models, which were then trained on the 544² RANZCR data. Learned parameters were transferred via a tensor slice copying technique. The results provide evidence that using larger image resolutions positively impacts ViT network performance and that multi-resolution fine-tuning can lead to performance gains. The multi-resolution fine-tuning approach used in this investigation could potentially improve the performance of other computer vision systems that use ViT-based networks. The results may also warrant the development of new ViT variants optimized to work with high resolution image datasets.
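The abstract does not spell out how the tensor slice copying is laid out, so the following is only a minimal sketch of one plausible reading: wherever a parameter of the 544² model is larger than its 384² counterpart (most obviously the position embeddings, since a ViT-B/16 splits the image into 16 × 16 patches and adds one class token), the source values are copied into the leading slice of the target and the remaining entries keep their fresh initialization. The helper names `num_tokens` and `slice_copy`, the toy embedding width, and the leading-slice layout are all assumptions made for illustration, not the authors' implementation.

```python
def num_tokens(image_size, patch_size=16):
    """Tokens for a square image: one per patch plus one class token."""
    side = image_size // patch_size
    return side * side + 1


def slice_copy(source, target):
    """Copy `source` into the leading slice of `target`, axis by axis.

    Both arguments are nested lists standing in for tensors. `target` is
    modified in place; entries outside the copied slice are left untouched.
    This leading-slice layout is an assumption, not taken from the paper.
    """
    if len(source) > len(target):
        raise ValueError("source does not fit inside target")
    for i, item in enumerate(source):
        if isinstance(item, list):
            slice_copy(item, target[i])
        else:
            target[i] = item
    return target


src_tokens = num_tokens(384)  # 24 * 24 patches + class token
tgt_tokens = num_tokens(544)  # 34 * 34 patches + class token

dim = 4  # toy embedding width to keep the example small (assumption)

# Stand-ins for position-embedding tables: "trained" values are 1.0,
# freshly initialized values are 0.0.
source_pos = [[1.0] * dim for _ in range(src_tokens)]
target_pos = [[0.0] * dim for _ in range(tgt_tokens)]

slice_copy(source_pos, target_pos)

copied = sum(1 for row in target_pos if row == [1.0] * dim)
print(src_tokens, tgt_tokens, copied)
```

Parameters whose shapes do not depend on the image resolution (for example the attention and MLP weights) would be copied whole under this scheme, since the source then fills the entire target.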

Original language: English
Title of host publication: Medical Image Understanding and Analysis
Subtitle of host publication: 26th Annual Conference, MIUA 2022, Cambridge, UK, July 27–29, 2022, Proceedings
Editors: Guang Yang, Angelica Aviles-Rivero, Michael Roberts, Carola-Bibiane Schönlieb
Place of Publication: Switzerland
Number of pages: 12
ISBN (Electronic): 9783031120534
ISBN (Print): 9783031120527
Publication status: Published - 2022
Externally published: Yes
Event: Medical Image Understanding and Analysis 2022 - Cambridge, United Kingdom
Duration: 27 Jul 2022 to 29 Jul 2022

Publication series

Name: Lecture Notes in Computer Science
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: Medical Image Understanding and Analysis 2022
Abbreviated title: MIUA 2022
Country/Territory: United Kingdom


  • Computer vision
  • Fine-tuning
  • Medical data
  • Transfer learning
  • Vision transformer
  • ViT
