Patch-Level Consistency Regularization in Self-Supervised Transfer Learning for Fine-Grained Image Recognition

Bibliographic Details
Published in	Applied sciences Vol. 13; no. 18; p. 10493
Main Authors	Lee, Yejin, Lee, Suho, Hwang, Sangheum
Format	Journal Article
Language	English
Published	Basel MDPI AG 01.09.2023
Subjects	Annotations; Classification; Computational linguistics; Datasets; fine-grained image recognition; Graph representations; Language processing; Learning strategies; Natural language interfaces; self-supervised learning; Semantics; transfer learning; Vision Transformer
Summary:	Fine-grained image recognition aims to classify fine subcategories belonging to the same parent category, such as vehicle model or bird species classification. This is an inherently challenging task because a classifier must capture subtle interclass differences under large intraclass variance. Most previous approaches are based on supervised learning, which requires a large-scale labeled dataset. However, such large-scale annotated datasets for fine-grained image recognition are difficult to collect because labeling them generally requires domain expertise. In this study, we propose a self-supervised transfer learning method based on Vision Transformer (ViT) to learn finer representations without human annotations. We observe that existing self-supervised learning methods using ViT (e.g., DINO) show poor patch-level semantic consistency, which may be detrimental to learning finer representations. Motivated by this observation, we propose a consistency loss function that encourages patch embeddings of the overlapping area between two augmented views to be similar to each other during self-supervised learning on fine-grained datasets. In addition, we explore effective transfer learning strategies to fully leverage existing self-supervised models trained on large-scale labeled datasets. Contrary to the previous literature, our findings indicate that training only the last block of ViT is effective for self-supervised transfer learning. We demonstrate the effectiveness of our proposed approach through extensive experiments on six fine-grained image classification benchmark datasets: FGVC Aircraft, CUB-200-2011, Food-101, Oxford 102 Flowers, Stanford Cars, and Stanford Dogs. Under the linear evaluation protocol, our method achieves an average accuracy of 78.5%, outperforming the existing transfer learning method, which yields 77.2%.
ISSN:	2076-3417
DOI:	10.3390/app131810493
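
The summary describes a consistency loss that pulls together patch embeddings from the overlapping area of two augmented views. The record does not give the loss formula, so the following is only a minimal sketch of one way such a term could be computed, not the authors' implementation. It assumes each view is an axis-aligned crop given as a box (x0, y0, x1, y1) in image coordinates, that each view is split into a grid x grid patch layout, that patches are matched by patch-center location, and that similarity is measured with cosine similarity. All function names are hypothetical, and augmentations such as horizontal flips are ignored.

```python
import math


def patch_centers(box, grid):
    """Image-space centers of a grid x grid patch layout inside a crop box."""
    x0, y0, x1, y1 = box
    pw, ph = (x1 - x0) / grid, (y1 - y0) / grid
    # Row-major order: index = row * grid + col.
    return [(x0 + (c + 0.5) * pw, y0 + (r + 0.5) * ph)
            for r in range(grid) for c in range(grid)]


def match_overlapping_patches(box_a, box_b, grid):
    """Pair each patch of view A whose center lies inside view B
    with the view-B patch that contains that center (assumed matching rule)."""
    x0, y0, x1, y1 = box_b
    pw, ph = (x1 - x0) / grid, (y1 - y0) / grid
    pairs = []
    for idx_a, (cx, cy) in enumerate(patch_centers(box_a, grid)):
        if x0 <= cx < x1 and y0 <= cy < y1:
            col = min(grid - 1, int((cx - x0) / pw))
            row = min(grid - 1, int((cy - y0) / ph))
            pairs.append((idx_a, row * grid + col))
    return pairs


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def patch_consistency_loss(emb_a, emb_b, pairs):
    """Mean (1 - cosine similarity) over matched patch pairs.
    Returns 0.0 when the two views do not overlap."""
    if not pairs:
        return 0.0
    total = sum(1.0 - cosine_similarity(emb_a[i], emb_b[j]) for i, j in pairs)
    return total / len(pairs)
```

In a training setup, a term like this would presumably be added to the main self-supervised objective, with `emb_a` and `emb_b` being the per-patch outputs of the ViT for the two views. The weighting between the two terms is not stated in this record.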