A data efficient transformer based on Swin Transformer

Bibliographic Details
Published in: The Visual Computer, Vol. 40, No. 4, pp. 2589-2598
Main Authors: Yao, Dazhi; Shao, Yunxue
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg (Springer Nature B.V.), 01.04.2024

Summary: Almost all Vision Transformer-based models must be pre-trained on massive datasets at high computational cost. If researchers lack the data to train a Vision Transformer-based model, or lack GPUs powerful enough to process millions of labeled images, Vision Transformer-based models offer no advantage over CNNs. Swin Transformer was proposed to address these problems by applying shifted window-based self-attention, which has linear computational complexity. Although Swin Transformer significantly reduces computing costs and works well on mid-size datasets, it still performs poorly when trained on a small dataset. In this paper, we propose a hierarchical and data-efficient Transformer based on Swin Transformer, which we call ESwin Transformer. We mainly redesigned the patch embedding and patch merging modules of Swin Transformer, adding only simple convolutional components to them, which significantly improves performance when the model is trained on a small dataset. Our empirical results show that ESwin Transformer trained on CIFAR10/CIFAR100 with no extra data for 300 epochs achieves 97.17%/83.78% accuracy and outperforms Swin Transformer and DeiT for the same training time.
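The abstract does not spell out which convolutional components replaced the original modules, so the sketch below only illustrates the general idea: a small convolutional stem in place of Swin's linear patch embedding, and a strided convolution in place of its patch merging. The class names (ConvPatchEmbed, ConvPatchMerging) and all layer choices here are hypothetical assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the paper's) of convolutional
# replacements for Swin Transformer's patch embedding and patch merging.
import torch
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    """Embed an image into patch tokens with a small convolutional stem
    instead of Swin's single stride-4 linear projection (hypothetical)."""
    def __init__(self, in_chans=3, embed_dim=96):
        super().__init__()
        self.stem = nn.Sequential(
            # Two stride-2 convolutions give the same 4x downsampling as
            # Swin's 4x4 patch split, but with overlapping receptive fields.
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
        )

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.stem(x)                     # (B, embed_dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, embed_dim)

class ConvPatchMerging(nn.Module):
    """Downsample tokens between stages with a stride-2 convolution
    instead of Swin's 2x2 concatenation + linear reduction (hypothetical)."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Conv2d(dim, 2 * dim, kernel_size=3, stride=2, padding=1)
        self.norm = nn.BatchNorm2d(2 * dim)

    def forward(self, x, h, w):              # x: (B, h*w, dim)
        b, _, dim = x.shape
        x = x.transpose(1, 2).reshape(b, dim, h, w)
        x = self.norm(self.reduction(x))     # (B, 2*dim, h/2, w/2)
        return x.flatten(2).transpose(1, 2)  # (B, h/2 * w/2, 2*dim)

# Quick shape check on a CIFAR-sized input.
tokens = ConvPatchEmbed()(torch.randn(1, 3, 32, 32))
print(tokens.shape)                          # torch.Size([1, 64, 96])
merged = ConvPatchMerging(96)(tokens, 8, 8)
print(merged.shape)                          # torch.Size([1, 16, 192])
```

A convolutional stem of this kind injects locality and translation-equivariance priors that a plain linear patch projection lacks, which is a commonly cited reason such hybrid designs train better on small datasets like CIFAR10/CIFAR100.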
ISSN: 0178-2789 (print)
ISSN: 1432-2315 (electronic)
DOI: 10.1007/s00371-023-02939-2