A data efficient transformer based on Swin Transformer

Bibliographic Details
Published in: The Visual Computer, Vol. 40, No. 4, pp. 2589-2598
Main Authors: Yao, Dazhi; Shao, Yunxue
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg (Springer Nature B.V.), 01.04.2024

Summary: Almost all Vision Transformer-based models must be pre-trained on massive datasets at high computational cost. If researchers lack the data to train a Vision Transformer-based model, or lack GPUs powerful enough to process millions of labeled images, Vision Transformer-based models offer no advantage over CNNs. Swin Transformer was proposed to address these problems by applying shifted window-based self-attention, which has linear computational complexity. Although Swin Transformer significantly reduces computing costs and works well on mid-size datasets, it still performs poorly when trained on a small dataset. In this paper, we propose a hierarchical and data-efficient Transformer based on Swin Transformer, which we call ESwin Transformer. We mainly redesigned the patch embedding and patch merging modules of Swin Transformer, adding only simple convolutional components to them, which significantly improves performance when the model is trained on a small dataset. Our empirical results show that ESwin Transformer trained on CIFAR10/CIFAR100 with no extra data for 300 epochs achieves 97.17%/83.78% accuracy and outperforms Swin Transformer and DeiT for the same training time.
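The abstract does not spell out which convolutional components replaced the original modules, so the sketch below only illustrates the general idea: a small convolutional stem in place of Swin's linear patch embedding, and a strided convolution in place of its patch merging. The class names (ConvPatchEmbed, ConvPatchMerging) and all layer choices here are hypothetical assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the paper's) of convolutional
# replacements for Swin Transformer's patch embedding and patch merging.
import torch
import torch.nn as nn

class ConvPatchEmbed(nn.Module):
    """Embed an image into patch tokens with a small convolutional stem
    instead of Swin's single stride-4 linear projection (hypothetical)."""
    def __init__(self, in_chans=3, embed_dim=96):
        super().__init__()
        self.stem = nn.Sequential(
            # Two stride-2 convolutions give the same 4x downsampling as
            # Swin's 4x4 patch split, but with overlapping receptive fields.
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim),
        )

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.stem(x)                     # (B, embed_dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, embed_dim)

class ConvPatchMerging(nn.Module):
    """Downsample tokens between stages with a stride-2 convolution
    instead of Swin's 2x2 concatenation + linear reduction (hypothetical)."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Conv2d(dim, 2 * dim, kernel_size=3, stride=2, padding=1)
        self.norm = nn.BatchNorm2d(2 * dim)

    def forward(self, x, h, w):              # x: (B, h*w, dim)
        b, _, dim = x.shape
        x = x.transpose(1, 2).reshape(b, dim, h, w)
        x = self.norm(self.reduction(x))     # (B, 2*dim, h/2, w/2)
        return x.flatten(2).transpose(1, 2)  # (B, h/2 * w/2, 2*dim)

# Quick shape check on a CIFAR-sized input.
tokens = ConvPatchEmbed()(torch.randn(1, 3, 32, 32))
print(tokens.shape)                          # torch.Size([1, 64, 96])
merged = ConvPatchMerging(96)(tokens, 8, 8)
print(merged.shape)                          # torch.Size([1, 16, 192])
```

A convolutional stem of this kind injects locality and translation-equivariance priors that a plain linear patch projection lacks, which is a commonly cited reason such hybrid designs train better on small datasets like CIFAR10/CIFAR100.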
ISSN: 0178-2789 (print)
ISSN: 1432-2315 (electronic)
DOI: 10.1007/s00371-023-02939-2