ViT-PGC: vision transformer for pedestrian gender classification on small-size dataset

Bibliographic Details
Published in	Pattern analysis and applications : PAA Vol. 26; no. 4; pp. 1805 - 1819
Main Authors	Abbas, Farhat; Yasmin, Mussarat; Fayyaz, Muhammad; Asim, Usman
Format	Journal Article
Language	English
Published	London: Springer London, 01.11.2023; Springer Nature B.V
Subjects	Artificial neural networks; Classification; Computer Science; Datasets; Image analysis; Image retrieval; Modules; Object recognition; Pattern Recognition; Short Paper; Vision; Vision transformer; Pedestrian gender classification; Deep CNN models; LSA and SPT; SS datasets

More Information
Summary:	Pedestrian gender classification (PGC) is a key task in full-body pedestrian image analysis and has become important in applications such as content-based image retrieval, visual surveillance, smart cities, and demographic data collection. Over the last decade, convolutional neural networks (CNNs) have shown great potential and become a reliable choice for vision tasks such as object classification, recognition, and detection. However, CNNs have a limited local receptive field, which prevents them from learning information about the global context. In contrast, a vision transformer (ViT) is a better alternative to CNNs in this respect because it uses a self-attention mechanism to attend to different patches of an input image. In this work, a vision transformer model based on generic and effective modules, namely locality self-attention (LSA) and shifted patch tokenization (SPT), is explored for the PGC task. With these modules, the ViT is able to learn from scratch even on small-size (SS) datasets and to overcome its lack of locality inductive bias. Through extensive experimentation, we found that the proposed ViT model produced better results in terms of overall and mean accuracies. These results confirm that the proposed ViT outperforms state-of-the-art (SOTA) PGC methods.
ISSN:	1433-7541 1433-755X
DOI:	10.1007/s10044-023-01196-2
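
The summary above names two modules, shifted patch tokenization (SPT) and locality self-attention (LSA), without describing them. The sketch below illustrates the general idea of each as commonly described for small-dataset ViTs. It is not the authors' implementation. It assumes a single-channel image held as a list of lists, zero padding for shifted pixels, and a fixed temperature in place of a learned one; the function names are hypothetical, and the layer normalization and linear projection that would normally follow tokenization are omitted.

```python
import math


def shift_image(image, dy, dx):
    """Shift a 2D list by (dy, dx) pixels, zero-filling the vacated area."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = image[sy][sx]
    return out


def shifted_patch_tokens(image, patch):
    """SPT sketch: stack the image with four diagonal half-patch shifts,
    then cut non-overlapping patches and flatten each into one token."""
    half = patch // 2
    offsets = [(0, 0), (-half, -half), (-half, half),
               (half, -half), (half, half)]
    channels = [shift_image(image, dy, dx) for dy, dx in offsets]
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            tokens.append([ch[y][x] for ch in channels
                           for y in range(top, top + patch)
                           for x in range(left, left + patch)])
    return tokens


def lsa_attention_weights(queries, keys, temperature):
    """LSA sketch: softmax attention with diagonal masking and a temperature.

    The temperature would be a learnable parameter in a real model; here it
    is a fixed positive float (an assumption made for illustration).
    """
    n = len(queries)
    if n < 2:
        raise ValueError("diagonal masking needs at least two tokens")
    weights = []
    for i in range(n):
        scores = []
        for j in range(n):
            if i == j:
                # Diagonal masking: a token may not attend to itself.
                scores.append(-math.inf)
            else:
                dot = sum(q * k for q, k in zip(queries[i], keys[j]))
                scores.append(dot / temperature)
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

For a 4x4 image with a patch size of 2, `shifted_patch_tokens` returns four tokens of length 20 (five stacked views times four pixels), and the tokens can be passed as both `queries` and `keys` to `lsa_attention_weights` to inspect how masking the diagonal forces each token to distribute attention over the other tokens.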