Are vision transformers replacing convolutional neural networks in scene interpretation?: A review

Bibliographic Details
Published in: Discover Applied Sciences, Vol. 7, no. 9, article 932 (21 pages)
Main Authors: Rosy, N. Arockia; Balasubadra, K.; Deepa, K.
Format: Journal Article
Language: English
Published: Cham: Springer International Publishing, 01.09.2025 (Springer Nature B.V.; Springer)
Summary: Visual scene interpretation is the challenging process of observing, exploring, and describing dynamic scenes, and it underpins reliable and safe interaction with the natural world. Cutting-edge computer vision technology plays a key role here, enabling machines to understand visual scenes in much the same way humans do. Technical advances in computer vision have been overwhelmingly successful, driven primarily by deep learning algorithms. Recently, Vision Transformers (ViTs) have emerged as a viable alternative to Convolutional Neural Networks (CNNs): powered by an attention mechanism, ViT-based approaches have demonstrated competitive or superior performance to CNNs on several benchmark scene interpretation tasks. This review presents a comprehensive and methodical analysis of recent developments in CNN- and ViT-based models for scene recognition. A total of 142 peer-reviewed studies published between 2017 and 2024 were selected under defined inclusion criteria, focusing on works that evaluate these models on public datasets. The review begins with an overview of the architectural foundations and functional variants of CNNs used for scene interpretation. It then examines the structure of ViTs, including their multi-head self-attention mechanisms, and assesses state-of-the-art ViT variants with respect to design innovations, training strategies, and performance metrics. Finally, we discuss possible future research directions for designing ViT models. This study can thus serve as a reference for scholars and practitioners developing new ViT architectures in this domain.
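The abstract highlights the multi-head self-attention mechanism at the core of ViTs. For readers unfamiliar with it, the following is a minimal PyTorch sketch of such a block; the class name, embedding size (768), and head count (12) are illustrative assumptions matching the standard ViT-Base configuration, not details taken from the reviewed article.

```python
# Minimal sketch of a ViT-style multi-head self-attention block (PyTorch).
# Hyperparameters are illustrative assumptions (ViT-Base-like defaults).
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int = 768, num_heads: int = 12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One linear layer produces queries, keys, and values jointly.
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, embed_dim)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)
        # Scaled dot-product attention over all patch tokens.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

# Example: 196 patch tokens from a 224x224 image split into 16x16 patches.
tokens = torch.randn(1, 196, 768)
print(MultiHeadSelfAttention()(tokens).shape)  # torch.Size([1, 196, 768])
```

Because every token attends to every other token, this block gives ViTs the global receptive field that distinguishes them from the local convolutions of CNNs, which is central to the comparison the review draws.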
ISSN: 3004-9261; 2523-3963; 2523-3971
DOI:10.1007/s42452-025-07574-1