Are vision transformers replacing convolutional neural networks in scene interpretation?: A review

Bibliographic Details
Published in: Discover Applied Sciences, Vol. 7, no. 9, article 932 (21 pages)
Main Authors: Rosy, N. Arockia; Balasubadra, K.; Deepa, K.
Format: Journal Article
Language: English
Published: Cham: Springer International Publishing, 01.09.2025 (Springer Nature B.V.; Springer)
Summary: Visual scene interpretation is the challenging process of observing, exploring, and describing dynamic scenes, and it underpins reliable and safe interaction with the natural world. Cutting-edge computer vision technology plays a key role here, enabling machines to understand visual scenes in much the same way humans do. Technical advances in computer vision have been overwhelmingly successful, driven primarily by deep learning algorithms. Recently, Vision Transformers (ViTs) have emerged as a viable alternative to Convolutional Neural Networks (CNNs): powered by an attention mechanism, ViT-based approaches have demonstrated competitive or superior performance to CNNs on several benchmark scene interpretation tasks. This review presents a comprehensive and methodical analysis of recent developments in CNN- and ViT-based models for scene recognition. A total of 142 peer-reviewed studies published between 2017 and 2024 were selected under defined inclusion criteria, focusing on works that evaluate these models on public datasets. The review begins with an overview of the architectural foundations and functional variants of CNNs used for scene interpretation. It then examines the structure of ViTs, including their multi-head self-attention mechanisms, and assesses state-of-the-art ViT variants with respect to design innovations, training strategies, and performance metrics. Finally, we discuss possible future research directions for designing ViT models. This study can thus serve as a reference for scholars and practitioners developing new ViT architectures in this domain.
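The abstract highlights the multi-head self-attention mechanism at the core of ViTs. For readers unfamiliar with it, the following is a minimal PyTorch sketch of such a block; the class name, embedding size (768), and head count (12) are illustrative assumptions matching the standard ViT-Base configuration, not details taken from the reviewed article.

```python
# Minimal sketch of a ViT-style multi-head self-attention block (PyTorch).
# Hyperparameters are illustrative assumptions (ViT-Base-like defaults).
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int = 768, num_heads: int = 12):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One linear layer produces queries, keys, and values jointly.
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, embed_dim)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)
        # Scaled dot-product attention over all patch tokens.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

# Example: 196 patch tokens from a 224x224 image split into 16x16 patches.
tokens = torch.randn(1, 196, 768)
print(MultiHeadSelfAttention()(tokens).shape)  # torch.Size([1, 196, 768])
```

Because every token attends to every other token, this block gives ViTs the global receptive field that distinguishes them from the local convolutions of CNNs, which is central to the comparison the review draws.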
ISSN: 3004-9261; 2523-3963; 2523-3971
DOI:10.1007/s42452-025-07574-1