An encoder-decoder network for crowd counting based on multi-scale attention mechanism

Bibliographic Details
Published in	Multimedia Tools and Applications, Vol. 84, no. 3, pp. 1187-1210
Main Authors	Chuang, Hao-Hsiang; Chen, Yi-Cheng; Lin, Chang Hong
Format	Journal Article
Language	English
Published	New York: Springer US, 01.01.2025; Springer Nature B.V.
Subjects	Computer Communication Networks; Computer Science; Computer vision; Counting; Crowd monitoring; Data Structures and Information Theory; Datasets; Density; Encoders-Decoders; Facial recognition technology; Feature extraction; Feature maps; Methods; Multimedia; Multimedia Information Systems; Public safety; Root-mean-square errors; Sensors; Special Purpose and Application-Based Systems; Skip-connection; Density estimation; Attention mechanism; Multi-scale attention; Crowd counting
Summary:	Crowd counting is a challenging computer vision task that is widely used in video surveillance and public safety applications. As camera resolution and the complexity of crowd images increase, accurately predicting crowd density and crowd count has become an important problem. Recent CNN-based density estimation methods have proven effective in densely populated scenes. In this paper, we present a novel approach to crowd counting based on an Encoder-Decoder Multi-Scale Attention Network. The approach uses the U-Net architecture as the backbone network and integrates an attention mechanism into it. We apply a multi-scale attention method to each layer of the U-Net backbone so that the network extracts features that focus on the crowds rather than on the image background. The attention mechanism and the skip-connections adjust the weights of the feature maps while preserving features at different scales. Extensive experiments on the ShanghaiTech Part_A and Part_B and UCF-QNRF datasets show that the network outperforms existing methods in Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE): ShanghaiTech Part_A (MAE/RMSE: 60.0/104.9), Part_B (MAE/RMSE: 7.8/13.8), and UCF-QNRF (MAE/RMSE: 98.6/179.7).
ISSN:	1573-7721, 1380-7501
DOI:	10.1007/s11042-024-19055-5
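
Note on the reported metrics: the Summary gives results as MAE/RMSE pairs over per-image crowd counts. The sketch below shows how such figures are commonly computed in density-estimation crowd counting, assuming the predicted count for an image is the sum of its predicted density map. It is not taken from the article: the function names and the toy data are hypothetical, and the authors' evaluation code may differ.

```python
import math

def count_from_density(density_map):
    """Predicted count = sum over all pixels of a 2-D density map (list of rows)."""
    return sum(sum(row) for row in density_map)

def mae(pred_counts, true_counts):
    """Mean Absolute Error between predicted and ground-truth counts."""
    errors = [abs(p - t) for p, t in zip(pred_counts, true_counts)]
    return sum(errors) / len(errors)

def rmse(pred_counts, true_counts):
    """Root Mean Squared Error between predicted and ground-truth counts."""
    squared = [(p - t) ** 2 for p, t in zip(pred_counts, true_counts)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical toy data: two tiny "density maps" and their true head counts.
density_maps = [
    [[0.5, 0.5], [1.0, 0.0]],
    [[1.0, 1.0], [1.0, 2.0]],
]
true_counts = [3, 5]

pred_counts = [count_from_density(d) for d in density_maps]
print("MAE:", mae(pred_counts, true_counts))
print("RMSE:", rmse(pred_counts, true_counts))
```

Because RMSE squares each error before averaging, it penalizes large miscounts on individual images more heavily than MAE, which is one reason the two values are usually reported together.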