MSFFA: a multi-scale feature fusion and attention mechanism network for crowd counting

Bibliographic Details
Published in: The Visual Computer, Vol. 39, No. 3, pp. 1045–1056
Main Authors: Li, Zhaoxin; Lu, Shuhua; Dong, Yishan; Guo, Jingyuan
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg (Springer Nature B.V.), 01.03.2023
Summary: Crowd counting has become a growing hot topic in the computer vision community in recent years due to its extensive applications in public safety and commercial planning. However, it remains a challenging task in realistic scenes owing to large-scale variations and complex background interference. In this paper, we propose an efficient end-to-end multi-scale feature fusion and attention mechanism CNN, named MSFFA. The presented network consists of three parts: a front-end low-level feature extractor, a mid-end multi-scale feature fusion operator and a back-end density map generator. Most significantly, in the mid-end we stack three MSFF blocks with residual connections, which on the one hand enables the network to capture continuous large-scale variations and on the other hand enhances information transmission. Meanwhile, a global attention mechanism module is employed to extract effective features in complex background scenes. Our method has been evaluated on three public datasets: ShanghaiTech, UCF-QNRF and UCF_CC_50. Experimental results show that our method outperforms several existing advanced approaches, indicating its excellent accuracy and stability.
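
The abstract describes a three-part architecture: a front-end low-level feature extractor, a mid-end of three stacked MSFF blocks with residual connections plus a global attention module, and a back-end density map generator. The PyTorch sketch below illustrates only that overall structure; the layer widths, dilation rates, the squeeze-and-excitation-style attention, and all other details are assumptions made for illustration, not the authors' published design.

# Illustrative sketch of the three-stage structure described in the abstract.
# Channel counts, kernel sizes, and branch designs are assumptions, not the paper's spec.
import torch
import torch.nn as nn


class MSFFBlock(nn.Module):
    """Hypothetical multi-scale feature fusion block: parallel dilated branches
    fused by a 1x1 conv, with the residual connection mentioned in the abstract."""
    def __init__(self, channels: int = 512):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels // 4, 3, padding=d, dilation=d)
            for d in (1, 2, 3, 4)  # assumed dilation rates for multi-scale context
        ])
        self.fuse = nn.Conv2d(channels, channels, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = torch.cat([self.relu(b(x)) for b in self.branches], dim=1)
        return self.relu(x + self.fuse(y))  # residual connection


class GlobalAttention(nn.Module):
    """Hypothetical channel-wise attention (squeeze-and-excitation style) standing
    in for the global attention mechanism module mentioned in the abstract."""
    def __init__(self, channels: int = 512, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight channels to suppress background clutter


class MSFFA(nn.Module):
    """Three parts from the abstract: front-end extractor, mid-end of three MSFF
    blocks plus global attention, back-end density map generator."""
    def __init__(self):
        super().__init__()
        self.front_end = nn.Sequential(  # assumed VGG-like low-level extractor
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.mid_end = nn.Sequential(*[MSFFBlock(512) for _ in range(3)])
        self.attention = GlobalAttention(512)
        self.back_end = nn.Sequential(  # density map generator
            nn.Conv2d(512, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1),
        )

    def forward(self, x):
        x = self.front_end(x)
        x = self.attention(self.mid_end(x))
        return self.back_end(x)  # predicted density map; its sum is the count


if __name__ == "__main__":
    density = MSFFA()(torch.randn(1, 3, 256, 256))
    print(density.shape, density.sum().item())  # estimated crowd count

In crowd counting networks of this kind, the training target is typically a ground-truth density map (head annotations blurred with Gaussian kernels), and the final count is obtained by summing the predicted map, as in the last line of the usage example.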
ISSN: 0178-2789; 1432-2315
DOI: 10.1007/s00371-021-02383-0