Action recognition in still images using a multi-attention guided network with weakly supervised saliency detection

Bibliographic Details
Published in	Multimedia tools and applications Vol. 80; no. 21-23; pp. 32567 - 32593
Main Authors	Ashrafi, Seyed Sajad; Shokouhi, Shahriar B.; Ayatollahi, Ahmad
Format	Journal Article
Language	English
Published	New York; Springer US; 01.09.2021; Springer Nature B.V.
Subjects	Activity recognition; Computer Communication Networks; Computer Science; Computer vision; Data Structures and Information Theory; Datasets; Modules; Multimedia Information Systems; Object recognition; Special Purpose and Application-Based Systems; Teachers; Still image-based action recognition; Multi-attention; Teacher-student network; Convolutional neural network
Summary:	Action recognition in still images is an interesting subject in computer vision. One of the most important problems in still image-based action recognition is the lack of temporal information; at the same time, other problems such as cluttered backgrounds and diverse objects make the recognition task more challenging. However, each action image may contain several salient regions, and exploiting them could improve recognition performance. Moreover, since no unique and clear definition exists for these salient regions in action recognition images, obtaining reliable ground-truth salient regions is highly challenging. This paper presents a multi-attention guided network with weakly supervised detection of multiple salient regions for action recognition. A teacher-student structure is used to guide the attention of the student model toward the salient regions. The teacher network, with a Salient Region Proposal (SRP) module, generates weakly supervised data for the student network in the training phase. The student network, with a Multi-ATtention (MAT) module, proposes multiple salient regions and predicts the actions based on the information found in the evaluation phase. The proposed method obtains mean Average Precision (mAP) values of 94.2% and 93.80% on the Stanford-40 Actions and PASCAL VOC2012 datasets, respectively. The experimental results, based on the ResNet-50 architecture, show the superiority of the proposed method over existing ones on the Stanford-40 and VOC2012 datasets. Also, we have made a major modification to the BU101 dataset, which is now publicly available. The proposed method achieves an mAP of 90.16% on the new BU101 dataset.
ISSN:	1380-7501 1573-7721
DOI:	10.1007/s11042-021-11215-1
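
Note on the metric: the summary reports results as mean Average Precision (mAP). As a rough illustration only, the sketch below computes a non-interpolated AP per class and averages it over classes. The summary does not state the exact evaluation protocol used in the paper (PASCAL VOC, for instance, has used interpolated variants), so the function names and the non-interpolated definition here are assumptions, not the authors' code.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Non-interpolated AP for one class (assumed definition, not the paper's).

    scores: classifier confidence per image for this class.
    labels: 1 if the image truly belongs to the class, else 0.
    """
    # Rank images from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            # Precision measured at the rank of each true positive.
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    per_class_scores: Sequence[Sequence[float]],
    per_class_labels: Sequence[Sequence[int]],
) -> float:
    """Average the per-class AP values over all action classes."""
    aps = [
        average_precision(s, l)
        for s, l in zip(per_class_scores, per_class_labels)
    ]
    return sum(aps) / len(aps) if aps else 0.0
```

With one list of scores and one list of labels per action class, `mean_average_precision` returns a value in [0, 1]; multiplying by 100 gives a percentage comparable in form to the figures quoted in the summary.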