Attention‐aware spatio‐temporal learning for multi‐view gait‐based age estimation and gender classification

Bibliographic Details
Published in: IET Computer Vision, Vol. 19, No. 1
Main Authors: Huang, Binyuan; Luo, Yongdong; Xie, Jiahui; Pan, Jiahui; Zhou, Chengju
Format: Journal Article
Language: English
Published: 01.01.2025

Summary: Recently, gait-based age and gender recognition have attracted considerable attention in advertisement marketing and surveillance retrieval because gaits can be perceived at a long distance. Intuitively, age and gender can be recognised from a person's static shape (e.g. different hairstyles between males and females) and dynamic motion (e.g. different walking velocities between the elderly and the young). However, most existing gait-based age and gender recognition methods rely on the Gait Energy Image (GEI), which cannot explicitly model temporal dynamics and is not robust to the view variation that inevitably arises in real applications. Therefore, in this study, an Attention-aware Spatio-Temporal Learning (ASTL) framework is proposed, which takes a silhouette sequence as input to learn essential, view-invariant spatio-temporal gait representations. More specifically, a Multi-Scale Temporal Aggregation (MSTA) module provides an effective scheme for describing gait dynamics by exploring and aggregating temporal information over multiple interval scales, serving as a crucial complement to the spatial representation. A Multiple Attention Aggregation (MAA) module is then designed to help the network focus on the most discriminative information along the temporal, spatial and channel dimensions. Finally, a Multimodal Collaborative Learning (MCL) block exploits the complementary strengths of different modal features through a cooperative learning strategy. On OU-MVLP, the method achieves a mean absolute error (MAE) of 6.68 years for age estimation and a correct classification rate (CCR) of 97% for gender classification, demonstrating its superiority. Ablation experiments and visualisation results also confirm the effectiveness of the three individual modules in the proposed framework.
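The abstract names three components (MSTA, MAA, MCL) without implementation detail. Below is a minimal PyTorch sketch, not the authors' published code, of how the first two ideas could look when applied to per-frame silhouette features: parallel temporal convolutions at several window sizes for multi-scale aggregation, and squeeze-and-excitation style gates as a simplified stand-in for the dimension-wise attention. All class names, kernel sizes and reduction ratios here are illustrative assumptions.

import torch
import torch.nn as nn

class MultiScaleTemporalAggregation(nn.Module):
    """Fuses frame-level features with 1D convolutions over several
    temporal window sizes (an assumed reading of the MSTA idea)."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        # x: (batch, channels, time) -- per-frame spatial features over the sequence
        return torch.stack([branch(x) for branch in self.branches]).sum(dim=0)

class MultiDimAttention(nn.Module):
    """Squeeze-and-excitation style gates along the channel and temporal
    dimensions; a simplified stand-in for the MAA module, which the paper
    says also attends over the spatial dimension."""
    def __init__(self, channels, num_frames):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid(),
        )
        self.temporal_gate = nn.Sequential(
            nn.Linear(num_frames, num_frames), nn.Sigmoid()
        )

    def forward(self, x):
        # Pool over time for channel weights, over channels for frame weights
        cw = self.channel_gate(x.mean(dim=2)).unsqueeze(-1)  # (batch, channels, 1)
        tw = self.temporal_gate(x.mean(dim=1)).unsqueeze(1)  # (batch, 1, time)
        return x * cw * tw  # broadcast both gates over the feature map

if __name__ == "__main__":
    feats = torch.randn(2, 64, 30)  # 2 sequences, 64-dim features, 30 frames
    out = MultiDimAttention(64, 30)(MultiScaleTemporalAggregation(64)(feats))
    print(out.shape)  # torch.Size([2, 64, 30])

Summing the parallel branches keeps the channel count fixed so the attention gates apply uniformly; the actual ASTL framework may fuse the scales differently.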
ISSN: 1751-9632, 1751-9640
DOI: 10.1049/cvi2.12165