Adaptive occlusion hybrid second-order attention network for head pose estimation

Bibliographic Details
Published in	International Journal of Machine Learning and Cybernetics, Vol. 15, No. 2, pp. 667-683
Main Authors	Fu, Qi; Xie, Kai; Wen, Chang; He, Jianbiao; Zhang, Wei; Tian, Hongling; Yang, Sheng
Format	Journal Article
Language	English
Published	Springer Berlin Heidelberg (Berlin/Heidelberg), 01.02.2024
Subjects	Artificial Intelligence; Complex Systems; Computational Intelligence; Control; Engineering; Mechatronics; Original Article; Pattern Recognition; Robotics; Systems Biology; Attention mechanism; Exponential map; Occlusion-aware; Head pose estimation

More Information
Summary:	Head pose estimation (HPE) is a challenging and important research problem with applications in driver monitoring, attention recognition, and human-computer interaction. Two problems make HPE difficult. First, occlusion is very common in real application scenarios and substantially degrades HPE accuracy. Second, most existing work represents head pose with Euler angles, which may cause problems when optimizing neural networks. To address these problems, an adaptive occlusion hybrid second-order attention network is proposed. First, an occlusion-aware module detects facial landmarks and generates heat maps that indicate whether specific facial parts are occluded; these heat maps are used to enhance features in the non-occluded parts of the face and suppress features in the occluded regions. Second, a novel second-order information attention module uses second-order statistics to make spatial and channel information interact, so that the model learns the correlations between features of different facial parts, attends to important channels, and suppresses redundant ones, which further reduces the effect of occlusion and extracts more powerful features. Third, to avoid the ambiguity of common head pose representations, the head pose is represented with an exponential map, and a prediction framework is designed that captures the geometry of the pose space. Experiments show that on the BIWI dataset the proposed model is competitive with methods that use depth information, and that on the challenging AFLW2000 dataset it achieves clear advantages, performing more robustly than other models under large poses and occlusion.
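The exponential map mentioned in the summary is, in its standard form, Rodrigues' formula: a rotation vector w (rotation axis scaled by the angle in radians) is mapped to a 3x3 rotation matrix. The map is smooth and is one-to-one for angles below pi, so it avoids the gimbal-lock degeneracy of Euler angles. Below is a minimal pure-Python sketch of that standard map, not the authors' implementation. The function names (`hat`, `matmul`, `exp_map`, `geodesic_angle`) are hypothetical, and the geodesic angle is included only as one common way to measure distance between rotations; the summary does not state which loss or metric the paper uses.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # 3x3 matrix as nested lists (hypothetical helper type)


def hat(w: Sequence[float]) -> Matrix:
    """Skew-symmetric matrix [w]x, so that [w]x times v equals the cross product w x v."""
    x, y, z = w
    return [[0.0, -z, y],
            [z, 0.0, -x],
            [-y, x, 0.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain 3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def exp_map(w: Sequence[float]) -> Matrix:
    """Exponential map so(3) -> SO(3) via Rodrigues' formula (standard form, sketch only).

    w is a rotation vector: its direction is the rotation axis and its norm
    t = |w| is the rotation angle in radians.
    R = I + (sin t / t) [w]x + ((1 - cos t) / t^2) [w]x^2
    """
    t = math.sqrt(sum(c * c for c in w))
    if t < 1e-6:
        # Taylor expansions of sin(t)/t and (1 - cos t)/t^2 avoid dividing by ~0.
        a = 1.0 - t * t / 6.0
        b = 0.5 - t * t / 24.0
    else:
        a = math.sin(t) / t
        b = (1.0 - math.cos(t)) / (t * t)
    k = hat(w)
    k2 = matmul(k, k)
    return [[(1.0 if i == j else 0.0) + a * k[i][j] + b * k2[i][j]
             for j in range(3)] for i in range(3)]


def geodesic_angle(r1: Matrix, r2: Matrix) -> float:
    """Angle in radians of the relative rotation R1^T R2 (geodesic distance on SO(3))."""
    # trace(R1^T R2) equals the sum of the element-wise products of R1 and R2.
    tr = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    c = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp against rounding error
    return math.acos(c)


if __name__ == "__main__":
    # 90-degree rotation about the z-axis.
    r = exp_map([0.0, 0.0, math.pi / 2])
    print([[round(v, 6) for v in row] for row in r])
    # Two rotations about the same axis (0.1 rad and 0.4 rad) are about 0.3 rad apart.
    print(geodesic_angle(exp_map([0.0, 0.0, 0.1]), exp_map([0.0, 0.0, 0.4])))
```

One reason such a parameterization suits regression is that a network can output an unconstrained 3-vector while `exp_map` always yields a valid rotation matrix.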
ISSN:	1868-8071; 1868-808X
DOI:	10.1007/s13042-023-01933-3