Multi-view facial action unit detection via DenseNets and CapsNets

Bibliographic Details
Published in: Multimedia Tools and Applications, Vol. 81, No. 14, pp. 19377-19394
Main Authors: Ren, Dakai; Wen, Xiangmin; Chen, Jiazhong; Han, Yu; Zhang, Shiqi
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.06.2022

Summary: Although standard convolutional neural networks (CNNs) have been proposed to increase the robustness of facial action unit (AU) detection to pose variations, it is difficult to improve detection performance further because standard CNNs are not sufficiently robust to affine transformations. To address this issue, two novel architectures, termed AUCaps and AUCaps++, are proposed for multi-view and multi-label facial AU detection. In both architectures, one or more dense blocks are stacked with a capsule network (CapsNet): the dense blocks placed before the CapsNet learn more discriminative high-level AU features, and the CapsNet learns more view-invariant AU features. Moreover, the capsule types and the digit-capsule dimension are optimized to avoid the computation and storage burden caused by dynamic routing in standard CapsNets. Because AUCaps and AUCaps++ are trained by jointly optimizing a multi-label AU loss and a viewpoint-image reconstruction loss, the proposed method achieves high F1 scores and roughly reconstructs the human face across different AUs. Within-dataset and cross-dataset results on two public datasets show that the average F1 scores of the proposed method outperform competitors using hand-crafted or deep-learning features by a large margin.
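
The architecture described in the abstract (dense blocks feeding a capsule layer, trained with a joint multi-label AU loss and an image reconstruction loss) can be sketched roughly as follows. This is a minimal illustration assuming PyTorch; the layer sizes, capsule counts, single-linear-vote shortcut in place of dynamic routing, and loss weighting are placeholder assumptions, not the authors' actual settings.

# Minimal sketch of the dense-block + capsule idea, assuming PyTorch.
# All hyperparameters below are illustrative, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseBlock(nn.Module):
    """A small dense block: each layer sees all previously produced feature maps."""
    def __init__(self, in_ch, growth=12, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1, bias=False)))
            ch += growth
        self.out_channels = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

def squash(s, dim=-1, eps=1e-8):
    """Capsule squashing non-linearity: keeps vector orientation, bounds its length below 1."""
    norm2 = (s * s).sum(dim=dim, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

class AUCapsSketch(nn.Module):
    """Dense block -> primary capsules -> one digit capsule per AU,
    plus a decoder that reconstructs the view-specific input image."""
    def __init__(self, n_aus=12, img_size=64, caps_dim=8, digit_dim=16):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        self.dense = DenseBlock(32)
        self.primary = nn.Conv2d(self.dense.out_channels, 32 * caps_dim,
                                 3, stride=2, padding=1)
        # Simplification: a single linear vote per AU capsule instead of
        # iterative dynamic routing (the paper tunes capsule types/sizes instead).
        self.vote = nn.LazyLinear(n_aus * digit_dim)
        self.n_aus, self.digit_dim, self.img_size = n_aus, digit_dim, img_size
        self.decoder = nn.Sequential(
            nn.Linear(n_aus * digit_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 3 * img_size * img_size), nn.Sigmoid())

    def forward(self, x):
        h = self.dense(F.relu(self.stem(x)))
        p = self.primary(h).flatten(1)                    # primary capsule activations
        digits = squash(self.vote(p).view(-1, self.n_aus, self.digit_dim))
        au_prob = digits.norm(dim=-1)                     # capsule length = AU presence score
        recon = self.decoder(digits.flatten(1)).view(-1, 3, self.img_size, self.img_size)
        return au_prob, recon

def joint_loss(au_prob, au_labels, recon, image, recon_weight=0.0005):
    """Multi-label AU loss plus a down-weighted image reconstruction loss."""
    au_loss = F.binary_cross_entropy(au_prob, au_labels)
    rec_loss = F.mse_loss(recon, image)
    return au_loss + recon_weight * rec_loss

A typical forward/backward step under these assumptions would be: probs, recon = model(images); loss = joint_loss(probs, labels, recon, images); loss.backward(). The reconstruction term is what lets the model "learn the human face roughly" in the reconstructed images, as the abstract describes.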
ISSN: 1380-7501, 1573-7721
DOI: 10.1007/s11042-021-11147-w