AF-Transformer: Attention Fusion Transformer for Facial Expression Recognition
Published in | 2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), pp. 939 - 942
---|---
Main Author |
Format | Conference Proceeding
Language | English
Published | IEEE, 20.05.2022
Subjects |
Summary | Facial expression recognition (FER) in the wild is exceedingly difficult due to occlusions, variable head poses, face deformations, and motion blur under unconstrained conditions. Despite significant progress in automated FER over the last few decades, previous studies were mostly developed for laboratory-controlled settings; real-world occlusions, varied head poses, and complicated backgrounds leave regions poorly informed and significantly raise the difficulty of FER. Unlike prior purely CNN-based approaches, we argue that converting face images into visual word sequences and performing expression recognition globally is viable and practical. We therefore develop the Attention Fusion Transformer (AF-Transformer) as a two-step solution to FER in the wild. First, we propose Attention Fusion (AF) to combine the two feature maps produced by a dual-branch CNN; by combining multiple features with global-local attention, AF captures discriminative information. The fused feature maps are then flattened and projected into visual word sequences. Second, inspired by the success of Transformers in natural language processing, we propose modelling the relationships among these visual words with global self-attention. The proposed approach is evaluated on three publicly available in-the-wild facial expression datasets, and extensive experiments show that it outperforms comparable methods.
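The pipeline sketched in the abstract (fuse two CNN feature maps with attention weights, flatten the result into visual word tokens, then relate all tokens with global self-attention) can be illustrated with a minimal numpy sketch. All shapes, the energy-based fusion rule, and every function name here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the AF-Transformer stages described in the
# abstract; the fusion rule and shapes are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(f1, f2):
    """Fuse two (C, H, W) feature maps with per-location attention weights."""
    # Score each branch by its channel-wise energy at every spatial location
    # (a stand-in for the paper's global-local attention).
    s1 = (f1 ** 2).mean(axis=0)               # (H, W)
    s2 = (f2 ** 2).mean(axis=0)
    w = softmax(np.stack([s1, s2]), axis=0)   # (2, H, W), sums to 1 per pixel
    return w[0] * f1 + w[1] * f2              # (C, H, W)

def to_visual_words(fmap, proj):
    """Flatten (C, H, W) to H*W tokens and project each to d_model dims."""
    C, H, W = fmap.shape
    tokens = fmap.reshape(C, H * W).T         # (H*W, C): one token per location
    return tokens @ proj                      # (H*W, d_model)

def self_attention(x, Wq, Wk, Wv):
    """Single-head global self-attention over the visual word sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)
    return scores @ v

# Toy dual-branch feature maps: 8 channels over a 4x4 spatial grid.
C, H, W, d = 8, 4, 4, 16
f1, f2 = rng.normal(size=(C, H, W)), rng.normal(size=(C, H, W))
fused = attention_fusion(f1, f2)
tokens = to_visual_words(fused, rng.normal(size=(C, d)))
out = self_attention(tokens, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (16, 16): one attended token per spatial location
```

The fusion weights sum to one at every pixel, so the fused map stays on the same scale as its inputs; the self-attention step then lets every visual word attend to all others, which is the "global" modelling the abstract contrasts with purely convolutional receptive fields.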
DOI | 10.1109/CVIDLICCEA56201.2022.9824452