LOGO-Former: Local-Global Spatio-Temporal Transformer for Dynamic Facial Expression Recognition
Main Authors | , , |
---|---|
Format | Journal Article |
Language | English |
Published | 05.05.2023 |
Summary: | Previous methods for dynamic facial expression recognition (DFER) in the wild
are mainly based on Convolutional Neural Networks (CNNs), whose local
operations ignore the long-range dependencies in videos. Transformer-based
methods for DFER can achieve better performance but result in higher FLOPs and
computational costs. To solve these problems, the local-global spatio-temporal
Transformer (LOGO-Former) is proposed to capture discriminative features within
each frame and model contextual relationships among frames while balancing the
complexity. Based on the priors that facial muscles move locally and facial
expressions gradually change, we first restrict both the space attention and
the time attention to a local window to capture local interactions among
feature tokens. Furthermore, we perform the global attention by querying a
token with features from each local window iteratively to obtain long-range
information of the whole video sequence. In addition, we propose a compact
loss regularization term that further encourages the learned features to have
minimum intra-class distance and maximum inter-class distance. Experiments
on two in-the-wild dynamic facial expression datasets (i.e., DFEW and FERV39K)
indicate that our method provides an effective way to make use of the spatial
and temporal dependencies for DFER. |
---|---|
DOI: | 10.48550/arxiv.2305.03343 |
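
The local and global attention steps described in the summary can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation. The function names, the non-overlapping window partition, the identity query/key/value projections, the mean-pooled window summaries, and the mean-pooled global query are all assumptions made for the example. The summary describes the global querying as iterative, whereas this sketch uses a single pass for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def local_window_attention(tokens, window):
    """Self-attention restricted to non-overlapping windows (assumed layout).

    Each token only attends to tokens in its own window, so the cost grows
    with len(tokens) * window instead of len(tokens) ** 2.
    """
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        out.extend(attend(q, chunk, chunk) for q in chunk)
    return out


def global_attention(tokens, window):
    """One global query attends to a summary of every local window.

    Assumptions: window summaries and the global query are plain means.
    """
    dim = len(tokens[0])
    summaries = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        summaries.append([sum(t[d] for t in chunk) / len(chunk) for d in range(dim)])
    query = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return attend(query, summaries, summaries)


# Hypothetical input: four frame-level feature tokens of dimension 2.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
local = local_window_attention(frames, window=2)
video_feature = global_attention(local, window=2)
```

Because each window is processed independently, changing a token in one window leaves the local outputs of every other window unchanged. Only the global step mixes information across the whole sequence.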