DEP-Former: Multimodal Depression Recognition Based on Facial Expressions and Audio Features via Emotional Changes

Bibliographic Details
Published in IEEE transactions on circuits and systems for video technology Vol. 35; no. 3; pp. 2087-2100
Main Authors Ye, Jiayu; Yu, Yanhong; Lu, Lin; Wang, Hao; Zheng, Yunshao; Liu, Yang; Wang, Qingxiang
Format Journal Article
Language English
Published IEEE 01.03.2025
Summary: Clinical research has demonstrated that using audiovisual technology to explore differences in behavioral signals between depressed patients and non-depressed people is an effective approach to depression recognition. In this paper we therefore propose an emotion word reading experiment (EWRE) and extract features from facial expressions and audio for depression recognition. Building on this, we propose a depression recognition model, DEP-Former, which deeply integrates multimodal features. DEP-Former first introduces a modality adapter that performs emotion space mapping and shares multimodal features, addressing cross-modal inconsistencies. It also proposes an attention index sharing mechanism, which overcomes the limitations of cognitive subjectivity by calculating confidence in key emotional information across modalities. Finally, a multimodal cross-attention module and a Bernoulli distribution feature fusion prediction module achieve deep integration of multilevel information, thereby enabling depression recognition. Compared with existing advanced multimodal models, DEP-Former demonstrates superior performance on EWRE, achieving an accuracy of 0.9500 and an F1 score of 0.9499 and significantly improving depression recognition over single-modality methods. Its robust generalization ability is further validated on the AVEC 2014 dataset. Through the attention query of the interpretability analysis module, we find that depressed patients exhibit heightened sensitivity to negative emotional words, such as dismissal and tragedy, whereas healthy individuals tend to be more attuned to positive emotional words, including passion, purity, and justice. Depressed patients also exhibit a degree of psychological state diversity, showing sensitivity to some positive emotional words as well. Our code and data are available at https://github.com/QLUTEmoTechCrew/DEP-Former.
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3491098