MFE-Former: Disentangling Emotion-Identity Dynamics via Self-Supervised Learning for Enhancing Speech-Driven Depression Detection

Bibliographic Details
Published in: IEEE Journal of Biomedical and Health Informatics, Vol. PP, pp. 1-12
Main Authors: Wang, Hao; Ye, Jiayu; Yu, Yanhong; Lu, Lin; Yuan, Lin; Wang, Qingxiang
Format: Journal Article
Language: English
Published: United States, IEEE, 01.08.2025
Summary: Acoustic features are crucial behavioral indicators for depression detection. However, prior speech-based depression detection methods often overlook the variability of emotional patterns across samples, leading to interference from speaker identity and hindering the effective extraction of emotional changes. To address this limitation, we developed the Emotional Word Reading Experiment (EWRE) and introduced MFE-Former, a method combining self-supervised and supervised learning for depression detection from speech. First, we generate fine-grained emotional representations for response segments by computing cosine similarity between intra-sample and inter-sample contexts. Concurrently, orthogonality constraints decouple identity information from emotional features, while a Transformer decoder reconstructs spectral structures to improve sensitivity to depression-related emotional patterns. Next, we propose a multi-scale emotion change perception module and a Bernoulli distribution-based joint decision module that integrate multi-level information for depression detection. By enhancing the distribution differences among positive, neutral, and negative emotional features, we find that patients with depression are more inclined to express negative emotions, whereas healthy individuals express more positive emotions. Experimental results on EWRE and AVEC 2014 show that MFE-Former outperforms state-of-the-art temporal methods under conditions of variability in emotional patterns across samples. MFE-Former has been open-sourced at https://github.com/QLUTEmoTechCrew/MFE-Former.
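
The summary names two concrete mechanisms: cosine similarity computed between intra-sample and inter-sample contexts to form fine-grained emotional representations, and orthogonality constraints that decouple speaker identity from emotional features. The short PyTorch sketch below illustrates only those two ideas; it is not the authors' released implementation (see the linked repository for that), and the tensor shapes, function names, and the specific squared cross-correlation penalty used here are assumptions.

# Hypothetical sketch of two mechanisms named in the summary; shapes,
# names, and the exact penalty form are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def cosine_context_scores(intra_ctx: torch.Tensor, inter_ctx: torch.Tensor) -> torch.Tensor:
    # intra_ctx: (num_segments, dim) embeddings of response segments in one sample
    # inter_ctx: (num_refs, dim) context embeddings pooled from other samples
    # returns:   (num_segments, num_refs) cosine-similarity matrix
    intra = F.normalize(intra_ctx, dim=-1)
    inter = F.normalize(inter_ctx, dim=-1)
    return intra @ inter.T

def orthogonality_loss(emotion_feat: torch.Tensor, identity_feat: torch.Tensor) -> torch.Tensor:
    # Penalize correlation between emotion and identity features so the two
    # subspaces stay approximately orthogonal. Both inputs: (batch, dim).
    e = F.normalize(emotion_feat, dim=-1)
    i = F.normalize(identity_feat, dim=-1)
    return (e.T @ i).pow(2).mean()  # squared cross-correlation, averaged

if __name__ == "__main__":
    torch.manual_seed(0)
    scores = cosine_context_scores(torch.randn(8, 128), torch.randn(32, 128))
    loss = orthogonality_loss(torch.randn(16, 128), torch.randn(16, 128))
    print(scores.shape, loss.item())

A training objective along these lines would presumably add the orthogonality term, weighted by a hyperparameter, to the reconstruction and classification losses described in the summary.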
ISSN: 2168-2194
EISSN: 2168-2208
DOI: 10.1109/JBHI.2025.3594166