Decoupled Multi-perspective Fusion for Speech Depression Detection

Bibliographic Details
Published in: IEEE Transactions on Affective Computing, pp. 1-15
Main Authors: Zhao, Minghui; Gao, Hongxiang; Zhao, Lulu; Wang, Zhongyu; Wang, Fei; Zheng, Wenming; Li, Jianqing; Liu, Chengyu
Format: Journal Article
Language: English
Published: IEEE, 04.02.2025

More Information
Summary: Speech Depression Detection (SDD) has garnered attention from researchers due to its low cost and convenience. However, current algorithms lack methods for extracting interpretable acoustic features based on clinical manifestations. In addition, effectively fusing these features to overcome individual heterogeneity remains a challenge. This study proposes a decoupled multi-perspective fusion (DMPF) model. The model extracts five key features (voiceprint, emotion, pause, energy, and tremor) based on multi-perspective clinical manifestations. These features are then decoupled into common and private components, which are fused through a graph attention network to obtain a comprehensive depression representation. Notably, this study collected a depression speech dataset that includes standardized, comprehensive tasks along with diagnostic labels provided by psychologists. Extensive subject-independent experiments were conducted on the DAIC-WOZ, MODMA, and MPSC datasets. The voiceprint features automatically cluster the depressed and non-depressed populations. Furthermore, DMPF effectively fuses common and private features from different perspectives, achieving AUCs of 84.20%, 85.34%, and 86.13% on the three datasets. The results illustrate the interpretability of the multi-perspective features and demonstrate that combining speech manifestations enhances detection ability, providing a multi-perspective observational tool for physicians and clinical practice. Code is available at https://github.com/zmh56/SDD-for-DMPF-MPSC.
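The fusion idea summarized above (per-perspective features decoupled into common and private components, then combined with graph attention) can be illustrated with a short PyTorch sketch. This is not the authors' implementation (see the linked repository for the official code); the module names, dimensions, and the simple single-head attention layer below are assumptions made only for demonstration.

```python
# Illustrative sketch of decoupled multi-perspective fusion (hypothetical,
# not the authors' code). Five perspective features (voiceprint, emotion,
# pause, energy, tremor) are embedded, split into shared "common" and
# per-perspective "private" parts, and fused by graph attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head graph attention over a fully connected perspective graph."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, h):                      # h: (batch, nodes, dim)
        Wh = self.W(h)
        n = Wh.size(1)
        # Pairwise concatenation to compute attention logits e_ij.
        hi = Wh.unsqueeze(2).expand(-1, -1, n, -1)
        hj = Wh.unsqueeze(1).expand(-1, n, -1, -1)
        e = F.leaky_relu(self.a(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        alpha = torch.softmax(e, dim=-1)       # attention over neighbours
        return F.elu(torch.matmul(alpha, Wh))  # aggregated node features

class DMPFSketch(nn.Module):
    def __init__(self, feat_dims, dim=64, n_classes=2):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, dim) for d in feat_dims)
        self.common = nn.Linear(dim, dim)      # shared projector (common part)
        self.privates = nn.ModuleList(nn.Linear(dim, dim) for _ in feat_dims)
        self.gat = GraphAttentionLayer(dim)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, feats):                  # list of (batch, feat_dims[i])
        nodes = []
        for x, enc, priv in zip(feats, self.encoders, self.privates):
            h = torch.relu(enc(x))
            nodes.append(self.common(h) + priv(h))  # decoupled, then recombined
        h = torch.stack(nodes, dim=1)          # (batch, 5, dim) perspective graph
        fused = self.gat(h).mean(dim=1)        # graph-attention fusion + pooling
        return self.cls(fused)

# Toy usage: random tensors stand in for the five perspective feature vectors.
model = DMPFSketch(feat_dims=[128, 64, 16, 16, 32])
feats = [torch.randn(8, d) for d in [128, 64, 16, 16, 32]]
print(model(feats).shape)                      # torch.Size([8, 2])
```

In this sketch the "decoupling" is reduced to separate shared and perspective-specific projections; the paper's actual decoupling objective and graph construction may differ and should be taken from the official repository.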
ISSN: 1949-3045
DOI: 10.1109/TAFFC.2025.3538519