Multi-modal depression detection based on emotional audio and evaluation text


Bibliographic Details
Published in Journal of Affective Disorders Vol. 295; pp. 904-913
Main Authors Ye, Jiayu, Yu, Yanhong, Wang, Qingxiang, Li, Wentao, Liang, Hu, Zheng, Yunshao, Fu, Gang
Format Journal Article
Language English
Published Elsevier B.V. 01.12.2021

Summary:
•We propose and validate a text-reading experiment that makes subjects' emotions change rapidly.
•Feature analysis (low-level audio features, DeepSpectrum features, and word-vector features).
•We propose a multi-modal fusion method based on DeepSpectrum features and word-vector features to detect depression using deep learning.

Early detection of depression is very important for the treatment of patients. Given the inefficiency of current screening methods, depression identification technology is a complex research problem with practical application value. We propose a new experimental method for depression detection based on audio and text. 160 Chinese subjects were investigated in this study. Notably, we propose a text-reading experiment that makes subjects' emotions change rapidly, referred to below as the Segmental Emotional Speech Experiment (SESE). We extract 384-dimensional low-level audio features to find differences across the emotional changes in SESE. At the same time, we propose a multi-modal fusion method based on DeepSpectrum features and word-vector features to detect depression using deep learning. Our experiments show that SESE improves the recognition accuracy of depression and reveals differences in the low-level audio features; results are verified across groupings by case and control group, gender, and age. The multi-modal fusion model achieves an accuracy of 0.912 and an F1 score of 0.906. Our contribution is twofold. First, we propose and verify SESE, which offers a new experimental design for follow-up researchers. Second, we propose a new, efficient multi-modal depression recognition model.
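The summary leaves the fusion architecture unspecified. The sketch below shows one common reading of "feature-level multi-modal fusion": a DeepSpectrum audio embedding and a word-vector text embedding are each projected to a shared space, concatenated, and passed to a binary classifier. The 4096-d audio input (typical of VGG16-based DeepSpectrum features), the 300-d text input, the layer sizes, and the use of PyTorch are all illustrative assumptions, not the authors' published design.

import torch
import torch.nn as nn

class FusionDepressionClassifier(nn.Module):
    """Hypothetical feature-level fusion of audio and text embeddings."""

    def __init__(self, audio_dim=4096, text_dim=300, hidden=256):
        super().__init__()
        # Audio branch: DeepSpectrum features are CNN activations extracted
        # from spectrogram plots (4096-d assumed, e.g. a VGG16 fc layer).
        self.audio_branch = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(), nn.Dropout(0.3))
        # Text branch: a fixed-size word-vector representation of the
        # evaluation text (mean-pooled 300-d embeddings assumed).
        self.text_branch = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(), nn.Dropout(0.3))
        # Fusion by concatenation, then a two-class head
        # (depressed case vs. control).
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))

    def forward(self, audio_feats, text_feats):
        fused = torch.cat([self.audio_branch(audio_feats),
                           self.text_branch(text_feats)], dim=-1)
        return self.classifier(fused)

# Usage with random stand-in features for a batch of 8 subjects.
model = FusionDepressionClassifier()
logits = model(torch.randn(8, 4096), torch.randn(8, 300))
print(logits.shape)  # torch.Size([8, 2])

Concatenation is the simplest fusion choice; attention-based or decision-level fusion would also fit the summary's description, and the paper should be consulted for the actual architecture.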
ISSN:0165-0327
1573-2517
DOI:10.1016/j.jad.2021.08.090