Multimodal Emotion Recognition Using Transfer Learning on Audio and Text Data

Bibliographic Details
Published in: Computational Science and Its Applications – ICCSA 2021, pp. 552–563
Main Authors: Deng, James J.; Leung, Clement H. C.; Li, Yuanxi
Format: Book Chapter
Language: English
Published: Cham: Springer International Publishing
Series: Lecture Notes in Computer Science
Summary: Emotion recognition has been extensively studied in single modalities over the last decade. However, humans usually express their emotions through multiple modalities such as voice, facial expressions, or text. In this paper, we propose a new method to find a unified emotion representation for multimodal emotion recognition from speech audio and text. An emotion-based feature representation of speech audio is learned with an unsupervised triplet-loss objective, and a text-to-text transformer network is constructed to extract latent emotional meaning. Because training deep neural network models on huge datasets consumes prohibitive resources, transfer learning provides a powerful and reusable technique for fine-tuning emotion recognition models pre-trained on large audio and text datasets, respectively. Automatic multimodal fusion of the emotion-based features from speech audio and text is performed by a new transformer. Both the accuracy and robustness of the proposed method are evaluated, and we show that our method for multimodal fusion using transfer learning in emotion recognition achieves good results.
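
The summary names two building blocks: an audio encoder trained with a triplet-loss objective and a transformer that fuses audio and text embeddings. The following minimal PyTorch sketch illustrates those ideas only; the class names, network sizes, and triplet sampling strategy are assumptions made for illustration and are not the chapter's actual implementation.

    # Illustrative sketch only (not the chapter's code): a triplet-loss audio
    # embedding and a small transformer that fuses audio and text features.
    import torch
    import torch.nn as nn

    class AudioEncoder(nn.Module):
        """Maps a log-mel spectrogram sequence to a normalized emotion embedding."""
        def __init__(self, n_mels=64, embed_dim=128):
            super().__init__()
            self.gru = nn.GRU(n_mels, embed_dim, batch_first=True)

        def forward(self, x):                          # x: (batch, frames, n_mels)
            _, h = self.gru(x)                         # h: (1, batch, embed_dim)
            return nn.functional.normalize(h.squeeze(0), dim=-1)

    class FusionTransformer(nn.Module):
        """Treats the audio and text embeddings as two tokens, fuses them with
        self-attention, and predicts an emotion class."""
        def __init__(self, embed_dim=128, n_classes=4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(embed_dim, n_classes)

        def forward(self, audio_emb, text_emb):        # each: (batch, embed_dim)
            tokens = torch.stack([audio_emb, text_emb], dim=1)  # (batch, 2, embed_dim)
            return self.head(self.encoder(tokens).mean(dim=1))

    audio_enc = AudioEncoder()
    triplet_loss = nn.TripletMarginLoss(margin=0.5)

    # Unsupervised triplet objective: anchor and positive are segments of the same
    # utterance, the negative comes from a different one (a common sampling choice,
    # assumed here rather than taken from the chapter).
    anchor, positive, negative = (torch.randn(8, 100, 64) for _ in range(3))
    loss = triplet_loss(audio_enc(anchor), audio_enc(positive), audio_enc(negative))
    loss.backward()

    # Fusion with a placeholder text embedding (e.g., from a pre-trained text encoder).
    text_emb = torch.randn(8, 128)
    logits = FusionTransformer()(audio_enc(anchor).detach(), text_emb)

In this sketch the pre-trained audio and text encoders would be fine-tuned (transfer learning), while the fusion transformer is trained from scratch on the downstream emotion labels.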
ISBN: 9783030869694; 3030869695
ISSN: 0302-9743; 1611-3349
DOI: 10.1007/978-3-030-86970-0_39