Multimodal Embeddings From Language Models for Emotion Recognition in the Wild

Bibliographic Details
Published in	IEEE Signal Processing Letters, Vol. 28, pp. 608-612
Main Authors	Tseng, Shao-Yen; Narayanan, Shrikanth; Georgiou, Panayiotis
Format	Journal Article
Language	English
Published	New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2021
Subjects	Acoustics; Audio data; Bit error rate; Context modeling; Convolution; Emotion recognition; Emotions; Feature extraction; Language; Machine learning; Natural language processing; speech processing; Task analysis; unsupervised learning; Words (language)
Summary:	Word embeddings such as ELMo and BERT have been shown to model word usage in language with greater efficacy through contextualized learning on large-scale language corpora, resulting in significant performance improvement across many natural language processing tasks. In this work we integrate acoustic information into contextualized lexical embeddings through the addition of a parallel stream to the bidirectional language model. This multimodal language model is trained on spoken language data that includes both text and audio modalities. We show that embeddings extracted from this model integrate paralinguistic cues into word meanings and can provide vital affective information by applying these multimodal embeddings to the task of speaker emotion recognition.
ISSN:	1070-9908 1558-2361
DOI:	10.1109/LSP.2021.3065598