Fusion of deep learning features with mixture of brain emotional learning for audio-visual emotion recognition

Bibliographic Details
Published in: Speech Communication, Vol. 127, pp. 92–103
Main Authors: Farhoudi, Zeinab; Setayeshi, Saeed
Format: Journal Article
Language: English
Published: Amsterdam: Elsevier B.V. (Elsevier Science Ltd), 01.03.2021
Summary:
Highlights:
• Propose an audio-visual fusion method with a bio-inspired Mixture of Brain Emotional Learning (MoBEL) model for emotion recognition.
• Use 3D-CNN and CRNN models to extract spatial-temporal features of the visual and audio modalities, respectively.
• Jointly learn the audio-visual features and allocate a weight to each modality.
• Achieve a high accuracy rate with low memory consumption through the MoBEL model.

Abstract: Multimodal emotion recognition is a challenging task because the different modalities in a video clip express emotions over a specific span of time. Considering the spatial-temporal correlation present in video, we propose an audio-visual fusion model that combines deep learning features with a Mixture of Brain Emotional Learning (MoBEL) model inspired by the brain's limbic system. The proposed model consists of two stages. First, deep learning methods, specifically a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), are applied to extract highly abstract features. Second, the fusion model, MoBEL, jointly learns from the audio-visual features extracted in the first stage. For the visual modality, a 3D-CNN is used to learn the spatial-temporal features of facial expressions. For the auditory modality, Mel-spectrograms of the speech signals are fed into a CNN-RNN for spatial-temporal feature extraction. A high-level feature fusion approach based on the MoBEL network exploits the correlation between the visual and auditory modalities to improve emotion recognition performance. Experimental results on the eNTERFACE'05 database demonstrate that the proposed method outperforms hand-crafted features and other state-of-the-art information fusion models in video emotion recognition.
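The abstract describes a two-stream pipeline: a 3D-CNN over face clips, a CNN-RNN over Mel-spectrograms, and a fusion network that weights the two modalities. The sketch below illustrates only that overall structure; all layer sizes, the Mel-spectrogram shape, and the gating-based fusion module standing in for MoBEL are assumptions, since this record does not give the model's architecture details or equations.

```python
# Minimal PyTorch-style sketch of the two-stream pipeline in the abstract.
# Layer sizes, spectrogram shape, and the gated fusion (a stand-in for the
# paper's MoBEL network) are illustrative assumptions, not the authors' model.
import torch
import torch.nn as nn

class Visual3DCNN(nn.Module):
    """3D-CNN over a clip of face frames -> spatial-temporal feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, clips):               # clips: (B, 3, T, H, W)
        return self.fc(self.conv(clips).flatten(1))

class AudioCRNN(nn.Module):
    """CNN over Mel-spectrograms followed by a GRU over the time axis."""
    def __init__(self, n_mels=64, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),            # pool frequency, keep time steps
        )
        self.rnn = nn.GRU(16 * (n_mels // 2), feat_dim, batch_first=True)

    def forward(self, mels):                 # mels: (B, 1, n_mels, T)
        x = self.cnn(mels)                   # (B, 16, n_mels/2, T)
        x = x.flatten(1, 2).transpose(1, 2)  # (B, T, 16 * n_mels/2)
        _, h = self.rnn(x)
        return h[-1]                         # (B, feat_dim)

class GatedFusion(nn.Module):
    """Illustrative fusion: learns per-modality weights, then classifies."""
    def __init__(self, feat_dim=128, n_classes=6):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, v_feat, a_feat):
        w = self.gate(torch.cat([v_feat, a_feat], dim=-1))   # (B, 2)
        fused = w[:, :1] * v_feat + w[:, 1:] * a_feat        # weighted sum
        return self.classifier(fused)

# Dummy usage: 2 clips of 16 RGB frames (64x64) and 100-frame Mel-spectrograms.
visual, audio, fusion = Visual3DCNN(), AudioCRNN(), GatedFusion()
logits = fusion(visual(torch.randn(2, 3, 16, 64, 64)),
                audio(torch.randn(2, 1, 64, 100)))
print(logits.shape)   # torch.Size([2, 6]) -- six basic emotions in eNTERFACE'05
```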
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2020.12.001