Learning deep features to recognise speech emotion using merged deep CNN

Bibliographic Details
Published in: IET Signal Processing, Vol. 12, No. 6, pp. 713–721
Main Authors: Zhao, Jianfeng; Mao, Xia; Chen, Lijiang
Format: Journal Article
Language: English
Published: The Institution of Engineering and Technology, 01.08.2018

Summary: This study aims at learning deep features from different data to recognise speech emotion. The authors designed a merged convolutional neural network (CNN) with two branches, a one-dimensional (1D) CNN branch and a two-dimensional (2D) CNN branch, to learn high-level features from raw audio clips and log-mel spectrograms. The merged deep CNN was built in two steps. First, a 1D CNN architecture and a 2D CNN architecture were designed and evaluated; then, after deletion of their second dense layers, the two architectures were merged. To speed up training of the merged CNN, transfer learning was introduced: the 1D CNN and the 2D CNN were trained first, and their learned features were repurposed and transferred to the merged CNN. Finally, the merged deep CNN, initialised with the transferred features, was fine-tuned. Two hyperparameters of the designed architectures were chosen through Bayesian optimisation during training. Experiments on two benchmark datasets show that the merged deep CNN improves emotion classification performance significantly.
ISSN: 1751-9675, 1751-9683
DOI: 10.1049/iet-spr.2017.0320
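The merging scheme described in the summary (a 1D branch over the raw waveform, a 2D branch over the log-mel spectrogram, features concatenated and fed to a shared classification head) can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the input sizes, kernel sizes, number of emotion classes, and random weights are all hypothetical, and the transfer-learning and Bayesian-optimisation steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, k):
    # valid 1-D cross-correlation over a raw-audio vector
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def conv2d(x, k):
    # valid 2-D cross-correlation over a log-mel spectrogram patch
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k)
                      for j in range(w)] for i in range(h)])

def relu(z):
    return np.maximum(z, 0)

# hypothetical inputs: a short raw-audio clip and a small log-mel patch
audio = rng.standard_normal(160)       # raw waveform samples
logmel = rng.standard_normal((8, 8))   # log-mel spectrogram patch

# branch 1: 1-D CNN on the raw waveform
f1 = relu(conv1d(audio, rng.standard_normal(5)))          # -> 156 features

# branch 2: 2-D CNN on the log-mel spectrogram, flattened
f2 = relu(conv2d(logmel, rng.standard_normal((3, 3)))).ravel()  # -> 36 features

# merge: concatenate both branches' features, then a dense softmax head
merged = np.concatenate([f1, f2])                         # 192 merged features
W = rng.standard_normal((4, merged.size))                 # 4 hypothetical classes
logits = W @ merged
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                      # probs.shape == (4,)
```

In the paper's design the dense layers after each branch are what get deleted before merging; here that corresponds to concatenating the (flattened) convolutional feature maps directly, so the single dense head operates on both modalities at once.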