Learning deep features to recognise speech emotion using merged deep CNN

Bibliographic Details
Published in: IET Signal Processing, Vol. 12, No. 6, pp. 713–721
Main Authors: Zhao, Jianfeng; Mao, Xia; Chen, Lijiang
Format: Journal Article
Language: English
Published: The Institution of Engineering and Technology, 01.08.2018

Summary: This study aims at learning deep features from different data to recognise speech emotion. The authors designed a merged convolutional neural network (CNN) with two branches, a one-dimensional (1D) CNN branch and a two-dimensional (2D) CNN branch, to learn high-level features from raw audio clips and log-mel spectrograms. The merged deep CNN was built in two steps. First, a 1D CNN architecture and a 2D CNN architecture were designed and evaluated; then, after deletion of their second dense layers, the two architectures were merged. To speed up training of the merged CNN, transfer learning was introduced: the 1D CNN and the 2D CNN were trained first, and their learned features were repurposed and transferred to the merged CNN. Finally, the merged deep CNN, initialised with the transferred features, was fine-tuned. Two hyperparameters of the designed architectures were chosen through Bayesian optimisation during training. Experiments on two benchmark datasets show that the merged deep CNN improves emotion classification performance significantly.
ISSN: 1751-9675, 1751-9683
DOI: 10.1049/iet-spr.2017.0320
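The merging scheme described in the summary (a 1D branch over the raw waveform, a 2D branch over the log-mel spectrogram, features concatenated and fed to a shared classification head) can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the input sizes, kernel sizes, number of emotion classes, and random weights are all hypothetical, and the transfer-learning and Bayesian-optimisation steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, k):
    # valid 1-D cross-correlation over a raw-audio vector
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

def conv2d(x, k):
    # valid 2-D cross-correlation over a log-mel spectrogram patch
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k)
                      for j in range(w)] for i in range(h)])

def relu(z):
    return np.maximum(z, 0)

# hypothetical inputs: a short raw-audio clip and a small log-mel patch
audio = rng.standard_normal(160)       # raw waveform samples
logmel = rng.standard_normal((8, 8))   # log-mel spectrogram patch

# branch 1: 1-D CNN on the raw waveform
f1 = relu(conv1d(audio, rng.standard_normal(5)))          # -> 156 features

# branch 2: 2-D CNN on the log-mel spectrogram, flattened
f2 = relu(conv2d(logmel, rng.standard_normal((3, 3)))).ravel()  # -> 36 features

# merge: concatenate both branches' features, then a dense softmax head
merged = np.concatenate([f1, f2])                         # 192 merged features
W = rng.standard_normal((4, merged.size))                 # 4 hypothetical classes
logits = W @ merged
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                      # probs.shape == (4,)
```

In the paper's design the dense layers after each branch are what get deleted before merging; here that corresponds to concatenating the (flattened) convolutional feature maps directly, so the single dense head operates on both modalities at once.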