Learning deep features to recognise speech emotion using merged deep CNN
Published in: IET Signal Processing, Vol. 12, No. 6, pp. 713–721
Format: Journal Article
Language: English
Published: The Institution of Engineering and Technology, 01.08.2018
Summary: This study aims at learning deep features from different data representations to recognise speech emotion. The authors designed a merged convolutional neural network (CNN) with two branches, a one-dimensional (1D) CNN branch and a 2D CNN branch, to learn high-level features from raw audio clips and log-mel spectrograms, respectively. The merged deep CNN was built in two steps. First, a 1D CNN and a 2D CNN architecture were designed and evaluated; then, after deleting each architecture's second dense layer, the two architectures were merged. To speed up training of the merged CNN, transfer learning was introduced: the 1D CNN and the 2D CNN were trained first, their learned features were repurposed and transferred to the merged CNN, and the merged deep CNN initialised with the transferred features was then fine-tuned. Two hyperparameters of the designed architectures were chosen through Bayesian optimisation during training. Experiments on two benchmark datasets show that the merged deep CNN improves emotion classification performance significantly.
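The two-branch design described in the summary can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the layer counts, filter sizes, dense-layer width, and number of emotion classes are all assumptions, and only the overall structure (a 1D branch on raw audio, a 2D branch on log-mel spectrograms, branches truncated after their first dense layer and concatenated for joint classification) follows the abstract.

```python
import torch
import torch.nn as nn

class MergedCNN(nn.Module):
    """Illustrative merged 1D/2D CNN; hyperparameters are assumptions."""
    def __init__(self, n_classes=4):
        super().__init__()
        # 1D branch: high-level features from the raw audio waveform
        self.branch1d = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, 64), nn.ReLU(),  # branch's first (kept) dense layer
        )
        # 2D branch: high-level features from the log-mel spectrogram
        self.branch2d = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 64), nn.ReLU(),
        )
        # With each branch's second dense layer removed, the two feature
        # vectors are concatenated and classified jointly.
        self.classifier = nn.Linear(64 + 64, n_classes)

    def forward(self, raw_audio, log_mel):
        f1 = self.branch1d(raw_audio)  # (batch, 64)
        f2 = self.branch2d(log_mel)    # (batch, 64)
        return self.classifier(torch.cat([f1, f2], dim=1))

model = MergedCNN(n_classes=4)
raw = torch.randn(2, 1, 16000)    # batch of 1-second clips at 16 kHz (assumed)
mel = torch.randn(2, 1, 64, 100)  # 64 mel bands x 100 frames (assumed)
out = model(raw, mel)             # shape (2, 4): one score per emotion class
```

The transfer-learning step in the abstract would then amount to copying pretrained branch weights into this model before fine-tuning, e.g. `model.branch1d.load_state_dict(pretrained_1d_branch.state_dict())` for each branch.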
ISSN: 1751-9675, 1751-9683
DOI: 10.1049/iet-spr.2017.0320