Speaker-aware training of LSTM-RNNs for acoustic modelling

Bibliographic Details
Published in	2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5280-5284
Main Authors	Tian Tan, Yanmin Qian, Dong Yu, Souvik Kundu, Liang Lu, Khe Chai Sim, Xiong Xiao, Yu Zhang
Format	Conference Proceeding Journal Article
Language	English
Published	IEEE 01.03.2016
Subjects	Acoustics; Adaptation models; Architecture; Electronics; i-vector; LSTM-RNNs; Modelling; Neural networks; Recognition; Recurrent neural networks; Representations; speaker adaptation; speaker-aware training; speaking rate; Speech; Training
Summary:	Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) that can model long-term temporal dynamics. Recently it has been shown that LSTM-RNNs can achieve higher recognition accuracy than deep feed-forward neural networks (DNNs) in acoustic modelling. However, speaker adaptation for LSTM-RNN based acoustic models has not been well investigated. In this paper, we study LSTM-RNN speaker-aware training, which incorporates speaker information during model training to normalise speaker variability. We first present several speaker-aware training architectures, and then empirically evaluate three types of speaker representation: i-vectors, bottleneck speaker vectors and speaking rate. Furthermore, to factorise the variability in the acoustic signals caused by speakers and phonemes respectively, we investigate speaker-aware and phone-aware joint training under the framework of multi-task learning. On the AMI meeting speech transcription task, speaker-aware training of LSTM-RNNs reduces the word error rate by 6.5% relative over a very strong LSTM-RNN baseline that uses fMLLR features.
ISSN:	2379-190X
DOI:	10.1109/ICASSP.2016.7472685
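
Illustrative note: the summary says that speaker-aware training feeds speaker information into the model during training, but it does not say how the proposed architectures do this. One common scheme, assumed here purely for illustration and not confirmed as the authors' method, is to append a single utterance-level speaker vector (for example an i-vector) to every acoustic frame before it reaches the network. The function name and the feature sizes below are hypothetical.

```python
from typing import List, Sequence


def append_speaker_vector(
    frames: Sequence[Sequence[float]],
    speaker_vector: Sequence[float],
) -> List[List[float]]:
    """Append the same speaker vector to every acoustic frame.

    This is an assumed input-concatenation scheme, shown only as a sketch;
    the record above does not specify the paper's actual architectures.
    """
    speaker = list(speaker_vector)
    # Each output frame is the original features followed by the speaker vector.
    return [list(frame) + speaker for frame in frames]


# Hypothetical sizes: 5 frames of 40-dim features, 100-dim speaker vector.
frames = [[0.0] * 40 for _ in range(5)]
ivector = [1.0] * 100
augmented = append_speaker_vector(frames, ivector)
```

Each augmented frame keeps its acoustic features in the leading positions and carries the speaker vector in the trailing positions, so the number of frames is unchanged and only the input dimension grows.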