Learning acoustic frame labeling for speech recognition with recurrent neural networks

Bibliographic Details
Published in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4280 - 4284
Main Authors: Sak, Haşim; Senior, Andrew; Rao, Kanishka; Irsoy, Ozan; Graves, Alex; Beaufays, Françoise; Schalkwyk, Johan
Format: Conference Proceeding
Language: English
Published: IEEE, 01.04.2015
ISSN: 1520-6149
DOI: 10.1109/ICASSP.2015.7178778

Summary: We explore alternative acoustic modeling techniques for large vocabulary speech recognition using Long Short-Term Memory recurrent neural networks. For an acoustic frame labeling task, we compare the conventional approach of cross-entropy (CE) training using fixed forced-alignments of frames and labels, with the Connectionist Temporal Classification (CTC) method proposed for labeling unsegmented sequence data. We demonstrate that the latter can be implemented with finite state transducers. We experiment with phones and context dependent HMM states as acoustic modeling units. We also investigate the effect of context in acoustic input by training unidirectional and bidirectional LSTM RNN models. We show that a bidirectional LSTM RNN CTC model using phone units can perform as well as an LSTM RNN model trained with CE using HMM state alignments. Finally, we also show the effect of sequence discriminative training on these models and show the first results for sMBR training of CTC models.
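The CTC objective compared in the summary trains the network without fixed frame alignments: it sums the probability of every frame-level path that collapses (via an extra blank symbol and repeat-merging) to the target label sequence. The following is a minimal pure-Python sketch of the CTC forward (alpha) recursion that computes that alignment sum in log space; the function name and tiny example are illustrative, and this dynamic-programming form is a sketch of the standard algorithm, not the FST-based implementation the paper describes.

```python
import math

NEG_INF = float("-inf")

def logadd(a, b):
    """log(exp(a) + exp(b)), stable in log space."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_forward_log_prob(log_probs, labels, blank=0):
    """CTC forward recursion.

    log_probs: T x V table of per-frame log posteriors from the network.
    labels: target label sequence (no blanks).
    Returns log P(labels | inputs), summed over all valid alignments.
    """
    # Extended sequence with blanks interleaved: b, l1, b, l2, ..., b
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(log_probs)

    # alpha[s]: log prob of all prefixes ending at ext[s] after frame t
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                      # stay on same symbol
            if s > 0:
                a = logadd(a, alpha[s - 1])   # advance one symbol
            # Skip over a blank only between two distinct labels
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid complete paths end on the final label or the trailing blank
    return logadd(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)
```

As a toy check with two frames, a vocabulary of {blank, phone 1}, and uniform 0.5 posteriors, the alignments collapsing to [1] are (1,1), (blank,1), and (1,blank), so the total probability is 0.75; the recursion returns log(0.75). In training, the gradient of this quantity with respect to the frame posteriors replaces the fixed forced-alignment targets used in CE training.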