I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription

Bibliographic Details
Published in	2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 6334 - 6338
Main Authors	Gupta, Vishwa; Kenny, Patrick; Ouellet, Pierre; Stafylakis, Themos
Format	Conference Proceeding
Language	English
Published	IEEE 01.05.2014
Subjects	Acoustics; Deep Neural Networks; Hidden Markov models; HMM; i-vectors; speaker adaptation; Speech; Speech recognition; Training; Transforms; Vectors
Summary:	State-of-the-art speaker recognition systems are based on the i-vector representation of speech segments. In this paper we show how this representation can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system, and we report excellent results on a French-language audio transcription task. The implementation is very simple. An audio file is first diarized, and each speaker cluster is represented by an i-vector. Acoustic feature vectors are augmented by the corresponding i-vectors before being presented to the DNN. (The same i-vector is used for all acoustic feature vectors aligned with a given speaker.) This supplementary information improves the DNN's ability to discriminate between phonetic events in a speaker-independent way, without requiring any modification to the DNN training algorithms. We report results on the ETAPE 2011 transcription task and show that i-vector-based speaker adaptation is effective irrespective of whether cross-entropy or sequence training is used. For cross-entropy training, the word error rate (WER) drops from 22.16% to 20.67%, whereas for sequence training it drops from 19.93% to 18.40%.
ISSN:	1520-6149 2379-190X
DOI:	10.1109/ICASSP.2014.6854823
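
The augmentation step described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `augment_features`, the speaker labels, and the toy dimensions (3-dimensional acoustic frames, 2-dimensional i-vectors) are all assumptions made for illustration. Diarization and i-vector extraction are assumed to have been done already.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def augment_features(
    frames: Sequence[Vector],
    frame_speakers: Sequence[str],
    speaker_ivectors: Dict[str, Vector],
) -> List[Vector]:
    """Append each frame's speaker i-vector to its acoustic feature vector.

    Hypothetical helper: frame_speakers holds the diarization label of each
    frame, and speaker_ivectors maps each label to that cluster's i-vector.
    """
    if len(frames) != len(frame_speakers):
        raise ValueError("need exactly one speaker label per frame")
    augmented = []
    for frame, speaker in zip(frames, frame_speakers):
        # The same i-vector is reused for every frame of a given speaker.
        augmented.append(list(frame) + list(speaker_ivectors[speaker]))
    return augmented


# Toy data (assumed values, far smaller than real features or i-vectors).
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
frame_speakers = ["spk_a", "spk_a", "spk_b"]
speaker_ivectors = {"spk_a": [1.0, -1.0], "spk_b": [0.5, 2.0]}

# Each augmented vector would then be presented to the DNN as its input.
dnn_inputs = augment_features(frames, frame_speakers, speaker_ivectors)
```

Because the speaker information arrives only as extra input dimensions, the DNN training procedure itself is unchanged in this sketch, which matches the summary's point that no modification to the training algorithms is needed.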