Tts4pretrain 2.0: Advancing the use of Text and Speech in ASR Pretraining with Consistency and Contrastive Losses

Bibliographic Details
Published in	ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7677-7681
Main Authors	Chen, Zhehuai; Zhang, Yu; Rosenberg, Andrew; Ramabhadran, Bhuvana; Moreno, Pedro; Wang, Gary
Format	Conference Proceeding
Language	English
Published	IEEE 23.05.2022
Subjects	Consistency Regularization; Error analysis; Limiting; Production; Self-supervised; Speech recognition; Speech Synthesis; Switches; Training; Training data
Summary:	Tts4pretrain [1] introduced an effective way to learn representations from untranscribed speech and unspoken text, using linguistic/lexical representations derived from synthesized speech. However, the representations learned from synthesized and real speech are likely to differ, potentially limiting the improvements from incorporating unspoken text. In this paper, we introduce learning from supervised speech earlier in the training process, with consistency-based regularization between real and synthesized speech. This allows for better learning of shared speech and text representations. To this end, we introduce a new objective with encoder and decoder consistency and contrastive regularization between real and synthesized speech derived from the labeled corpora during the pretraining stage. We show that the new objective leads to more similar representations derived from speech and text, which helps downstream ASR. The proposed pretraining method yields relative Word Error Rate (WER) reductions of 7-21% on six public corpora (Librispeech, AMI, TEDLIUM, Common Voice, Switchboard, and CHiME-6) over a state-of-the-art baseline pretrained with wav2vec2.0, and of 2-17% over the previously proposed tts4pretrain. The proposed method outperforms the supervised SpeechStew by up to 17%. Moreover, we show that the proposed method also yields WER reductions on larger data sets by evaluating on a large-resource, in-house Voice Search task and on streaming ASR.
ISSN:	2379-190X
DOI:	10.1109/ICASSP43922.2022.9746475
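
Note on the metric: the summary reports its gains as relative WER reductions. The sketch below is only an illustration of how such a figure is conventionally computed; it is not code from the paper. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and the example sentences and WER values are made up for illustration, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}")
    print(f"Relative reduction: {relative_wer_reduction(10.0, 9.0):.1%}")
```

Under this definition, a system that lowers WER from 10.0% to 9.0% achieves a 10% relative reduction, even though the absolute change is only one percentage point.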