An Exploration of Hubert with Large Number of Cluster Units and Model Assessment Using Bayesian Information Criterion

Bibliographic Details
Published in	ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7107-7111
Main Authors	Maekaku, Takashi; Chang, Xuankai; Fujita, Yuya; Watanabe, Shinji
Format	Conference Proceeding
Language	English
Published	IEEE 23.05.2022
Subjects	acoustic unit discovery; BIC; Bit error rate; Conferences; HuBERT; Measurement; Phonetics; self-supervised learning; Signal processing; Syntactics; Training; unit-based language model
Summary:	Self-supervised learning (SSL) has become one of the most important technologies for realizing spoken dialogue systems in languages for which little audio data and few transcriptions are available. Speech representation models are one of the keys to achieving this and have been actively studied in recent years. Among them, Hidden-Unit BERT (HuBERT) has shown promising results in automatic speech recognition (ASR) tasks. However, previous studies have investigated it with only limited numbers of iterations and cluster units. We explore HuBERT with larger numbers of clusters and iterations in order to obtain better speech representations. Furthermore, we introduce the Bayesian Information Criterion (BIC) as a performance measure for the model. Experimental results show that our model achieves the best performance in 5 out of 8 scores across the 4 metrics of the Zero Resource Speech 2021 task. It also outperforms the HuBERT BASE model trained on 960 hours of LibriSpeech (LS), even though our model is trained on only 100 hours of LS. In addition, we report that BIC is a useful clue for determining the appropriate number of clusters to improve performance on phonetic, lexical, and syntactic metrics. Finally, we show that these findings also hold for the ASR task.
ISSN:	2379-190X
DOI:	10.1109/ICASSP43922.2022.9746097
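
Note on BIC: the summary describes using the Bayesian Information Criterion as a clue for choosing the number of clusters. As a rough illustration only, the sketch below uses the general textbook definition, BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of samples, and L the maximized likelihood. The record does not state how the authors compute the likelihood or count parameters for HuBERT cluster units, so the function names `bic` and `select_by_bic` and the shape of the `candidates` mapping are assumptions made for this example, not the paper's method.

```python
import math


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Textbook BIC: k * ln(n) - 2 * ln(L_hat). Lower values are preferred."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def select_by_bic(candidates: dict, n_samples: int) -> int:
    """Return the cluster count with the lowest BIC.

    `candidates` (hypothetical layout) maps a cluster count to a tuple
    (maximized log-likelihood, number of free parameters) that the caller
    has already obtained from a fitted clustering model.
    """
    scores = {
        n_clusters: bic(log_lik, n_params, n_samples)
        for n_clusters, (log_lik, n_params) in candidates.items()
    }
    # The penalty term k * ln(n) grows with model size, so a larger model
    # wins only if its likelihood gain outweighs the added parameters.
    return min(scores, key=scores.get)
```

The caller supplies the log-likelihood and parameter count for each candidate cluster count; how those are derived from a clustering of speech features is left open here because the record does not specify it.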