Fully Unsupervised Topic Clustering of Unlabelled Spoken Audio Using Self-Supervised Representation Learning and Topic Model


Bibliographic Details
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5
Main Authors: Maekaku, Takashi; Fujita, Yuya; Chang, Xuankai; Watanabe, Shinji
Format: Conference Proceeding
Language: English
Published: IEEE, 04.06.2023
Summary: Unsupervised topic clustering of spoken audio is an important research topic for zero-resourced unwritten languages. A classical approach is to find a set of spoken terms from the audio alone, based on dynamic time warping or generative modeling (e.g., hidden Markov models), and then apply a topic model to classify topics. Spoken term discovery is the most important and difficult part. In this paper, we propose to combine self-supervised representation learning (SSRL) methods, used as a component of spoken term discovery, with probabilistic topic models. Most SSRL methods pre-train a model that predicts high-quality pseudo labels generated from an audio-only corpus. These pseudo labels can be used to produce a sequence of pseudo subwords by applying deduplication and a subword model. Then, we apply a topic model based on latent Dirichlet allocation to these pseudo-subword sequences in an unsupervised manner. Clustering performance is evaluated on the Fisher corpus using normalized mutual information. We confirm the improvement and effectiveness of the proposed method compared to an existing approach using dynamic time warping and topic models, although the experimental setups are not directly comparable.
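
The pipeline described in the summary (deduplicate SSRL pseudo labels, form pseudo-subword documents, fit an LDA topic model, score against reference topics with normalized mutual information) can be illustrated with a minimal sketch. The code below is not the authors' implementation: the pseudo labels and reference topics are synthetic placeholders standing in for SSRL units and Fisher corpus topic labels, and a naive unit-to-token mapping stands in for the learned subword model used in the paper.

# Minimal sketch of the described pipeline, using synthetic data in place of
# SSRL pseudo labels (e.g., HuBERT-style units) and Fisher topic annotations.
import numpy as np
from itertools import groupby
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 utterances, each a frame-level sequence of pseudo
# labels drawn from 50 discrete units, with a weak topic-dependent bias.
n_utts, n_units, n_topics = 200, 50, 4
true_topics = rng.integers(n_topics, size=n_utts)
utterances = []
for t in true_topics:
    bias = np.ones(n_units)
    bias[t * 10:(t + 1) * 10] = 5.0          # units favored by this "topic"
    utterances.append(rng.choice(n_units, size=300, p=bias / bias.sum()))

# Step 1: deduplication -- collapse consecutive repeats of the same pseudo label.
def deduplicate(seq):
    return [k for k, _ in groupby(seq)]

# Step 2: map deduplicated units to token strings (a subword model would go here).
docs = [" ".join(f"u{u}" for u in deduplicate(seq)) for seq in utterances]

# Step 3: bag-of-pseudo-subwords representation and LDA topic model.
counts = CountVectorizer(token_pattern=r"\S+").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
pred_topics = lda.fit_transform(counts).argmax(axis=1)

# Step 4: evaluate clustering quality with normalized mutual information.
print("NMI:", normalized_mutual_info_score(true_topics, pred_topics))

In this sketch each utterance is assigned to its highest-probability LDA topic; the paper's actual subword modeling and evaluation setup on the Fisher corpus are more involved.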
ISSN: 2379-190X
DOI: 10.1109/ICASSP49357.2023.10095280