Towards A Common Speech Analysis Engine

Bibliographic Details
Published in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8162-8166
Main Authors: Aronowitz, Hagai; Gat, Itai; Morais, Edmilson; Zhu, Weizhong; Hoory, Ron
Format: Conference Proceeding
Language: English
Published: IEEE, 23.05.2022

Summary: Recent innovations in self-supervised representation learning have led to remarkable advances in natural language processing. In the speech processing domain, however, systems based on self-supervised representation learning are not yet considered state-of-the-art. We propose leveraging recent advances in self-supervised speech processing to create a common speech analysis engine. Such an engine should handle multiple speech processing tasks using a single architecture while obtaining state-of-the-art accuracy. It must also support new tasks with small training datasets and, beyond that, allow distributed training with clients' in-house private data. We present an architecture for a common speech analysis engine based on the HuBERT self-supervised speech representation. We report experimental results for language identification and emotion recognition on the standard NIST-LRE 07 and IEMOCAP evaluations, surpassing the state-of-the-art performance reported so far on these tasks. We also analyze the engine on the emotion recognition task with reduced amounts of training data and show how to achieve improved results.
ISSN: 2379-190X
DOI: 10.1109/ICASSP43922.2022.9747756
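The record above only summarizes the approach and does not include code. As a rough illustration of the recipe the abstract describes (a single pretrained HuBERT encoder shared across tasks, with a lightweight task-specific classifier on top), here is a minimal sketch using the Hugging Face transformers library. The checkpoint name, mean pooling, and linear head are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the paper's code): shared HuBERT encoder plus a small
# task-specific head, as one might set up for emotion recognition or language ID.
# Checkpoint, pooling, and head design are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import HubertModel


class SpeechTaskClassifier(nn.Module):
    """Pretrained HuBERT encoder with a per-task linear classification head."""

    def __init__(self, num_classes: int, checkpoint: str = "facebook/hubert-base-ls960"):
        super().__init__()
        self.encoder = HubertModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # input_values: raw 16 kHz waveform, shape (batch, samples)
        frames = self.encoder(input_values).last_hidden_state  # (batch, T, hidden)
        pooled = frames.mean(dim=1)                            # utterance-level embedding
        return self.head(pooled)                               # task logits


if __name__ == "__main__":
    # Toy usage: 4 emotion classes, one 2-second dummy utterance at 16 kHz.
    model = SpeechTaskClassifier(num_classes=4)
    waveform = torch.randn(1, 32000)
    logits = model(waveform)
    print(logits.shape)  # torch.Size([1, 4])

In practice the same encoder would be fine-tuned separately (or jointly) for each downstream task, swapping only the head; how the paper handles fine-tuning, pooling, and low-resource or distributed training is detailed in the publication itself.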