Joint streaming model for backchannel prediction and automatic speech recognition

Bibliographic Details
Published in: ETRI Journal, Vol. 46, no. 1, pp. 118-126
Main Authors: Yong-Seok Choi, Jeong-Uk Bang, Seung Hi Kim
Format: Journal Article
Language: Korean
Published: 2024
Summary: In human conversations, listeners often produce brief backchannels such as "uh-huh" or "yeah." Timely backchannels are crucial for mutual understanding and for building trust between conversational partners. In human-machine conversation systems, users can engage in natural conversations when a conversational agent generates backchannels like a human listener. We propose a method that simultaneously predicts backchannels and recognizes speech in real time. We use a streaming transformer and adopt multitask learning for concurrent backchannel prediction and speech recognition. The experimental results demonstrate the superior performance of our method compared with previous works, while maintaining speech recognition performance similar to that of a single-task model. Owing to the extremely imbalanced training data distribution, the single-task backchannel prediction model fails to predict any of the backchannel categories, whereas the proposed multitask approach substantially enhances backchannel prediction performance. Notably, in the streaming prediction scenario, backchannel prediction performance improves by up to 18.7% compared with existing methods.
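The multitask setup described in the summary, where a shared streaming encoder serves both speech recognition and backchannel prediction, is typically trained with a weighted combination of the two task losses. The sketch below illustrates that idea only; the function name, the loss values, and the weighting factor `alpha` are assumptions for illustration and are not taken from the paper.

```python
def multitask_loss(asr_loss: float, backchannel_loss: float, alpha: float = 0.5) -> float:
    """Combine the ASR loss and the backchannel-prediction loss.

    alpha weights the ASR objective; (1 - alpha) weights the
    backchannel objective. Both heads share the same encoder, so
    gradients from both tasks update the shared parameters.
    This is an illustrative sketch, not the paper's exact objective.
    """
    return alpha * asr_loss + (1.0 - alpha) * backchannel_loss


# Example: equal weighting of the two per-batch task losses.
combined = multitask_loss(asr_loss=2.0, backchannel_loss=0.5, alpha=0.5)
```

In practice, `alpha` is a hyperparameter; tuning it trades off recognition accuracy against backchannel prediction, and the joint objective is what lets the severely imbalanced backchannel task benefit from the ASR signal.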
Bibliography: KISTI1.1003/JNL.JAKO202450348465478
ISSN: 1225-6463, 2233-7326