Joint streaming model for backchannel prediction and automatic speech recognition

Bibliographic Details
Published in: ETRI Journal, Vol. 46, no. 1, pp. 118-126
Main Authors: Yong-Seok Choi, Jeong-Uk Bang, Seung Hi Kim
Format: Journal Article
Language: Korean
Published: 2024
Summary: In human conversations, listeners often produce brief backchannels such as "uh-huh" or "yeah." Timely backchannels are crucial for mutual understanding and for building trust between conversational partners. In human-machine conversation systems, users can engage in natural conversations when a conversational agent generates backchannels like a human listener. We propose a method that simultaneously predicts backchannels and recognizes speech in real time. We use a streaming transformer and adopt multitask learning for concurrent backchannel prediction and speech recognition. The experimental results demonstrate the superior performance of our method compared with previous works, while maintaining speech recognition performance similar to that of a single-task model. Owing to the extremely imbalanced training data distribution, the single-task backchannel prediction model fails to predict any of the backchannel categories, whereas the proposed multitask approach substantially enhances backchannel prediction performance. Notably, in the streaming prediction scenario, backchannel prediction performance improves by up to 18.7% compared with existing methods.
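The multitask setup described in the summary, where a shared streaming encoder serves both speech recognition and backchannel prediction, is typically trained with a weighted combination of the two task losses. The sketch below illustrates that idea only; the function name, the loss values, and the weighting factor `alpha` are assumptions for illustration and are not taken from the paper.

```python
def multitask_loss(asr_loss: float, backchannel_loss: float, alpha: float = 0.5) -> float:
    """Combine the ASR loss and the backchannel-prediction loss.

    alpha weights the ASR objective; (1 - alpha) weights the
    backchannel objective. Both heads share the same encoder, so
    gradients from both tasks update the shared parameters.
    This is an illustrative sketch, not the paper's exact objective.
    """
    return alpha * asr_loss + (1.0 - alpha) * backchannel_loss


# Example: equal weighting of the two per-batch task losses.
combined = multitask_loss(asr_loss=2.0, backchannel_loss=0.5, alpha=0.5)
```

In practice, `alpha` is a hyperparameter; tuning it trades off recognition accuracy against backchannel prediction, and the joint objective is what lets the severely imbalanced backchannel task benefit from the ASR signal.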
Bibliography: KISTI1.1003/JNL.JAKO202450348465478
ISSN: 1225-6463, 2233-7326