PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System

Bibliographic Details
Main Authors	Lyu, Xiang; Cao, Yuhang; Wang, Qing; Yin, Jingjing; Yang, Yuguang; Zou, Pengpeng; Hu, Yanni; Lu, Heng
Format	Journal Article
Language	English
Published	28.09.2023
Subjects	Computer Science - Sound

Summary:	Speaker-attributed automatic speech recognition (SA-ASR) improves the accuracy and applicability of multi-speaker ASR systems in real-world scenarios by assigning speaker labels to transcribed texts. However, SA-ASR poses unique challenges due to factors such as speaker overlap, speaker variability, background noise, and reverberation. In this study, we propose the PP-MeT system, a real-world personalized prompt based meeting transcription system, which consists of a clustering system, target-speaker voice activity detection (TS-VAD), and target-speaker ASR (TS-ASR). Specifically, we use the target-speaker embedding as a prompt in the TS-VAD and TS-ASR modules of the proposed system. In contrast with previous systems, we fully leverage pre-trained models for system initialization, which gives our approach better generalizability and precision. Experiments on the M2MeT2.0 Challenge dataset show that our system achieves a cp-CER of 11.27% on the test set, ranking first in both the fixed and open training conditions.
DOI:	10.48550/arxiv.2309.16247
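
Note on the metric: the summary reports results as cp-CER, the concatenated minimum-permutation character error rate. The sketch below illustrates the general idea only: each speaker's utterances are concatenated into one transcript, every assignment of hypothesis speakers to reference speakers is tried, and the assignment with the fewest character errors is scored. The function names `edit_distance` and `cp_cer` are hypothetical, the inputs are assumed to be already concatenated per speaker with at least one non-empty reference, and the challenge's official scoring tool may apply text normalization that is not shown here.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cp_cer(refs: list[str], hyps: list[str]) -> float:
    """Illustrative cp-CER: refs and hyps hold one concatenated transcript per speaker.

    Assumes at least one reference transcript is non-empty.
    """
    # Pad the shorter side with empty transcripts so speaker counts match.
    n = max(len(refs), len(hyps))
    refs = refs + [""] * (n - len(refs))
    hyps = hyps + [""] * (n - len(hyps))

    total_ref_chars = sum(len(r) for r in refs)

    # Try every speaker assignment and keep the one with the fewest errors.
    # This is factorial in the number of speakers, fine for small meetings.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars
```

For example, `cp_cer(["abcd", "xyz"], ["xyz", "abed"])` scores the swapped speaker assignment, because that pairing leaves a single substitution across seven reference characters.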