PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System
Main Authors | , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 28.09.2023 |
Subjects | |
Summary: | Speaker-attributed automatic speech recognition (SA-ASR) improves the
accuracy and applicability of multi-speaker ASR systems in real-world scenarios
by assigning speaker labels to transcribed text. However, SA-ASR poses unique
challenges due to factors such as speaker overlap, speaker variability,
background noise, and reverberation. In this study, we propose the PP-MeT system,
a real-world personalized prompt based meeting transcription system, which
consists of a clustering system, target-speaker voice activity detection
(TS-VAD), and target-speaker ASR (TS-ASR). Specifically, we use target-speaker
embeddings as prompts in the TS-VAD and TS-ASR modules of the proposed system.
In contrast with previous systems, we fully leverage pre-trained models for
system initialization, which gives our approach greater generalizability
and precision. Experiments on the M2MeT2.0 Challenge dataset show that our
system achieves a concatenated minimum-permutation character error rate
(cp-CER) of 11.27% on the test set, ranking first in both the fixed and
open training conditions. |
---|---|
DOI: | 10.48550/arxiv.2309.16247 |
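The 11.27% figure above is a cp-CER, the concatenated minimum-permutation character error rate used to score speaker-attributed transcription in the M2MeT2.0 challenge. The sketch below shows, under stated assumptions, how such a metric can be computed: each speaker's utterances are concatenated into one stream, hypothesis speakers are matched to reference speakers by the permutation that minimizes total character edits, and that minimum is divided by the number of reference characters. It is not the official challenge scoring script. The function names `edit_distance` and `cp_cer` are hypothetical, and the code assumes utterances are already in time order and text is already normalized.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion of a reference character
                curr[j - 1] + 1,         # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution (or match when equal)
            )
        prev = curr
    return prev[-1]


def cp_cer(ref_by_spk: dict[str, list[str]], hyp_by_spk: dict[str, list[str]]) -> float:
    """Hypothetical cp-CER sketch: concatenate per speaker, take the best permutation."""
    # Assumption: each speaker's utterances are already sorted by start time.
    refs = ["".join(utts) for utts in ref_by_spk.values()]
    hyps = ["".join(utts) for utts in hyp_by_spk.values()]
    # Pad with empty streams so both sides have the same number of speakers;
    # extra hypothesis speakers then count as insertions, missing ones as deletions.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))
    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference contains no characters")
    # Brute-force search over speaker assignments (fine for a handful of speakers).
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars


reference = {"A": ["hello", "world"], "B": ["yes"]}
hypothesis = {"spk1": ["yes"], "spk2": ["helloword"]}
print(cp_cer(reference, hypothesis))  # one deleted character out of 13
```

Because the total error is a sum of independent per-pair edit distances, the brute-force permutation search could be replaced by a linear assignment (Hungarian) solver when the speaker count grows; for meetings with only a few participants the factorial search is simpler and gives the same result.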