BASPRO: A Balanced Script Producer for Speech Corpus Collection Based on the Genetic Algorithm

Bibliographic Details
Published in: APSIPA transactions on signal and information processing, Vol. 12, no. 3
Main Authors: Chen, Yu-Wen; Wang, Hsin-Min; Tsao, Yu
Format: Journal Article
Language: English
Published: Boston, Delft: Now Publishers, 01.01.2023
Summary: The performance of speech-processing models is heavily influenced by the speech corpus used for training and evaluation. In this study, we propose the BAlanced Script PROducer (BASPRO) system, which automatically constructs a phonetically balanced and rich set of Chinese sentences for collecting Mandarin Chinese speech data. First, we used pretrained natural language processing systems to extract ten-character candidate sentences from a large corpus of Chinese news texts. Then, we applied a genetic algorithm-based method to select 20 phonetically balanced sentence sets, each containing 20 sentences, from the candidate sentences. Using BASPRO, we obtained a recording script called TMNews, which contains 400 ten-character sentences. TMNews covers 84% of the syllables used in the real world, and its syllable distribution has a cosine similarity of 0.96 to the real-world syllable distribution. We converted the script into a speech corpus using two text-to-speech systems. Using the designed speech corpus, we tested the performance of speech enhancement (SE) and automatic speech recognition (ASR), which are among the most important regression-based and classification-based speech-processing tasks, respectively. The experimental results show that the SE and ASR models trained on the designed speech corpus outperform their counterparts trained on a randomly composed speech corpus.
Bibliography: genetic algorithm
Mandarin Chinese speech corpus
recording script design
phonetically balanced and rich corpus
SIP-2022-0055
Now Publishers
Corpus design
ISSN: 2048-7703
DOI: 10.1561/116.00000155
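
The summary reports two script-quality measures (syllable coverage and cosine similarity to a real-world syllable distribution) and a genetic-algorithm selection step. The Python sketch below illustrates how such measures and a simple selection loop could be written. It is not the authors' implementation: the function names, the fitness definition (cosine similarity only), the crossover and mutation operators, and all parameter defaults are assumptions made for illustration, and it selects a single sentence set rather than the 20 sets described in the summary. Sentences are assumed to be already converted to lists of syllable strings.

```python
import math
import random
from collections import Counter


def syllable_distribution(sentences):
    """Relative frequency of each syllable over a list of syllable sequences."""
    counts = Counter(syl for sent in sentences for syl in sent)
    total = sum(counts.values())
    return {syl: n / total for syl, n in counts.items()}


def coverage(dist, reference):
    """Fraction of the reference syllables that also occur in dist."""
    return sum(1 for syl in reference if syl in dist) / len(reference)


def cosine_similarity(p, q):
    """Cosine similarity between two sparse distributions stored as dicts."""
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in set(p) | set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def fitness(indices, candidates, reference):
    """Assumed fitness: similarity of the selected set's distribution to the reference."""
    selected = [candidates[i] for i in indices]
    return cosine_similarity(syllable_distribution(selected), reference)


def select_sentences(candidates, reference, k, pop_size=30,
                     generations=50, mutation_rate=0.2, seed=0):
    """Hypothetical GA: evolve k-sentence index sets toward the reference distribution."""
    rng = random.Random(seed)
    n = len(candidates)

    def score(individual):
        return fitness(individual, candidates, reference)

    # Each individual is a list of k distinct candidate indices.
    population = [rng.sample(range(n), k) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]   # keep the better half as parents
        children = [population[0]]              # elitism: carry over the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw k distinct indices from the union of both parents.
            child = rng.sample(list(set(a) | set(b)), k)
            # Mutation: swap one index for a sentence not yet in the child.
            if rng.random() < mutation_rate:
                unused = [i for i in range(n) if i not in child]
                if unused:
                    child[rng.randrange(k)] = rng.choice(unused)
            children.append(child)
        population = children
    return max(population, key=score)
```

With a list of syllabified candidate sentences and a reference distribution, `select_sentences(candidates, reference, k=20)` returns the indices of one selected set; `coverage` and `cosine_similarity` can then be applied to that set's distribution to obtain figures of the kind quoted in the summary.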