A two-level Item Response Theory model to evaluate speech synthesis and recognition

Bibliographic Details
Published in Speech Communication Vol. 137; pp. 19-34
Main Authors Oliveira, Chaina S., Moraes, João V.C., Filho, Telmo Silva, Prudêncio, Ricardo B.C.
Format Journal Article
Language English
Published Amsterdam Elsevier B.V. 01.02.2022
Elsevier Science Ltd
Summary: Automatic speech recognition (ASR) systems should ideally be tested on diverse speech data. A promising way to produce such test data is to synthesize speeches from diverse sentences and speakers. However, although a large amount of test data can be produced this way, not all speeches are equally relevant. This paper proposes a two-level Item Response Theory (IRT) model to simultaneously evaluate ASR systems, speakers and sentences. In the first level, the transcription rates obtained by a pool of ASR systems on a set of synthesized speeches are recorded and analyzed to estimate each speech’s difficulty and each ASR system’s ability. In the second level, each speech’s difficulty is decomposed as a function of two factors: the sentence’s difficulty and the speaker’s quality. Thus, a speech’s difficulty is high when it is generated from a difficult sentence by a bad speaker, while an ASR system is good when it is robust to hard speeches. The experiments revealed useful insights into how the quality of speech synthesis and recognition can be affected by distinct factors (e.g., sentence difficulty and speaker ability).
• An original solution for simultaneously evaluating speech synthesis and recognition using Item Response Theory.
• The difficulty of a synthesized speech depends on the performance of automatic speech recognition systems with different abilities when transcribing it.
• Specific sentences may have a more significant influence on the synthesis quality than the speakers’ abilities.
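The two levels described in the summary can be illustrated with a minimal sketch. This record does not give the paper's actual equations, so everything below is an assumption made for illustration: the logistic (Rasch-style) response curve, the additive decomposition of speech difficulty, the function names `expected_rate` and `speech_difficulty`, and all numeric values are hypothetical.

```python
import math


def expected_rate(ability, difficulty):
    """First level (assumed form): expected transcription rate of an ASR
    system on a speech, as a logistic function of ability minus difficulty.
    This is a generic Rasch-style curve, not the formula from the paper."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def speech_difficulty(sentence_difficulty, speaker_quality):
    """Second level (assumed additive form): a harder sentence raises the
    speech's difficulty, a better speaker lowers it."""
    return sentence_difficulty - speaker_quality


# Hypothetical parameter values, chosen only to show the qualitative behavior.
hard_speech = speech_difficulty(sentence_difficulty=1.5, speaker_quality=-1.0)
easy_speech = speech_difficulty(sentence_difficulty=-0.5, speaker_quality=1.0)

strong_asr, weak_asr = 2.0, 0.0

for name, ability in [("strong", strong_asr), ("weak", weak_asr)]:
    print(
        name,
        round(expected_rate(ability, easy_speech), 3),
        round(expected_rate(ability, hard_speech), 3),
    )
```

Under these assumptions, a speech built from a difficult sentence and a poor speaker gets a high difficulty, and the stronger system loses less of its expected transcription rate on it than the weaker one, which matches the qualitative reading given in the summary.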
ISSN: 0167-6393, 1872-7182
DOI: 10.1016/j.specom.2021.11.002