A two-level Item Response Theory model to evaluate speech synthesis and recognition

Bibliographic Details
Published in	Speech Communication, Vol. 137, pp. 19-34
Main Authors	Oliveira, Chaina S.; Moraes, João V.C.; Filho, Telmo Silva; Prudêncio, Ricardo B.C.
Format	Journal Article
Language	English
Published	Amsterdam: Elsevier B.V., 01.02.2022; Elsevier Science Ltd
Subjects	Automatic speech recognition; Item Response Theory; Sentences; Speech quality measurement; Speech recognition; Speech recognition evaluation; Speech synthesis; Speech synthesis evaluation; Speech tests; Speeches; Synthesis; Systems analysis; Transcription; Voice recognition
Summary:	Automatic speech recognition (ASR) systems should ideally be tested on diverse speech data. A promising way to produce such test data is to synthesize speech from diverse sentences and speakers. However, although large amounts of test data can be produced this way, not all synthesized speeches are equally relevant. This paper proposes a two-level Item Response Theory (IRT) model to simultaneously evaluate ASR systems, speakers and sentences. In the first level, the transcription rates obtained by a pool of ASR systems on a set of synthesized speeches are recorded and analyzed to estimate each speech’s difficulty and each ASR system’s ability. In the second level, each speech’s difficulty is decomposed as a function of two factors: the sentence’s difficulty and the speaker’s quality. Thus, a speech’s difficulty is high when it is generated from a difficult sentence by a bad speaker, while an ASR system is good when it is robust to hard speeches. The experiments revealed useful insights into how the quality of speech synthesis and recognition can be affected by distinct factors (e.g., sentence difficulty and speaker ability).
• An original solution for simultaneously evaluating speech synthesis and recognition using Item Response Theory.
• The difficulty of a synthesized speech depends on the performance of automatic speech recognition systems with different abilities when transcribing it.
• Specific sentences may have a more significant influence on the synthesis quality than the speakers’ abilities.
ISSN:	0167-6393 1872-7182
DOI:	10.1016/j.specom.2021.11.002
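
The two-level structure described in the summary can be illustrated with a small forward model. This is only a sketch under stated assumptions, not the paper's formulation: the record does not give the model's equations, so the logistic (Rasch-style) response curve, the additive decomposition of speech difficulty, and all names and numeric values below are hypothetical choices made for illustration.

```python
import math


def expected_rate(ability, difficulty):
    """First level (assumed logistic form): expected transcription rate
    in (0, 1) of an ASR system with the given ability on a speech with
    the given difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def speech_difficulty(sentence_difficulty, speaker_quality):
    """Second level (assumed additive form): a harder sentence raises the
    speech's difficulty, a better speaker lowers it."""
    return sentence_difficulty - speaker_quality


# Hypothetical parameter values, chosen only to show the qualitative behavior.
asr_ability = {"asr_strong": 1.0, "asr_weak": -0.5}
sentence_diff = {"hard_sentence": 0.8, "easy_sentence": -0.3}
speaker_qual = {"good_speaker": 1.2, "bad_speaker": -0.9}

# One synthesized speech per (sentence, speaker) pair.
speeches = {
    (sent, spk): speech_difficulty(sentence_diff[sent], speaker_qual[spk])
    for sent in sentence_diff
    for spk in speaker_qual
}

# Expected transcription rate of every ASR system on every speech.
rates = {
    (asr, speech): expected_rate(ability, diff)
    for asr, ability in asr_ability.items()
    for speech, diff in speeches.items()
}

if __name__ == "__main__":
    for (asr, (sent, spk)), rate in sorted(rates.items()):
        print(f"{asr:10s} {sent:14s} {spk:12s} {rate:.3f}")
```

Under these assumptions, the hardest speech is the one built from the difficult sentence and the bad speaker, and the stronger ASR system has the higher expected transcription rate on every speech, which matches the qualitative reading given in the summary. Fitting the parameters from observed transcription rates is a separate estimation step that this sketch does not attempt.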