Statistical Testing on ASR Performance via Blockwise Bootstrap
Main Authors | , |
---|---|
Format | Journal Article |
Language | English |
Published | 19.12.2019 |
Summary: A common question raised in automatic speech recognition (ASR) evaluations is how reliable an observed word error rate (WER) improvement between two ASR systems is; statistical hypothesis testing and confidence intervals (CIs) can be used to tell whether the improvement is real or due only to random chance. Bootstrap resampling has been a popular method for such significance analysis because it is intuitive and easy to use. However, it fails on dependent data, which is prevalent in speech: for example, ASR performance on utterances from the same speaker can be correlated. In this paper we present a blockwise bootstrap approach: by dividing the evaluation utterances into nonoverlapping blocks, the method resamples these blocks instead of the original data. We show that the resulting variance estimator of the absolute WER difference between two ASR systems is consistent under mild conditions. We also demonstrate the validity of the blockwise bootstrap method on both synthetic and real-world speech data.
DOI: 10.48550/arxiv.1912.09508
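The resampling scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the per-utterance `(errors_a, errors_b, words)` counts and the grouping of utterances into blocks (e.g. one block per speaker) are assumptions made for the example.

```python
import random

def wer_diff(blocks):
    """Absolute WER difference between systems A and B, pooled over blocks.

    Each block is a list of per-utterance tuples (errors_a, errors_b, words);
    a block would typically hold all utterances of one speaker, so the
    within-speaker dependence stays inside a block.
    """
    errs_a = sum(e_a for blk in blocks for (e_a, _, _) in blk)
    errs_b = sum(e_b for blk in blocks for (_, e_b, _) in blk)
    words = sum(w for blk in blocks for (_, _, w) in blk)
    return errs_a / words - errs_b / words

def blockwise_bootstrap(blocks, n_resamples=10_000, seed=0):
    """Resample whole blocks with replacement and recompute the statistic."""
    rng = random.Random(seed)
    return [
        wer_diff([rng.choice(blocks) for _ in range(len(blocks))])
        for _ in range(n_resamples)
    ]

def confidence_interval(stats, alpha=0.05):
    """Percentile bootstrap confidence interval for the WER difference."""
    s = sorted(stats)
    return s[int(len(s) * alpha / 2)], s[int(len(s) * (1 - alpha / 2)) - 1]
```

If the resulting interval excludes zero, the observed WER difference is unlikely to be due to random chance at the chosen significance level; resampling blocks rather than individual utterances is what keeps the variance estimate valid under within-block correlation.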