Using fine-tuning and min lookahead beam search to improve Whisper
Format | Journal Article |
Language | English |
Published | 19.09.2023 |
Summary: | The performance of Whisper in low-resource languages is still far from perfect. In addition to a lack of training data for low-resource languages, we identify some limitations in the beam search algorithm used in Whisper. To address these issues, we fine-tune Whisper on additional data and propose an improved decoding algorithm. On Vietnamese, fine-tuning Whisper-Tiny with LoRA yields a 38.49-point WER improvement over the zero-shot Whisper-Tiny setting, a further reduction of 1.45 points compared to full-parameter fine-tuning. Additionally, the Filter-Ends and Min Lookahead decoding algorithms reduce WER by 2.26 points on average across a range of languages compared to standard beam search. These results generalise to larger Whisper model sizes. We also prove a theorem that Min Lookahead outperforms the standard beam search algorithm used in Whisper. |
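
The summary only names the decoding change, so the following is a minimal sketch of what a min-lookahead beam search could look like. Since the abstract does not spell out the scoring rule, this assumes each candidate is ranked by its cumulative log-probability plus the worst per-step log-probability seen along a short greedy lookahead; `toy_logprobs`, `min_lookahead_score`, and `beam_search` are hypothetical stand-ins, not the paper's implementation, and a real system would query Whisper's decoder instead of the toy model.

```python
# Hypothetical sketch only: the scoring rule below is an assumption, not the
# algorithm from arXiv:2309.10299. A real implementation would query Whisper's
# decoder instead of `toy_logprobs`.
import math

VOCAB = ["a", "b", "c", "<eos>"]

def toy_logprobs(prefix):
    """Stand-in model: return {token: log-probability} for the next token."""
    n = len(prefix)
    weights = [((i + 1) * (n + 1)) % 7 + 1 for i in range(len(VOCAB))]
    total = sum(weights)
    return {tok: math.log(w / total) for tok, w in zip(VOCAB, weights)}

def min_lookahead_score(prefix, logprob, depth):
    """Assumed ranking: cumulative log-probability plus the minimum (worst)
    per-step log-probability found along a short greedy lookahead."""
    if prefix and prefix[-1] == "<eos>":
        return logprob  # finished hypotheses need no lookahead
    cur, worst = list(prefix), 0.0
    for _ in range(depth):
        dist = toy_logprobs(cur)
        tok, lp = max(dist.items(), key=lambda kv: kv[1])  # greedy step
        worst = min(worst, lp)
        if tok == "<eos>":
            break
        cur.append(tok)
    return logprob + worst

def beam_search(beam_size=2, max_len=6, lookahead=2):
    beams = [([], 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, score))  # carry finished beams over
                continue
            for tok, lp in toy_logprobs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # Rank by the lookahead-adjusted score rather than raw log-probability.
        candidates.sort(key=lambda c: min_lookahead_score(c[0], c[1], lookahead),
                        reverse=True)
        beams = candidates[:beam_size]
    return beams

if __name__ == "__main__":
    for prefix, score in beam_search():
        print(" ".join(prefix), f"{score:.3f}")
```

The intuition behind ranking on the minimum lookahead log-probability is pessimism: a candidate that looks strong now but leads into a low-probability step is penalised, which standard beam search (scoring on the running total alone) cannot see.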
DOI: | 10.48550/arxiv.2309.10299 |