Confidence Calibration for Audio Captioning Models
Format: Journal Article
Language: English
Published: 12.09.2024
Summary: Systems that automatically generate text captions for audio, images and video lack a confidence indicator of the relevance and correctness of the generated sequences. To address this, we build on existing methods of confidence measurement for text by introducing selective pooling of token probabilities, which aligns better with traditional correctness measures than conventional pooling does. Further, we propose directly measuring the similarity between input audio and text in a shared embedding space. To measure self-consistency, we adapt semantic entropy for audio captioning, and find that these two methods align even better than pooling-based metrics with the correctness measure that calculates acoustic similarity between captions. Finally, we explain why temperature scaling of confidences improves calibration.
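The summary contrasts conventional pooling of token probabilities with selective pooling as a confidence score. A minimal sketch of that idea, assuming conventional pooling means averaging all per-token log-probabilities and that selective pooling keeps only the least-confident tokens (an illustrative guess, not the paper's exact definition):

```python
import math

def pooled_confidence(token_logprobs, keep_fraction=1.0):
    """Pool per-token log-probabilities into one sequence-level confidence.

    keep_fraction=1.0 reproduces conventional mean pooling over all
    tokens; a smaller value pools only the lowest-probability tokens
    (a hypothetical form of "selective pooling").
    """
    k = max(1, math.ceil(keep_fraction * len(token_logprobs)))
    selected = sorted(token_logprobs)[:k]  # k least-confident tokens
    return math.exp(sum(selected) / k)     # geometric-mean probability

# Hypothetical per-token log-probabilities for one generated caption
logprobs = [-0.1, -0.2, -2.3, -0.05, -1.6]

conventional = pooled_confidence(logprobs)                      # all tokens
selective = pooled_confidence(logprobs, keep_fraction=0.4)      # 2 weakest tokens
```

Because selective pooling focuses on the weakest tokens, its score is lower (and, the abstract argues, more aligned with correctness) whenever a caption contains a few poorly supported words.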
DOI: 10.48550/arxiv.2409.08489