Confidence Calibration for Audio Captioning Models


Bibliographic Details
Main Authors: Mahfuz, Rehana; Guo, Yinyi; Visser, Erik
Format: Journal Article
Language: English
Published: 12.09.2024

Summary: Systems that automatically generate text captions for audio, images and video lack a confidence indicator of the relevance and correctness of the generated sequences. To address this, we build on existing methods of confidence measurement for text by introducing selective pooling of token probabilities, which aligns better with traditional correctness measures than conventional pooling does. Further, we propose directly measuring the similarity between input audio and text in a shared embedding space. To measure self-consistency, we adapt semantic entropy for audio captioning, and find that these two methods align even better than pooling-based metrics with the correctness measure that calculates acoustic similarity between captions. Finally, we explain why temperature scaling of confidences improves calibration.
DOI: 10.48550/arxiv.2409.08489
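
To illustrate the contrast the abstract draws between conventional and selective pooling, here is a minimal sketch. It assumes per-token log-probabilities from a captioning decoder; the specific selective rule shown (averaging only the k least-confident tokens) is an illustrative assumption, not the paper's exact method:

```python
import math

def sequence_confidence(token_logprobs, pool="mean", k=3):
    """Pool per-token log-probabilities into one sequence-level confidence.

    pool="mean"      : conventional pooling, average over all tokens.
    pool="selective" : hypothetical selective pooling, average only the
                       k least-confident tokens so a few uncertain words
                       dominate the score (an illustrative assumption).
    Returns a probability in (0, 1] (geometric-mean token probability).
    """
    if pool == "mean":
        pooled = sum(token_logprobs) / len(token_logprobs)
    elif pool == "selective":
        worst = sorted(token_logprobs)[:k]  # k lowest log-probs
        pooled = sum(worst) / len(worst)
    else:
        raise ValueError(f"unknown pooling: {pool}")
    return math.exp(pooled)

# A caption with one very uncertain token scores much lower under
# selective pooling than under mean pooling.
logprobs = [-0.1, -0.2, -3.0]
c_mean = sequence_confidence(logprobs, pool="mean")
c_sel = sequence_confidence(logprobs, pool="selective", k=1)
```

Because selective pooling keys on the least-confident tokens, it penalizes captions containing a single implausible word, which is one plausible reason it could track correctness measures more closely than uniform averaging.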