Variational Stacked Local Attention Networks for Diverse Video Captioning


Bibliographic Details
Published in: 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 2493-2502
Main Authors: Deb, Tonmoay; Sadmanee, Akib; Bhaumik, Kishor Kumar; Ali, Amin Ahsan; Amin, M Ashraful; Rahman, A K M Mahbubur
Format: Conference Proceeding
Language: English
Published: IEEE, 01.01.2022

Summary: While describing spatiotemporal events in natural language, video captioning models mostly rely on the encoder's latent visual representation. Recent progress on encoder-decoder models attends to encoder features mainly through linear interaction with the decoder. However, growing model complexity for visual data encourages more explicit feature interaction for fine-grained information, which is currently absent in the video captioning domain. Moreover, feature aggregation methods have been used to unveil richer visual representations, either by concatenation or through a linear layer. Although the feature sets for a video semantically overlap to some extent, these approaches result in objective mismatch and feature redundancy. In addition, diversity in captions is a fundamental component of expressing one event from several meaningful perspectives, and it is currently missing in the temporal, i.e., video, captioning domain. To this end, we propose the Variational Stacked Local Attention Network (VSLAN), which exploits low-rank bilinear pooling for self-attentive feature interaction and stacks multiple video feature streams in a discounted fashion. Each feature stack's learned attributes contribute to our proposed diversity encoding module, followed by the decoding query stage, to facilitate end-to-end diverse and natural captions without any explicit supervision on attributes. We evaluate VSLAN on the MSVD and MSR-VTT datasets in terms of syntax and diversity. The CIDEr score of VSLAN outperforms current off-the-shelf methods by 7.8% on MSVD and 4.5% on MSR-VTT. On the same datasets, VSLAN achieves competitive results in caption diversity metrics.
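The abstract names low-rank bilinear pooling as the building block for feature interaction between video feature streams. As a rough illustration of that generic technique only, and not the paper's exact VSLAN module, a minimal PyTorch sketch of low-rank bilinear fusion of two feature streams is given below; all class names, dimensions, and shapes are assumptions made for the example.

```python
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """Generic low-rank bilinear pooling: two feature streams are projected
    into a shared low-rank space, combined with an element-wise (Hadamard)
    product, and re-projected. Illustrative sketch only, not VSLAN itself."""

    def __init__(self, x_dim, y_dim, rank_dim, out_dim, dropout=0.1):
        super().__init__()
        self.proj_x = nn.Linear(x_dim, rank_dim)      # low-rank projection of stream x
        self.proj_y = nn.Linear(y_dim, rank_dim)      # low-rank projection of stream y
        self.proj_out = nn.Linear(rank_dim, out_dim)  # fuse joint representation
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, y):
        # The Hadamard product of the two low-rank projections approximates
        # a full bilinear interaction at much lower parameter cost.
        joint = torch.tanh(self.proj_x(x)) * torch.tanh(self.proj_y(y))
        return self.proj_out(self.dropout(joint))


# Hypothetical usage: fusing per-frame appearance and motion features
appearance = torch.randn(8, 20, 2048)  # (batch, frames, dim) -- assumed shapes
motion = torch.randn(8, 20, 1024)
fusion = LowRankBilinearPooling(x_dim=2048, y_dim=1024, rank_dim=512, out_dim=512)
fused = fusion(appearance, motion)     # -> (8, 20, 512)
```

Such a fused representation could then feed an attention or decoding stage; how VSLAN stacks and discounts multiple streams is described in the full paper.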
ISSN: 2642-9381
DOI: 10.1109/WACV51458.2022.00255