An attention-based hybrid deep learning approach for Bengali video captioning

Bibliographic Details
Published in	Journal of King Saud University. Computer and information sciences, Vol. 35, no. 1, pp. 257-269
Main Authors	Zaoad, Md. Shahir; Mannan, M.M. Rushadul; Mandol, Angshu Bikash; Rahman, Mostafizur; Islam, Md. Adnanul; Rahman, Md. Mahbubur
Format	Journal Article
Language	English
Published	Elsevier B.V., 01.01.2023
Subjects	Attention-mechanism; Bengali video captioning; Convolutional neural network; Encoder-decoder model; Recurrent neural network
Summary:	Video captioning is the automated process of generating a caption for a video by understanding its content. Although numerous studies have addressed video captioning in English, video captioning in Bengali remains nearly unexplored. This research therefore aims to generate Bengali captions that plausibly describe the gist of a given video and to identify the best-performing model for Bengali video captioning. To accomplish this, several sequence-to-sequence models (LSTM, BiLSTM, and GRU) are implemented. Each takes as input the video frame features extracted by one of several CNN models (VGG-19, InceptionV3, and ResNet50V2) and produces a corresponding textual description as output. Moreover, the attention mechanism is incorporated into these models for the first time in Bengali video captioning. For this study, a novel Bengali video captioning dataset is constructed from the Microsoft Research Video Description Corpus (MSVD), an English video captioning dataset, using a deep learning-based translator followed by manual post-editing. Finally, each model's performance is evaluated with the widely used metrics BLEU, METEOR, and ROUGE. The proposed attention-based hybrid model outperforms the existing models on these metrics, establishing a new benchmark for Bengali video captioning.
ISSN:	1319-1578 2213-1248
DOI:	10.1016/j.jksuci.2022.11.015
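
Illustration: the summary describes a decoder that attends over per-frame CNN features. The record does not state which attention variant the authors use, so the following is only a minimal sketch that assumes plain dot-product attention. The names `softmax`, `attend`, `frames`, and `state`, and the toy two-dimensional features, are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, frame_features):
    """Dot-product attention (an assumed variant, not the paper's stated one).

    decoder_state:  the current decoder hidden state, a list of floats
    frame_features: one feature vector per video frame, same length as the state
    Returns the attention weights and the weighted-sum context vector.
    """
    # Score each frame by its dot product with the decoder state.
    scores = [
        sum(q * k for q, k in zip(decoder_state, frame))
        for frame in frame_features
    ]
    weights = softmax(scores)
    dim = len(frame_features[0])
    # The context vector is the attention-weighted average of the frame features.
    context = [
        sum(w * frame[i] for w, frame in zip(weights, frame_features))
        for i in range(dim)
    ]
    return weights, context


# Toy stand-ins for CNN frame features and a decoder state (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attend(state, frames)
```

In a full captioning model the context vector would be combined with the decoder state at each step before predicting the next caption token. Here the first and third frames receive equal weight because they have the same dot product with the state.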