Empowering Vision-Language Tuning with Machine Translation for Driving Scene Caption Generation

Bibliographic Details
Published in: IEICE Transactions on Information and Systems, Vol. E108.D, No. 9, pp. 1082-1094
Main Authors: ZHOU, Lei; SASANO, Ryohei; TAKEDA, Koichi
Format: Journal Article
Language: English
Published: The Institute of Electronics, Information and Communication Engineers (一般社団法人 電子情報通信学会), 01.09.2025
ISSN: 0916-8532, 1745-1361
DOI: 10.1587/transinf.2024EDP7126

Summary: In the Autonomous Driving (AD) scenario, accurate, informative, and understandable descriptions of the traffic conditions and the ego-vehicle motions can increase the interpretability of an autonomous driving system for the vehicle user. End-to-end free-form video captioning is a straightforward vision-to-text task to address such needs. However, insufficient real-world driving scene descriptive data hinders the performance of caption generation under a simple supervised training paradigm. Recently, large-scale Vision-Language Pre-training (VLP) foundation models have attracted much attention from the community, and tuning large foundation models on task-specific datasets has become a prevailing paradigm for caption generation. For applications in autonomous driving, however, large gaps often exist between the training data of VLP foundation models and real-world driving scene captioning data, which keeps the immense potential of VLP foundation models from being realized. In this paper, we tackle this problem via a unified framework for cross-lingual, cross-domain vision-language tuning empowered by Machine Translation (MT) techniques. We aim to obtain a captioning system for driving scene caption generation in Japanese from a domain-general, English-centric VLP model. The framework comprises two core components: (i) bidirectional knowledge distillation by MT teachers; and (ii) fusing objectives for cross-lingual fine-tuning. Moreover, we introduce three schedulers to steer the vision-language tuning process with the fused objectives. Based on GIT, we implement our framework and verify its effectiveness on real-world driving scenes with natural caption texts annotated by experienced vehicle users. The caption generation performance with our framework shows a significant advantage over the baseline settings.
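The abstract mentions "fusing objectives" for cross-lingual fine-tuning and three schedulers that govern the tuning process, but gives no formulas. The sketch below is a minimal, hypothetical illustration of one way such a scheme could look: a fused loss that mixes an English-caption objective and a Japanese-caption objective with a step-dependent weight, and three scheduler shapes (constant, linear, cosine) chosen here purely for illustration. The function names, the loss placeholders, and the specific scheduler forms are assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of objective fusing with schedulers, assuming the fused
# loss is a weighted sum of an English-caption objective and a Japanese-caption
# objective whose mixing weight alpha changes over training steps.
import math


def constant_scheduler(step: int, total_steps: int, alpha: float = 0.5) -> float:
    """Keep a fixed balance between the two objectives."""
    return alpha


def linear_scheduler(step: int, total_steps: int) -> float:
    """Shift weight linearly from the English objective toward the Japanese one."""
    return 1.0 - step / max(total_steps, 1)


def cosine_scheduler(step: int, total_steps: int) -> float:
    """Shift the weight smoothly along a half-cosine curve."""
    return 0.5 * (1.0 + math.cos(math.pi * step / max(total_steps, 1)))


def fused_loss(loss_en: float, loss_ja: float, step: int, total_steps: int,
               scheduler=cosine_scheduler) -> float:
    """Combine the two objectives with a step-dependent weight alpha."""
    alpha = scheduler(step, total_steps)
    return alpha * loss_en + (1.0 - alpha) * loss_ja


# Example: early steps lean on the (source-language) English objective,
# later steps on the (target-language) Japanese objective.
for step in (0, 500, 1000):
    print(step, fused_loss(loss_en=2.0, loss_ja=3.0, step=step, total_steps=1000))
```

Under this reading, each scheduler is just a different trajectory for the mixing weight; which of the three the paper actually uses, and how the two per-language losses are computed from the MT-teacher distillation targets, is only specified in the full text.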