Empowering Vision-Language Tuning with Machine Translation for Driving Scene Caption Generation

Bibliographic Details
Published in: IEICE Transactions on Information and Systems, Vol. E108.D, No. 9, pp. 1082-1094
Main Authors: ZHOU, Lei; SASANO, Ryohei; TAKEDA, Koichi
Format: Journal Article
Language: English
Published: The Institute of Electronics, Information and Communication Engineers (一般社団法人 電子情報通信学会), 01.09.2025
ISSN: 0916-8532, 1745-1361
DOI: 10.1587/transinf.2024EDP7126

Summary: In the Autonomous Driving (AD) scenario, accurate, informative, and understandable descriptions of the traffic conditions and the ego-vehicle motions can increase the interpretability of an autonomous driving system for the vehicle user. End-to-end free-form video captioning is a straightforward vision-to-text task to address such needs. However, insufficient real-world driving scene descriptive data hinders the performance of caption generation under a simple supervised training paradigm. Recently, large-scale Vision-Language Pre-training (VLP) foundation models have attracted much attention from the community, and tuning large foundation models on task-specific datasets has become a prevailing paradigm for caption generation. For applications in autonomous driving, however, large gaps often exist between the training data of VLP foundation models and real-world driving scene captioning data, which keeps the immense potential of VLP foundation models from being realized. In this paper, we tackle this problem via a unified framework for cross-lingual, cross-domain vision-language tuning empowered by Machine Translation (MT) techniques. We aim to obtain a captioning system for driving scene caption generation in Japanese from a domain-general, English-centric VLP model. The framework comprises two core components: (i) bidirectional knowledge distillation by MT teachers; and (ii) fusing objectives for cross-lingual fine-tuning. Moreover, we introduce three schedulers to steer the vision-language tuning process with the fused objectives. Based on GIT, we implement our framework and verify its effectiveness on real-world driving scenes with natural caption texts annotated by experienced vehicle users. The caption generation performance with our framework shows a significant advantage over the baseline settings.
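The abstract mentions "fusing objectives" for cross-lingual fine-tuning and three schedulers that govern the tuning process, but gives no formulas. The sketch below is a minimal, hypothetical illustration of one way such a scheme could look: a fused loss that mixes an English-caption objective and a Japanese-caption objective with a step-dependent weight, and three scheduler shapes (constant, linear, cosine) chosen here purely for illustration. The function names, the loss placeholders, and the specific scheduler forms are assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of objective fusing with schedulers, assuming the fused
# loss is a weighted sum of an English-caption objective and a Japanese-caption
# objective whose mixing weight alpha changes over training steps.
import math


def constant_scheduler(step: int, total_steps: int, alpha: float = 0.5) -> float:
    """Keep a fixed balance between the two objectives."""
    return alpha


def linear_scheduler(step: int, total_steps: int) -> float:
    """Shift weight linearly from the English objective toward the Japanese one."""
    return 1.0 - step / max(total_steps, 1)


def cosine_scheduler(step: int, total_steps: int) -> float:
    """Shift the weight smoothly along a half-cosine curve."""
    return 0.5 * (1.0 + math.cos(math.pi * step / max(total_steps, 1)))


def fused_loss(loss_en: float, loss_ja: float, step: int, total_steps: int,
               scheduler=cosine_scheduler) -> float:
    """Combine the two objectives with a step-dependent weight alpha."""
    alpha = scheduler(step, total_steps)
    return alpha * loss_en + (1.0 - alpha) * loss_ja


# Example: early steps lean on the (source-language) English objective,
# later steps on the (target-language) Japanese objective.
for step in (0, 500, 1000):
    print(step, fused_loss(loss_en=2.0, loss_ja=3.0, step=step, total_steps=1000))
```

Under this reading, each scheduler is just a different trajectory for the mixing weight; which of the three the paper actually uses, and how the two per-language losses are computed from the MT-teacher distillation targets, is only specified in the full text.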