Empowering Vision-Language Tuning with Machine Translation for Driving Scene Caption Generation
| Published in | IEICE Transactions on Information and Systems, Vol. E108.D, No. 9, pp. 1082-1094 |
| Format | Journal Article |
| Language | English |
| Published | The Institute of Electronics, Information and Communication Engineers, 01.09.2025 |
| ISSN | 0916-8532, 1745-1361 |
| DOI | 10.1587/transinf.2024EDP7126 |
Summary: In the Autonomous Driving (AD) scenario, accurate, informative, and understandable descriptions of the traffic conditions and the ego-vehicle motions can increase the interpretability of an autonomous driving system for the vehicle user. End-to-end free-form video captioning is a straightforward vision-to-text task to address such needs. However, insufficient real-world driving scene descriptive data hinders the performance of caption generation under a simple supervised training paradigm. Recently, large-scale Vision-Language Pre-training (VLP) foundation models have attracted much attention from the community, and tuning large foundation models on task-specific datasets has become a prevailing paradigm for caption generation. For application in autonomous driving, however, we often encounter large gaps between the training data of VLP foundation models and real-world driving scene captioning data, which impedes the immense potential of VLP foundation models. In this paper, we propose to tackle this problem via a unified framework for cross-lingual, cross-domain vision-language tuning empowered by Machine Translation (MT) techniques. We aim to obtain a Japanese driving scene captioning system from a domain-general, English-centric VLP model. The framework comprises two core components: (i) bidirectional knowledge distillation by MT teachers; (ii) fusing objectives for cross-lingual fine-tuning. Moreover, we introduce three schedulers to operate the vision-language tuning process with fusing objectives. Based on GIT, we implement our framework and verify its effectiveness on real-world driving scenes with natural caption texts annotated by experienced vehicle users. Caption generation with our framework shows a significant advantage over the baseline settings.
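As a rough illustration of the fusing-objective idea described in the summary, the sketch below blends an MT-distillation loss with a Japanese captioning loss under a step-dependent scheduler. This is not the paper's released code; the function names, the specific losses, and the linear schedule are assumptions made only for illustration.

```python
# Minimal sketch (assumed, not the paper's implementation): fuse an
# MT-distillation objective with a target-language (Japanese) captioning
# objective, with a scheduler shifting the mixing weight during fine-tuning.

def linear_schedule(step: int, total_steps: int) -> float:
    """Mixing weight that decays linearly from 1.0 to 0.0 over training."""
    return max(0.0, 1.0 - step / total_steps)

def fused_loss(loss_mt_distill: float, loss_ja_caption: float,
               step: int, total_steps: int) -> float:
    """Weighted sum of the two objectives; the scheduler sets the trade-off."""
    w = linear_schedule(step, total_steps)
    return w * loss_mt_distill + (1.0 - w) * loss_ja_caption

# Early in training the MT-distillation term dominates; later the Japanese
# captioning term takes over.
print(fused_loss(2.3, 1.1, step=100, total_steps=1000))  # closer to 2.3
print(fused_loss(2.3, 1.1, step=900, total_steps=1000))  # closer to 1.1
```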