TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Format | Conference Paper |
---|---|
Language | English |
Published | 14.04.2024 |
Online Access | Get full text |
DOI | 10.48550/arxiv.2404.09275 |
Summary | Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) Workshops, 2024, pp. 7134-7143. Traffic video
description and analysis have received much attention recently due to the
growing demand for efficient and reliable urban surveillance systems. Most
existing methods focus only on locating traffic event segments and severely
lack descriptive details about the behaviour and context of the subjects of
interest in those events. In this paper, we present TrafficVLM, a novel
multi-modal dense video captioning model for the vehicle ego-camera view.
TrafficVLM models traffic video events at multiple levels of analysis, both
spatially and temporally, and generates long, fine-grained descriptions for
the vehicle and the pedestrian at different phases of the event. We also
propose a conditional component for TrafficVLM to control the generation
outputs, and a multi-task fine-tuning paradigm to enhance TrafficVLM's
learning capability. Experiments show that TrafficVLM performs well on both
the vehicle and overhead camera views. Our solution achieved outstanding
results in Track 2 of the AI City Challenge 2024, placing third in the
challenge standings. Our code is publicly available at
https://github.com/quangminhdinh/TrafficVLM. |
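The abstract mentions a conditional component that controls the generated
captions but does not describe how it is implemented. The sketch below is a
hypothetical illustration of one common way to realise such control: a learned
condition embedding (e.g. a vehicle vs. pedestrian caption target) prepended
to the visual features before decoding. All names here (`ControllableCaptioner`,
`condition_emb`, the dimensions) are assumptions for illustration, not details
taken from the paper.

```python
# Hypothetical sketch, NOT the authors' code: a learned condition token
# steers a shared captioning backbone toward one caption target.
import torch
import torch.nn as nn

class ControllableCaptioner(nn.Module):
    def __init__(self, feat_dim=768, num_conditions=2, vocab_size=32000):
        super().__init__()
        # One learned embedding per caption target (e.g. 0=vehicle, 1=pedestrian).
        self.condition_emb = nn.Embedding(num_conditions, feat_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, video_feats, condition_id):
        # video_feats: (batch, frames, feat_dim) pooled visual features.
        cond = self.condition_emb(condition_id).unsqueeze(1)  # (batch, 1, d)
        # Prepend the condition token so attention can route information
        # differently per caption target.
        x = torch.cat([cond, video_feats], dim=1)
        h = self.encoder(x)
        # A real model would decode tokens autoregressively over h; projecting
        # the condition position is a stand-in to keep the sketch minimal.
        return self.lm_head(h[:, 0])

# Toy usage: the same clip with two condition ids yields two different outputs.
model = ControllableCaptioner()
feats = torch.randn(1, 16, 768)
vehicle_logits = model(feats, torch.tensor([0]))
pedestrian_logits = model(feats, torch.tensor([1]))
```

Under this reading, the multi-task fine-tuning paradigm the abstract mentions
could correspond to jointly optimising the captioning loss across several such
condition ids and camera views, though the paper itself should be consulted
for the actual design.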