VTD-FCENet: A Real-Time HD Video Text Detection with Scale-Aware Fourier Contour Embedding

Bibliographic Details
Published in	IEICE Transactions on Information and Systems Vol. E107.D; no. 4; pp. 574 - 578
Main Authors	XIAO, Wocheng, LIANG, Lingyu, CHEN, Jianyong, WANG, Tao
Format	Journal Article
Language	English
Published	Tokyo: The Institute of Electronics, Information and Communication Engineers, 01.04.2024; Japan Science and Technology Agency
Subjects	Accuracy; Contours; Embedding; Lightweight; Real time; scene text detection; Video; video text detection; Weight reduction
Summary:	Video text detection (VTD) aims to localize text instances in videos and has wide applications in downstream tasks. To handle the variation across scenes and text instances, existing VTD methods typically integrate multiple models and feature fusion strategies. A VTD method built from sophisticated components can effectively improve detection accuracy, but may be limited in real-time applications. This paper aims to achieve real-time VTD with an adaptive, lightweight, end-to-end framework. Unlike previous methods that represent text in the spatial domain, we model text instances in the Fourier domain. Specifically, we propose a scale-aware Fourier Contour Embedding (FCE) method, which not only models arbitrarily shaped text contours in videos as compact signatures, but also adaptively selects proper scales for backbone features during training. We then construct VTD-FCENet to achieve real-time VTD, which encodes temporal correlations of adjacent frames with scale-aware FCE in a lightweight and adaptive manner. Quantitative evaluations on the ICDAR2013 Video, Minetto, and YVT benchmark datasets show that VTD-FCENet not only achieves state-of-the-art or competitive detection accuracy, but also enables real-time text detection on HD videos.
ISSN:	0916-8532 1745-1361
DOI:	10.1587/transinf.2023EDL8030