A Large Chinese Text Dataset in the Wild

Bibliographic Details
Published in: Journal of Computer Science and Technology, Vol. 34, no. 3, pp. 509-521
Main Authors: Yuan, Tai-Ling; Zhu, Zhe; Xu, Kun; Li, Cheng-Jun; Mu, Tai-Jiang; Hu, Shi-Min
Format: Journal Article
Language: English
Published: New York, Springer US, 01.05.2019
Springer
Springer Nature B.V
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Department of Radiology, Duke University, North Carolina 27708, U.S.A.
Tencent Technology (Beijing) Co. Ltd., Beijing 100080, China

Summary: In this paper, we introduce a very large Chinese text dataset in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, the detection and recognition of text in natural images is still a challenging problem, especially for some more complicated character sets such as Chinese text. Lack of training data has always been a problem, especially for deep learning methods which require massive training data. In this paper, we provide details of a newly created dataset of Chinese text with about 1 million Chinese characters from 3 850 unique ones annotated by experts in over 30 000 street view images. This is a challenging dataset with good diversity containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. For each character, the annotation includes its underlying character, bounding box, and six attributes. The attributes indicate the character’s background complexity, appearance, style, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (AP of 70.9%), and text line detection (AED of 22.1). The dataset, source code, and trained models are publicly available.
ISSN: 1000-9000
1860-4749
DOI: 10.1007/s11390-019-1923-y