A Large Chinese Text Dataset in the Wild

Bibliographic Details
Published in	Journal of Computer Science and Technology, Vol. 34, no. 3, pp. 509-521
Main Authors	Yuan, Tai-Ling; Zhu, Zhe; Xu, Kun; Li, Cheng-Jun; Mu, Tai-Jiang; Hu, Shi-Min
Format	Journal Article
Language	English
Published	New York: Springer US, 01.05.2019; Springer; Springer Nature B.V.
Author Affiliations	Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; Department of Radiology, Duke University, North Carolina 27708, U.S.A.; Tencent Technology (Beijing) Co. Ltd., Beijing 100080, China
Subjects	Annotations; Artificial Intelligence; Computer Science; Data Structures and Information Theory; Datasets; Information Systems Applications (incl. Internet); Machine learning; Optical character recognition; Regular Paper; Software Engineering; Source code; Theory of Computation; Chinese text dataset; Chinese text recognition; Chinese text detection
Summary:	In this paper, we introduce a very large Chinese text dataset in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, the detection and recognition of text in natural images is still a challenging problem, especially for more complex character sets such as Chinese. Lack of training data has long been a problem, particularly for deep learning methods, which require massive amounts of it. We provide details of a newly created dataset of Chinese text containing about 1 million character instances covering 3 850 unique characters, annotated by experts in over 30 000 street view images. This is a challenging dataset with good diversity, containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. For each character, the annotation includes its underlying character, bounding box, and six attributes. The attributes indicate the character’s background complexity, appearance, style, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (AP of 70.9%), and text line detection (AED of 22.1). The dataset, source code, and trained models are publicly available.
ISSN:	1000-9000; 1860-4749
DOI:	10.1007/s11390-019-1923-y
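
Illustration:	The Summary describes a per-character annotation (underlying character, bounding box, attributes) and reports top-1 accuracy and AED as baseline metrics. The sketch below shows one way such a record and two of these metrics might look in Python. It is not the dataset's file format or the authors' evaluation code. The class `CharAnnotation`, its field names, the `(x, y, width, height)` box convention, and the attribute label in the usage lines are all hypothetical. It also assumes that AED denotes an average edit distance, computed here as a plain Levenshtein distance over already-matched text pairs; the paper's actual matching protocol may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharAnnotation:
    """One annotated character instance (hypothetical layout, not the dataset's format)."""
    text: str                                  # underlying character
    bbox: tuple[float, float, float, float]    # assumed (x, y, width, height) in pixels
    attributes: frozenset[str] = frozenset()   # names of attributes that apply (made-up labels)


def top1_accuracy(predicted: list[str], truth: list[str]) -> float:
    """Fraction of positions where the top-ranked prediction equals the ground truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        return 0.0
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions turning a into b."""
    prev = list(range(len(b) + 1))             # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                             # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def average_edit_distance(pairs: list[tuple[str, str]]) -> float:
    """Mean edit distance over (predicted, truth) text pairs; assumes pairs are already matched."""
    if not pairs:
        return 0.0
    return sum(edit_distance(p, t) for p, t in pairs) / len(pairs)


# Usage with made-up values (the attribute label is illustrative only).
ann = CharAnnotation(text="中", bbox=(10.0, 20.0, 32.0, 32.0),
                     attributes=frozenset({"occluded"}))
acc = top1_accuracy(["中", "国"], ["中", "文"])
aed = average_edit_distance([("清华大学", "清华大"), ("北京", "北京")])
```

Storing the attributes as a set of labels keeps the sketch independent of the six actual attribute names, which the Summary does not list in full.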