A Large Chinese Text Dataset in the Wild
Published in | Journal of computer science and technology Vol. 34; no. 3; pp. 509 - 521 |
---|---|
Format | Journal Article |
Language | English |
Published | New York: Springer US, 01.05.2019 (Springer; Springer Nature B.V) |
Author Affiliations | Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China; Department of Radiology, Duke University, North Carolina 27708, U.S.A.; Tencent Technology (Beijing) Co. Ltd., Beijing 100080, China |
Summary: | In this paper, we introduce a very large Chinese text dataset in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, the detection and recognition of text in natural images is still a challenging problem, especially for some more complicated character sets such as Chinese text. Lack of training data has always been a problem, especially for deep learning methods which require massive training data. In this paper, we provide details of a newly created dataset of Chinese text with about 1 million Chinese characters from 3 850 unique ones annotated by experts in over 30 000 street view images. This is a challenging dataset with good diversity containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. For each character, the annotation includes its underlying character, bounding box, and six attributes. The attributes indicate the character’s background complexity, appearance, style, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (AP of 70.9%), and text line detection (AED of 22.1). The dataset, source code, and trained models are publicly available. |
---|---|
ISSN: | 1000-9000 1860-4749 |
DOI: | 10.1007/s11390-019-1923-y |
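
The summary states that each character annotation carries its underlying character, a bounding box, and six attributes, and that character recognition is scored by top-1 accuracy. The following is a minimal Python sketch of how such a record and metric might be represented. The class name `CharAnnotation`, the `(x, y, width, height)` box layout, and the attribute key `complex_background` are assumptions made for illustration; they are not taken from the dataset's actual file format or released source code.

```python
from dataclasses import dataclass, field


@dataclass
class CharAnnotation:
    """One annotated character (hypothetical layout, not the dataset's real schema)."""
    text: str                                  # the underlying character
    bbox: tuple[float, float, float, float]    # assumed (x, y, width, height) in pixels
    attributes: dict[str, bool] = field(default_factory=dict)  # assumed attribute flags


def top1_accuracy(predictions: list[str], annotations: list[CharAnnotation]) -> float:
    """Fraction of characters whose single best prediction equals the annotated character."""
    if len(predictions) != len(annotations):
        raise ValueError("one prediction per annotation is required")
    if not annotations:
        return 0.0
    correct = sum(p == a.text for p, a in zip(predictions, annotations))
    return correct / len(annotations)


# Toy data only; these are not samples from the dataset.
samples = [
    CharAnnotation("中", (10.0, 20.0, 32.0, 32.0), {"complex_background": False}),
    CharAnnotation("国", (44.0, 20.0, 32.0, 32.0), {"complex_background": True}),
    CharAnnotation("文", (78.0, 20.0, 32.0, 32.0)),
    CharAnnotation("字", (112.0, 20.0, 32.0, 32.0)),
]
predicted = ["中", "国", "文", "学"]  # last prediction is deliberately wrong

accuracy = top1_accuracy(predicted, samples)
print(f"top-1 accuracy: {accuracy:.2f}")
```

Keeping the attributes in a dictionary avoids committing to attribute names that the summary does not list.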