Reichsanzeiger-GT: An OCR ground truth dataset based on the historical newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette) (1819–1945)
Reichsanzeiger-GT is a ground truth dataset for OCR training and evaluation based on the historical German newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette), which was published from 1819 to 1945 and printed mostly in the type...
Saved in:
Published in | Data in brief Vol. 54; p. 110274 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published |
Netherlands
Elsevier Inc
01.06.2024
Elsevier |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Reichsanzeiger-GT is a ground truth dataset for OCR training and evaluation based on the historical German newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette), which was published from 1819 to 1945 and printed mostly in the typeface Fraktur (Black Letter). The dataset consists of 101 newspaper pages for the years 1820–1939, that cover a wide variety of topics, page layouts (lists, tables, and advertisements) as well as different typefaces. Using the transcription software Transkribus and the open-source OCR engine Tesseract we automatically created and manually corrected layout segmentations and transcriptions for each page, resulting in 65,563 text regions, 412 table regions, 119,429 text lines and 490,679 words. By applying transcription guidelines that preserve the printing conditions, the dataset contains language and printing specific phenomena like the historical use of glyphs like long s (ſ), rotunda r (ꝛ), and historical currency symbols (M, ₰) among others. The dataset is provided in two variants in PAGE XML format. The first one contains ground truth data with table regions transformed to text regions for easier processing. The second variant preserves all table regions. Researchers can reuse this dataset to train new or finetune existing text recognition or layout segmentation models. The dataset can also be used to evaluate the accuracy of existing OCR models. Using specific, community driven transcription guidelines our dataset is easily interoperable and reusable with other datasets based on the same transcription level. |
---|---|
Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
ISSN: | 2352-3409 2352-3409 |
DOI: | 10.1016/j.dib.2024.110274 |