Reichsanzeiger-GT: An OCR ground truth dataset based on the historical newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette) (1819–1945)

Bibliographic Details
Published in	Data in brief Vol. 54; p. 110274
Main Authors	Schmidt, Thomas; Kamlah, Jan; Weil, Stefan
Format	Journal Article
Language	English
Published	Netherlands: Elsevier Inc, 01.06.2024
Subjects	Ground truth; Historical newspapers; OCR; Text recognition

Summary:	Reichsanzeiger-GT is a ground truth dataset for OCR training and evaluation based on the historical German newspaper “Deutscher Reichsanzeiger und Preußischer Staatsanzeiger” (German Imperial Gazette and Prussian Official Gazette), which was published from 1819 to 1945 and printed mostly in the typeface Fraktur (Black Letter). The dataset consists of 101 newspaper pages from the years 1820–1939, which cover a wide variety of topics, page layouts (lists, tables, and advertisements), and typefaces. Using the transcription software Transkribus and the open-source OCR engine Tesseract, we automatically created and manually corrected layout segmentations and transcriptions for each page, resulting in 65,563 text regions, 412 table regions, 119,429 text lines and 490,679 words. Because the transcription guidelines preserve the printed form, the dataset retains language- and printing-specific phenomena such as the long s (ſ), the rotunda r (ꝛ), and historical currency symbols (M, ₰), among others. The dataset is provided in two variants in PAGE XML format. The first contains ground truth data with table regions transformed to text regions for easier processing. The second preserves all table regions. Researchers can reuse this dataset to train new text recognition or layout segmentation models or to fine-tune existing ones. The dataset can also be used to evaluate the accuracy of existing OCR models. Because it follows specific, community-driven transcription guidelines, the dataset is easily interoperable and reusable with other datasets based on the same transcription level.
ISSN:	2352-3409
DOI:	10.1016/j.dib.2024.110274
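
The Summary states that the dataset is provided in PAGE XML and reports counts of text regions, table regions, text lines and words. The following is a minimal sketch of how such counts and the line-level transcriptions might be read from a single PAGE XML file using only the Python standard library. The SAMPLE fragment is hand-written for illustration: it is not taken from the dataset, it omits elements such as Coords that a schema-valid file would contain, and its namespace version is an assumption. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written, simplified fragment (assumption: NOT from the dataset and not
# schema-valid, e.g. Coords are omitted). The namespace version is also assumed.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Preußiſcher</Unicode></TextEquiv></Word>
        <Word id="w2"><TextEquiv><Unicode>Staatsanzeiger</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Preußiſcher Staatsanzeiger</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TableRegion id="t1"/>
  </Page>
</PcGts>"""


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix so the code does not depend on the schema version."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(xml_text: str) -> Counter:
    """Count the structural elements named in the Summary."""
    wanted = {"TextRegion", "TableRegion", "TextLine", "Word"}
    root = ET.fromstring(xml_text)
    names = (local_name(el.tag) for el in root.iter())
    return Counter(name for name in names if name in wanted)


def line_texts(xml_text: str) -> list[str]:
    """Collect the line-level transcription of every TextLine, in document order."""
    root = ET.fromstring(xml_text)
    texts = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        # Only look at direct children, so word-level TextEquiv is skipped.
        for child in line:
            if local_name(child.tag) != "TextEquiv":
                continue
            for node in child:
                if local_name(node.tag) == "Unicode":
                    texts.append(node.text or "")
    return texts


if __name__ == "__main__":
    print(dict(count_page_elements(SAMPLE)))
    print(line_texts(SAMPLE))
```

Matching on local element names rather than a fixed namespace keeps the sketch usable if the files declare a different PAGE schema version. For real files, the text would be read from disk and the per-file counters summed.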