Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles
Main Authors: | , , , |
---|---|
Format: | Journal Article |
Language: | English |
Published: | 20.09.2023 |
Summary: | Scientific articles published prior to the "age of digitization"
(~1997) require Optical Character Recognition (OCR) to transform scanned
documents into machine-readable text, a process that often introduces errors.
We develop a pipeline for generating a synthetic ground truth/OCR dataset to
correct the OCR results of the astrophysics literature holdings of the NASA
Astrophysics Data System (ADS). By mining the arXiv we create, to the authors'
knowledge, the largest scientific synthetic ground truth/OCR post correction
dataset, comprising 203,354,393 character pairs. We provide baseline models
trained on this dataset and find mean improvements in character and word error
rates of 7.71% and 18.82%, respectively, on historical OCR text. When used to
classify parts of sentences as inline math, the models achieve a
classification F1 score of 77.82%. Interactive dashboards for exploring the
dataset are available online:
https://readingtimemachine.github.io/projects/1-ocr-groundtruth-may2023, and
data and code, within the limitations of our agreement with the arXiv, are
hosted on GitHub: https://github.com/ReadingTimeMachine/ocr_post_correction. |
---|---|
DOI: | 10.48550/arXiv.2309.11549 |
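
For context on the metrics reported in the summary, character error rate (CER)
and word error rate (WER) are conventionally defined as the Levenshtein (edit)
distance between the OCR output and the ground truth, normalized by the length
of the ground truth, over characters and whitespace-delimited words,
respectively. The sketch below illustrates these standard definitions; the
function names (`levenshtein`, `cer`, `wer`) and the toy example are
illustrative only, and this is not the paper's own evaluation code, which is
hosted in the GitHub repository linked above.

```python
# Minimal sketch of the standard CER/WER metrics mentioned in the summary,
# assuming the usual Levenshtein (edit-distance) definitions. Illustrative
# only; not the paper's actual evaluation code.

def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (0 if characters match)
            ))
        prev = curr
    return prev[-1]

def cer(ground_truth: str, ocr_output: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)

def wer(ground_truth: str, ocr_output: str) -> float:
    """Word error rate: same idea, computed over whitespace-delimited tokens."""
    ref, hyp = ground_truth.split(), ocr_output.split()
    return levenshtein(ref, hyp) / len(ref)

if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr = "tlie quick brovvn fox"   # typical OCR confusions: h -> li, w -> vv
    print(f"CER: {cer(truth, ocr):.3f}")  # 4 character edits / 19 characters
    print(f"WER: {wer(truth, ocr):.3f}")  # 2 wrong words / 4 words
```

On this reading, the paper's reported improvements mean that applying the
baseline post-correction models reduces CER and WER on historical OCR text by
7.71% and 18.82% on average.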