Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles

Scientific articles published prior to the "age of digitization" (~1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for the generation of a synthetic ground truth/OCR dat...

Bibliographic Details
Main Authors Naiman, Jill P; Cosillo, Morgan G; Williams, Peter K. G; Goodman, Alyssa
Format Journal Article
Language English
Published 20.09.2023
