Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles
Scientific articles published prior to the "age of digitization" (~1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for the generation of a synthetic ground truth/OCR dat...
Format | Journal Article |
---|---|
Language | English |
Published | 20.09.2023 |