CREPE: Coordinate-Aware End-to-End Document Parser
Main Authors | , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 30.04.2024 |
Subjects | |
Summary: | In this study, we formulate an OCR-free sequence generation model for visual
document understanding (VDU). Our model not only parses text from document
images but also extracts the spatial coordinates of the text using a
multi-head architecture. Named Coordinate-aware End-to-end Document Parser
(CREPE), our method uniquely integrates these capabilities by introducing a
special token for OCR text and token-triggered coordinate decoding. We also
propose a weakly-supervised framework for cost-efficient training, requiring
only parsing annotations without high-cost coordinate annotations. Our
experimental evaluations demonstrate CREPE's state-of-the-art performance on
document parsing tasks. Beyond that, CREPE's adaptability is further
highlighted by its successful use in other document understanding tasks such
as layout analysis, document visual question answering, and so on. CREPE's
abilities, including OCR and semantic parsing, not only mitigate error
propagation issues in existing OCR-dependent methods but also significantly
enhance the functionality of sequence generation models, ushering in a new era
for document understanding studies. |
---|---|
DOI: | 10.48550/arxiv.2405.00260 |
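
The summary mentions "a special token for OCR text" and "token-triggered coordinate decoding" without giving details. The following is only a minimal sketch of how such a trigger could work, not the authors' implementation. The marker name `</ocr>`, the `Step` record, the `(x1, y1, x2, y2)` box format, and `toy_coord_head` are all hypothetical stand-ins; in a real model the coordinate head would be a learned layer applied to the decoder's hidden state.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical marker name; the summary only says "a special token for OCR text".
OCR_END = "</ocr>"

# Assumed box format: normalized (x1, y1, x2, y2).
Box = Tuple[float, ...]


@dataclass
class Step:
    """One decoding step: the emitted token and a stand-in hidden state."""
    token: str
    hidden: Sequence[float]


def is_structural(token: str) -> bool:
    # Assumption: parsing tags look like <name> or </name>.
    return token.startswith("<") and token.endswith(">")


def decode_with_coordinates(
    steps: List[Step],
    coord_head: Callable[[Sequence[float]], Box],
) -> Tuple[List[str], List[Tuple[str, Box]]]:
    """Collect the token sequence and call the coordinate head only
    when the special OCR token is emitted (the "trigger")."""
    tokens: List[str] = []
    spans: List[Tuple[str, Box]] = []
    current: List[str] = []
    for step in steps:
        tokens.append(step.token)
        if step.token == OCR_END:
            # Trigger: map this step's hidden state to a box for the
            # text accumulated since the last tag or trigger.
            spans.append((" ".join(current), coord_head(step.hidden)))
            current = []
        elif is_structural(step.token):
            current = []  # structural tags carry no OCR text
        else:
            current.append(step.token)
    return tokens, spans


def toy_coord_head(hidden: Sequence[float]) -> Box:
    # Stand-in for a learned head: clamp the first four values to [0, 1].
    return tuple(min(1.0, max(0.0, v)) for v in hidden[:4])


zero = [0.0, 0.0, 0.0, 0.0]
steps = [
    Step("<item>", zero),
    Step("Total", zero),
    Step(OCR_END, [0.1, 0.2, 0.3, 0.25]),
    Step("9.50", zero),
    Step(OCR_END, [0.7, 0.2, 1.4, 0.25]),
    Step("</item>", zero),
]

tokens, spans = decode_with_coordinates(steps, toy_coord_head)
for text, box in spans:
    print(text, box)
```

In this sketch the parsed token sequence is produced as usual, and boxes are computed only at trigger steps, so coordinate output stays an optional side channel of the text decoder.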