PDF-TREX: An Approach for Recognizing and Extracting Tables from PDF Documents


Bibliographic Details
Published in	2009 10th International Conference on Document Analysis and Recognition, pp. 906-910
Main Authors	Oro, E., Ruffolo, M.
Format	Conference Proceeding
Language	English
Published	IEEE 01.07.2009
Subjects	Councils; Data mining; Document Analysis; Encoding; Hierarchical Clustering; High performance computing; HTML; Humans; Information Extraction; Layout; Table Recognition and Extraction; Text analysis; Visualization; XML

Summary:	This paper presents PDF-TREX, a heuristic approach for table recognition and extraction from PDF documents. The heuristic starts from an initial set of basic content elements and aligns and groups them bottom-up, considering only their spatial features, in order to identify tabular arrangements of information. The aim of the approach is to recognize tables contained in PDF documents as a 2-dimensional grid on a Cartesian plane and to extract them as a set of cells equipped with 2-dimensional coordinates. Experiments, carried out on a dataset composed of tables contained in documents from different domains, show that the approach performs well in recognizing table cells. The approach aims at improving PDF document annotation and information extraction by providing an output that can be further processed for understanding table and document contents.
ISBN:	1424445000 9781424445004
ISSN:	1520-5363 2379-2140
DOI:	10.1109/ICDAR.2009.12
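
Illustration:	The summary describes grouping content elements bottom-up by spatial features alone and emitting cells with 2-dimensional coordinates. The Python sketch below illustrates that general idea only; it is not the PDF-TREX algorithm, whose heuristics are not given in this record. The `Element` type, the `gap` tolerance, the interval-merging strategy, and the top-left coordinate origin are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A basic content element with its bounding box (hypothetical type).

    Assumes a top-left origin, so y grows downward.
    """
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def cluster_intervals(intervals, gap=0.0):
    """Merge 1-D intervals that overlap or lie within `gap` of each other."""
    bands = []
    for lo, hi in sorted(intervals):
        if bands and lo <= bands[-1][1] + gap:
            bands[-1][1] = max(bands[-1][1], hi)
        else:
            bands.append([lo, hi])
    return [tuple(b) for b in bands]


def band_index(bands, lo, hi):
    """Return the index of the band containing the midpoint of [lo, hi]."""
    mid = (lo + hi) / 2
    for i, (b_lo, b_hi) in enumerate(bands):
        if b_lo <= mid <= b_hi:
            return i
    raise ValueError("interval does not fall inside any band")


def extract_cells(elements, gap=0.0):
    """Map each element to a (row, column) grid coordinate.

    Rows come from merging vertical extents, columns from merging
    horizontal extents; elements sharing a coordinate are joined.
    """
    rows = cluster_intervals([(e.y0, e.y1) for e in elements], gap)
    cols = cluster_intervals([(e.x0, e.x1) for e in elements], gap)
    cells = {}
    # Visit elements in reading order so joined text stays in order.
    for e in sorted(elements, key=lambda e: (e.y0, e.x0)):
        key = (band_index(rows, e.y0, e.y1), band_index(cols, e.x0, e.x1))
        cells.setdefault(key, []).append(e.text)
    return {key: " ".join(parts) for key, parts in cells.items()}


# Made-up sample data: a 2x2 table.
sample = [
    Element("Name", 10, 10, 40, 20),
    Element("Qty", 60, 10, 80, 20),
    Element("Apple", 10, 30, 45, 40),
    Element("3", 65, 30, 70, 40),
]
grid = extract_cells(sample)
```

Here `grid` maps (row, column) pairs to cell text. Merging 1-D projections is the simplest possible grouping rule and would fail on spanning cells or skewed layouts, which a real table recognizer has to handle.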