Zero-shot Key Information Extraction from Mixed-Style Tables: Pre-training on Wikipedia

Bibliographic Details
Published in: 2021 IEEE International Conference on Data Mining (ICDM), pp. 1451-1456
Main Authors: Yang, Qingping; Hu, Yingpeng; Cao, Rongyu; Li, Hongwei; Luo, Ping
Format: Conference Proceeding
Language: English
Published: IEEE, 01.12.2021

Summary: Tables, widely used in documents from various vertical domains, are a compact representation of data. There is strong demand to automatically extract key information from tables for further analysis. In addition, the set of keys whose values need to be extracted is usually time-varying, which raises the issue of zero-shot keys. To increase the efficiency of knowledge workers, in this study we aim to extract the values of a given set of keys from tables. Previous table-related studies mainly focus on relational, entity, and matrix tables. However, their methods fail on mixed-style tables, in which table headers may appear in any non-merged or merged cell and the spatial relationships between headers and their corresponding values are diverse. Here, we address this problem while taking mixed-style tables into account. To this end, we propose an end-to-end neural model called Information Extraction in Mixed-style Tables (IEMT). IEMT first uses BERT to extract the textual semantics of the given key and of the words in each cell. It then uses a multi-layer CNN to capture the spatial and textual interactions among adjacent cells. Furthermore, to improve accuracy on zero-shot keys, we pre-train IEMT on a dataset constructed from 0.4 million tables from Wikipedia and 140 million triplets from Ownthink. Experiments with a fine-tuning step on 26,869 financial tables show that the proposed model achieves 0.9323 accuracy on zero-shot keys, more than an 8% improvement over the model without pre-training.
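
The abstract describes the IEMT pipeline only at a high level. The following is a minimal PyTorch sketch of that idea, not the authors' released code: BERT encodes the query key and each cell's text, the per-cell embeddings are laid out on the table grid together with the key embedding, and a multi-layer 2D CNN scores every cell as a candidate value. All layer sizes, the grid layout, and the per-cell scoring head are assumptions for illustration; consult the paper for the actual model.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class IEMTSketch(nn.Module):
    """Hypothetical sketch of the abstract's BERT + multi-layer CNN design."""
    def __init__(self, hidden=768, channels=256, num_layers=3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        layers, in_ch = [], 2 * hidden  # cell embedding concatenated with key embedding
        for _ in range(num_layers):
            layers += [nn.Conv2d(in_ch, channels, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = channels
        self.cnn = nn.Sequential(*layers)
        self.scorer = nn.Conv2d(channels, 1, kernel_size=1)  # per-cell value score

    def encode(self, texts, tokenizer):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        return self.bert(**batch).last_hidden_state[:, 0]  # [CLS] embeddings

    def forward(self, cell_texts, key, tokenizer):
        rows, cols = len(cell_texts), len(cell_texts[0])
        flat = [t for row in cell_texts for t in row]          # row-major cell order
        cells = self.encode(flat, tokenizer).view(rows, cols, -1)       # (R, C, H)
        key_emb = self.encode([key], tokenizer).expand(rows, cols, -1)  # broadcast key
        grid = torch.cat([cells, key_emb], dim=-1).permute(2, 0, 1).unsqueeze(0)
        return self.scorer(self.cnn(grid)).squeeze()  # (R, C) score per cell

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = IEMTSketch()
table = [["Revenue", "1,200"], ["Net income", "340"]]  # toy mixed-style table
scores = model(table, "Net income", tokenizer)  # highest-scoring cell ~ extracted value

In this sketch the CNN's 3x3 receptive fields are what let a cell's score depend on its neighbors, which is the property the abstract appeals to for capturing diverse header-value spatial relationships.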
ISSN: 2374-8486
DOI: 10.1109/ICDM51629.2021.00187