A hybrid quantum approach to leveraging data from HTML tables

Bibliographic Details
Published in	Knowledge and Information Systems, Vol. 64, no. 2, pp. 441-474
Main Authors	Jiménez, Patricia; Roldán, Juan C.; Corchuelo, Rafael
Format	Journal Article
Language	English
Published	London: Springer London, 01.02.2022; Springer Nature B.V.
Subjects	Automation; Clustering; Computer Science; Data Mining and Knowledge Discovery; Database Management; HyperText Markup Language; Information Storage and Retrieval; Information Systems and Communication Service; Information Systems Applications (incl. Internet); IT in Business; Polynomials; Proposals; Quantum computers; Quantum computing; Regular Paper; Semantics; HTML tables; Data extraction
Summary:	The Web provides many data that are encoded using HTML tables. This facilitates rendering them, but obfuscates their structure and makes it difficult for automated business processes to leverage them. This has motivated many authors to work on proposals to extract them as automatically as possible. In this article, we present a new unsupervised proposal that uses a hybrid approach in which a standard computer is used to perform pre- and post-processing tasks and a quantum computer is used to perform the core task: guessing whether the cells have labels or values. The problem is addressed using a clustering approach that is known to be NP using standard computers, but our proposal can solve it in polynomial time, which implies a significant performance improvement. It is novel in that it relies on an entropy-preservation metaphor that has proven to work very well on two large collections of real-world tables from the Wikipedia and the Dresden Web Table Corpus. Our experiments prove that our proposal can beat the state-of-the-art proposal in terms of both effectiveness and efficiency; the key difference is that our proposal is totally unsupervised, whereas the state-of-the-art proposal is supervised.
ISSN:	0219-1377, 0219-3116
DOI:	10.1007/s10115-021-01636-7