OXPath: A language for scalable data extraction, automation, and crawling on the deep web
Published in | The VLDB journal Vol. 22; no. 1; pp. 47 - 72 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published | Berlin/Heidelberg, Springer-Verlag, 01.02.2013 |
Summary: | The evolution of the web has outpaced itself: A growing wealth of information and increasingly sophisticated interfaces necessitate automated processing, yet existing automation and data extraction technologies have been overwhelmed by this very growth. To address this trend, we identify four key requirements for web data extraction, automation, and (focused) web crawling: (1) interact with sophisticated web application interfaces, (2) precisely capture the relevant data to be extracted, (3) scale with the number of visited pages, and (4) readily embed into existing web technologies. We introduce OXPath as an extension of XPath for interacting with web applications and extracting data thus revealed, matching all the above requirements. OXPath’s page-at-a-time evaluation guarantees memory use independent of the number of visited pages, yet remains polynomial in time. We experimentally validate the theoretical complexity and demonstrate that OXPath’s resource consumption is dominated by page rendering in the underlying browser. With an extensive study of sublanguages and properties of OXPath, we pinpoint the effect of specific features on evaluation performance. Our experiments show that OXPath outperforms existing commercial and academic data extraction tools by a wide margin. |
---|---|
ISSN: | 1066-8888 0949-877X |
DOI: | 10.1007/s00778-012-0286-6 |