Detecting Informative Web Page Blocks for Efficient Information Extraction Using Visual Block Segmentation

Bibliographic Details
Published in: 2007 International Symposium on Information Technology Convergence (ISITC 2007), pp. 306-310
Main Authors: Jinbeom Kang, Joongmin Choi
Format: Conference Proceeding
Language: English
Published: IEEE, 01.11.2007
Summary: As the structure of Web pages grows more complicated, constructing wrapper induction rules becomes more difficult and time-consuming. The main problem in most wrapper induction methods is the difficulty of discriminating the meaningful blocks that contain the target information from the noise blocks that contain irrelevant information such as advertisements, menus, or copyright statements. To solve this problem, this paper proposes the RIPB (recognizing informative page blocks) algorithm, which detects the informative blocks in a Web page by exploiting a visual block segmentation scheme. RIPB uses a visual page segmentation algorithm to analyze and partition a Web page into a set of logical blocks, groups related blocks with similar structures into block clusters, and then recognizes the informative block clusters by applying heuristic rules to the cluster information. The results of a series of experiments indicate that RIPB helps improve the accuracy of information extraction by allowing the wrapper induction module to focus only on the informative blocks and ignore noise when building extraction rules.
ISBN: 0769530451, 9780769530451
DOI: 10.1109/ISITC.2007.6
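
The pipeline outlined in the summary (segment the page into blocks, cluster blocks by structure, then apply heuristics per cluster) can be illustrated with a minimal Python sketch. This is not the authors' RIPB implementation: the record does not give the segmentation procedure, the similarity measure, or the heuristic rules. The sketch assumes segmentation has already produced the blocks, simplifies "similar structures" to an exact match on a hypothetical tag signature, and uses two made-up heuristics (average text length and link-text ratio) with arbitrary thresholds. All names here (`Block`, `cluster_by_structure`, `is_informative`, `informative_blocks`) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    """A logical page block, assumed to come from a prior visual segmentation step."""
    tags: tuple      # hypothetical structural signature, e.g. nested tag names
    text: str        # visible text of the block
    link_chars: int  # number of text characters that sit inside hyperlinks


def cluster_by_structure(blocks):
    """Group blocks whose structural signatures match exactly (a simplification)."""
    clusters = defaultdict(list)
    for block in blocks:
        clusters[block.tags].append(block)
    return list(clusters.values())


def is_informative(cluster, min_chars=40, max_link_ratio=0.5):
    """Illustrative heuristic: enough text on average, and not dominated by links."""
    total = sum(len(b.text) for b in cluster)
    links = sum(b.link_chars for b in cluster)
    if total == 0:
        return False
    avg_chars = total / len(cluster)
    return avg_chars >= min_chars and links / total <= max_link_ratio


def informative_blocks(blocks):
    """Return the blocks that belong to clusters judged informative."""
    result = []
    for cluster in cluster_by_structure(blocks):
        if is_informative(cluster):
            result.extend(cluster)
    return result


# Toy input: two product-like blocks, a navigation menu, and a copyright line.
demo_page = [
    Block(("div", "h2", "p"), "Laptop X200: 14-inch display, 16 GB RAM, 512 GB SSD, $899", 0),
    Block(("div", "h2", "p"), "Laptop Z50: 15-inch display, 8 GB RAM, 256 GB SSD, $649", 0),
    Block(("ul", "li", "li"), "Home Products Support Contact", 29),
    Block(("p",), "Copyright 2007", 0),
]
kept = informative_blocks(demo_page)
```

On this toy input, only the two product-like blocks pass both heuristics; the menu is rejected for being short and entirely link text, and the copyright line for being short. A wrapper induction step would then build its extraction rules from `kept` alone.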