Method for extracting regular noise from single record web pages

Bibliographic Details
Main Authors YU ZHIHUA, CHENG XUEQI, WAN SHENGXIAN, LI HAIYAN, LIU YUE, GUO SHAOHUA, GUO YAN
Format Patent
Language Chinese, English
Published 24.04.2013
Summary: The invention provides a method for extracting regular noise from single record web pages. The method first converts multiple record web pages into document object model (DOM) trees and classifies the DOM trees by structure. It then aligns and merges the DOM trees of the same class to obtain site section style trees, locates the approximate positions of the text headline nodes and text main body nodes in those style trees, and finally extracts the regular noise before, within, and after the text according to those headline and main body nodes. The method reduces the space required to build the site section style trees, reduces the number of missed extractions, and speeds up extraction. The extraction results are also described as highly accurate and reliable.
Bibliography: Application Number: CN20121592795
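
The pipeline outlined in the summary can be illustrated with a minimal Python sketch. It is not the patented procedure, whose details are not given in this record. Several simplifications are assumptions made here for illustration: pages are classified by exact equality of their sets of tag paths, the "style tree" is flattened into a table of (tag path, text) counts, the headline is taken to be the first `h1` node, and text repeated in at least 80% of the pages of a class is treated as regular noise. All function names and the sample pages are hypothetical.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Records (tag_path, text) pairs in document order for one page."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.items = []   # (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def parse(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.items


def classify(pages):
    """Group pages whose sets of tag paths are identical (assumed criterion)."""
    groups = defaultdict(list)
    for items in pages:
        groups[frozenset(path for path, _ in items)].append(items)
    return list(groups.values())


def build_style_table(group):
    """Count in how many pages of the group each (path, text) pair occurs."""
    counts = Counter()
    for items in group:
        counts.update(set(items))
    return counts


def extract_noise(items, counts, n_pages, min_share=0.8):
    """Split repeated text into noise before, inside, and after the main text."""
    is_noise = [counts[item] / n_pages >= min_share for item in items]
    # Assumed headline locator: the first text node directly under an h1.
    title = next((i for i, (path, _) in enumerate(items)
                  if path.split("/")[-1] == "h1"), 0)
    # Assumed body locator: non-repeated text nodes after the headline.
    content = [i for i, noisy in enumerate(is_noise) if not noisy and i > title]
    body_end = content[-1] if content else title
    result = {"before": [], "inside": [], "after": []}
    for i, ((_, text), noisy) in enumerate(zip(items, is_noise)):
        if noisy:
            zone = "before" if i < title else "after" if i > body_end else "inside"
            result[zone].append(text)
    return result


def make_page(i):
    # Hypothetical single record pages that share one template.
    return (f"<html><body><div>Home | News</div><h1>Title {i}</h1>"
            f"<p>First paragraph {i}</p><p>Advertisement</p>"
            f"<p>Second paragraph {i}</p>"
            f"<div>Copyright Example Site</div></body></html>")


pages = [parse(make_page(i)) for i in range(3)]
group = classify(pages)[0]
counts = build_style_table(group)
noise = extract_noise(group[0], counts, len(group))
print(noise)
```

Storing one count table per class, instead of one tree per page, is a rough analogue of the space saving the summary mentions. A real implementation would presumably need tolerant tree alignment, because pages built from the same template rarely have identical path sets.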