Method for extracting regular noise from single record web pages
Main Authors | , , , , , , |
---|---|
Format | Patent |
Language | Chinese, English |
Published | 24.04.2013 |
Summary: | The invention provides a method for extracting regular noise from single record web pages. The method first converts multiple record web pages into document object model (DOM) trees and classifies the DOM trees by structure. It then aligns and merges the DOM trees of the same class to obtain site section style trees, and locates the approximate positions of the web page text headline nodes and the web page text main body nodes in those style trees. Finally, it extracts the regular noise before the text, within the text, and after the text, using the headline nodes and main body nodes as reference points. The method reduces the space required to build the site section style trees, reduces the number of missed extractions, and speeds up extraction. The summary also states that the extraction results are accurate and reliable. |
---|---|
Bibliography: | Application Number: CN20121592795 |
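
The pipeline in the summary (parse pages, group them by structure, merge each group, locate headline and body, then split the noise into before, within, and after the text) can be illustrated with a minimal Python sketch. It is not the patented method. The summary does not specify the alignment or node-location algorithms, so the sketch makes these assumptions: pages of one class share an identical sequence of text-bearing tag paths, "regular noise" is text that is identical at the same position on every page of a class, the headline is the first varying `h1`, and the body ends at the last varying text node. All names (`TextPathParser`, `regular_noise`, the sample pages) are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}

class TextPathParser(HTMLParser):
    """Collects (tag_path, text) pairs in document order."""
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.items = []   # (path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))

def parse(html):
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    return parser.items

def regular_noise(pages):
    """Group pages by structure, merge each group, and split constant text
    into noise before, inside, and after the main text (assumed heuristics)."""
    groups = defaultdict(list)
    for html in pages:
        items = parse(html)
        signature = tuple(path for path, _ in items)  # structural class
        groups[signature].append(items)

    results = {}
    for sig, docs in groups.items():
        if len(docs) < 2:
            continue  # one page cannot show which text repeats
        # Simplified "style tree": per position, is the text the same everywhere?
        constant = [len({doc[i][1] for doc in docs}) == 1 for i in range(len(sig))]
        varying = [i for i, same in enumerate(constant) if not same]
        if not varying:
            continue
        # Assumed location heuristics for headline and end of body.
        title_idx = next(
            (i for i in varying if sig[i].split("/")[-1] == "h1"), varying[0]
        )
        body_end = varying[-1]
        noise = {"before": [], "inside": [], "after": []}
        for i, same in enumerate(constant):
            if not same:
                continue
            text = docs[0][i][1]
            if i < title_idx:
                noise["before"].append(text)
            elif i > body_end:
                noise["after"].append(text)
            else:
                noise["inside"].append(text)
        results[sig] = noise
    return results

PAGE_TEMPLATE = (
    "<html><body><div>Home | News</div><h1>Title {x}</h1>"
    "<p>Body text {x}.</p><p>Share this article</p><p>More of {x}.</p>"
    "<div>Copyright Example</div></body></html>"
)
sample_pages = [PAGE_TEMPLATE.format(x="A"), PAGE_TEMPLATE.format(x="B")]
for noise in regular_noise(sample_pages).values():
    print(noise)
```

On the two sample pages, the navigation line falls before the headline, the "Share this article" line falls inside the body, and the copyright line falls after it. A real implementation would need tolerant tree alignment rather than exact signature matching, since pages in one site section rarely have identical structure.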