Method for extracting regular noise from single record web pages
Main Authors | WAN SHENGXIAN, LI HAIYAN, GUO YAN, LIU YUE, CHENG XUEQI, GUO SHAOHUA, YU ZHIHUA |
---|---|
Format | Patent |
Language | Chinese, English |
Published | 24.04.2013 |
Abstract | The invention provides a method for extracting regular noise from single record web pages. The method first converts multiple record web pages into document object model (DOM) trees and classifies the DOM trees by structure. It then aligns and merges the DOM trees of the same type to obtain site section style trees, locates the approximate positions of the text headline nodes and text main body nodes in those style trees, and finally extracts the regular noise before, within, and after the text according to the headline and main body nodes. The method reduces the space required to build the site section style trees, reduces missed extractions, and speeds up extraction. In addition, the extraction results are stated to be highly accurate and reliable. |
---|---|
Author | WAN SHENGXIAN LI HAIYAN GUO YAN LIU YUE CHENG XUEQI GUO SHAOHUA YU ZHIHUA |
ExternalDocumentID | CN103064966A |
Notes | Application Number: CN20121592795 |
OpenAccessLink | https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20130424&DB=EPODOC&CC=CN&NR=103064966A |
PublicationDate | 20130424 |
RelatedCompanies | INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES |
SubjectTerms | CALCULATING, COMPUTING, COUNTING; ELECTRIC DIGITAL DATA PROCESSING; PHYSICS |
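
The pipeline described in the abstract (parse pages, group them by structure, merge each group, then split recurring content by its position relative to the headline and main body) can be illustrated with a small sketch. This is not the patented method: the style tree is approximated here by counting identical (tag path, text) pairs across pages, and the headline and body paths are passed in by the caller rather than located automatically. All names (`text_nodes`, `group_by_structure`, `regular_noise`, `PAGES`) are hypothetical, and the sample pages are invented for illustration.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Collects (tag path, text) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def group_by_structure(pages):
    """Assumption: pages with the same set of tag paths share a template."""
    groups = defaultdict(list)
    for html in pages:
        items = text_nodes(html)
        groups[frozenset(path for path, _ in items)].append(items)
    return groups


def regular_noise(pages, title_path, body_path):
    """Return text that recurs on every page of a group, split by region."""
    noise = {"before": [], "inside": [], "after": []}
    for group in group_by_structure(pages).values():
        counts = Counter()
        for items in group:
            counts.update(set(items))
        ref = group[0]  # the first page supplies the node ordering
        title_at = [i for i, (p, _) in enumerate(ref) if p == title_path]
        body_at = [i for i, (p, _) in enumerate(ref) if p == body_path]
        if len(group) < 2 or not title_at or not body_at:
            continue
        for i, item in enumerate(ref):
            if counts[item] < len(group):
                continue  # varies between pages, so treated as content
            if i < title_at[0]:
                noise["before"].append(item[1])
            elif i > body_at[-1]:
                noise["after"].append(item[1])
            else:
                noise["inside"].append(item[1])
    return noise


TEMPLATE = ("<html><body><div>Home | News</div><h1>{title}</h1>"
            "<span>Source: Example Wire</span><p>{body}</p>"
            "<div>Copyright Example</div></body></html>")
PAGES = [TEMPLATE.format(title="Title A", body="First story."),
         TEMPLATE.format(title="Title B", body="Second story.")]

result = regular_noise(PAGES, "html/body/h1", "html/body/p")
print(result)
```

Requiring a node to recur on every page of a group is a deliberately strict choice for the sketch. A real system would more plausibly use a frequency threshold and a proper tree alignment, as the abstract's "aligning and integrating" step suggests.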