Method for extracting regular noise from single record web pages

Bibliographic Details
Main Authors YU ZHIHUA, CHENG XUEQI, WAN SHENGXIAN, LI HAIYAN, LIU YUE, GUO SHAOHUA, GUO YAN
Format Patent
Language Chinese, English
Published 24.04.2013
Abstract The invention provides a method for extracting regular noise from single-record web pages. The method first converts multiple single-record web pages into document object model (DOM) trees and classifies the DOM trees by structure. DOM trees of the same type are then aligned and merged to obtain site section style trees. The approximate positions of the web page text headline nodes and text main body nodes are located in the site section style trees, and the regular noise before, within, and after the text is finally extracted according to those headline and main body nodes. The method reduces the space needed to build the site section style trees, reduces missed extractions, and speeds up extraction. The extraction results are also highly accurate and reliable.
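The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented procedure: the abstract does not say how trees are aligned or how headline and body nodes are located, so the sketch makes simplifying assumptions. Pages are grouped by their set of tag paths, the "style tree" is reduced to a count of how many pages share each (path, text) pair, the headline is assumed to be the first `h1`, and the body is assumed to sit inside an `article` element. All function and parameter names (`parse`, `signature`, `extract_regular_noise`, `title_tag`, `body_tag`) are hypothetical.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Collect (tag path, text) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def parse(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.items


def signature(items):
    # Assumed structural class: the set of distinct tag paths on the page.
    return frozenset(path for path, _ in items)


def extract_regular_noise(html_pages, title_tag="h1", body_tag="article"):
    """Return noise texts found before, in, and after the main text."""
    groups = defaultdict(list)
    for items in map(parse, html_pages):
        groups[signature(items)].append(items)

    result = {"before": set(), "in": set(), "after": set()}
    for group in groups.values():
        if len(group) < 2:
            continue  # regularity cannot be observed on a single page
        # Simplified style tree: in how many pages does each item occur?
        counts = Counter()
        for items in group:
            counts.update(set(items))
        noise = {item for item, n in counts.items() if n == len(group)}

        for items in group:
            paths = [path.split("/") for path, _ in items]
            title_idx = next(
                (i for i, p in enumerate(paths) if p[-1] == title_tag), None
            )
            body_idx = [i for i, p in enumerate(paths) if body_tag in p]
            if title_idx is None or not body_idx:
                continue
            for i, item in enumerate(items):
                if item not in noise:
                    continue
                if i < title_idx:
                    result["before"].add(item[1])
                elif i > body_idx[-1]:
                    result["after"].add(item[1])
                else:
                    result["in"].add(item[1])
    return result
```

Text that repeats on every page of a structural group is treated as regular noise, and its position relative to the assumed headline and body nodes decides whether it is reported as occurring before, in, or after the text.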
Publication Number CN103064966A
Notes Application Number: CN20121592795
OpenAccessLink https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20130424&DB=EPODOC&CC=CN&NR=103064966A
Related Companies INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES
Subject Terms CALCULATING; COMPUTING; COUNTING; ELECTRIC DIGITAL DATA PROCESSING; PHYSICS