Method for extracting regular noise from single record web pages

Bibliographic Details
Main Authors	YU ZHIHUA, CHENG XUEQI, WAN SHENGXIAN, LI HAIYAN, LIU YUE, GUO SHAOHUA, GUO YAN
Format	Patent
Language	Chinese, English
Published	24.04.2013
Subjects	CALCULATING; COMPUTING; COUNTING; ELECTRIC DIGITAL DATA PROCESSING; PHYSICS

Abstract	The invention provides a method for extracting regular noise from single-record web pages. The method first converts multiple single-record web pages into document object model (DOM) trees and classifies the DOM trees by structure. It then aligns and merges the DOM trees of the same class to obtain site section style trees, locates the approximate positions of the headline node and the main body node of the page text in each style tree, and finally extracts the regular noise that appears before, within, and after the text, using the headline and main body nodes as reference points. The method reduces the space required to build the site section style trees, reduces missed extractions, and speeds up extraction. The extraction results are also reported to be highly accurate and reliable.
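
The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented algorithm, and every name in it (`DomBuilder`, `signature`, `text_nodes`, `extract_regular_noise`, `SAMPLE_PAGES`) is hypothetical. It assumes that pages belong to the same class only when their tag structure is identical, that a "style tree" can be approximated by counting the texts seen at each node path, that the headline is the first varying `h1` node, and that the main body is the varying node with the most text. Text that is identical on every page of a class is treated as regular noise and sorted by its position relative to the headline and body.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class DomBuilder(HTMLParser):
    """Builds a very small DOM tree (tags and direct text only)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void tags never receive children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore stray closing tags
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def signature(node):
    """Structural fingerprint: nested tag names, text ignored."""
    return (node.tag, tuple(signature(c) for c in node.children))


def text_nodes(node, path=()):
    """Yield (path, text) for each node that carries text."""
    if node.text:
        yield path, node.text
    for i, child in enumerate(node.children):
        yield from text_nodes(child, path + ((i, child.tag),))


def extract_regular_noise(pages):
    # Step 1: convert pages to DOM trees and classify them by structure.
    classes = defaultdict(list)
    for html in pages:
        builder = DomBuilder()
        builder.feed(html)
        classes[signature(builder.root)].append(builder.root)

    results = []
    for trees in classes.values():
        # Step 2: merge aligned trees; per path, count the texts observed.
        style = defaultdict(Counter)
        for tree in trees:
            for path, text in text_nodes(tree):
                style[path][text] += 1
        order = sorted(style)  # (index, tag) paths sort in document order
        varying = [p for p in order if len(style[p]) > 1]
        if not varying:
            continue  # a single page (or identical pages) gives no signal

        # Step 3: approximate headline and main body positions (assumptions).
        title = next((p for p in varying if p and p[-1][1] == "h1"), varying[0])
        body = max(varying,
                   key=lambda p: sum(len(t) * n for t, n in style[p].items()))
        title_pos, body_pos = order.index(title), order.index(body)

        # Step 4: text identical on every page is regular noise.
        noise = {"before": [], "inside": [], "after": []}
        for pos, path in enumerate(order):
            texts = style[path]
            if len(texts) == 1 and sum(texts.values()) == len(trees):
                zone = ("before" if pos < title_pos
                        else "after" if pos > body_pos else "inside")
                noise[zone].append(next(iter(texts)))
        results.append(noise)
    return results


SAMPLE_PAGES = [
    "<html><body><div>Home | News</div><h1>Title A</h1>"
    "<span>By Staff Reporter</span><p>Body text of article A, fairly long.</p>"
    "<p>Share this article</p><div>Copyright Example</div></body></html>",
    "<html><body><div>Home | News</div><h1>Title B</h1>"
    "<span>By Staff Reporter</span><p>Another body for article B, also long.</p>"
    "<p>Share this article</p><div>Copyright Example</div></body></html>",
]
print(extract_regular_noise(SAMPLE_PAGES))
```

Exact structural matching is the simplest possible classifier; real pages from one site section rarely share an identical tag structure, so a practical version would need a similarity measure and a tolerant alignment step in its place.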
Author	WAN SHENGXIAN LI HAIYAN GUO YAN LIU YUE CHENG XUEQI GUO SHAOHUA YU ZHIHUA
Author_xml	– fullname: YU ZHIHUA – fullname: CHENG XUEQI – fullname: WAN SHENGXIAN – fullname: LI HAIYAN – fullname: LIU YUE – fullname: GUO SHAOHUA – fullname: GUO YAN
ContentType	Patent
DBID	EVB
DatabaseName	esp@cenet
DatabaseTitleList
Database_xml	– sequence: 1 dbid: EVB name: esp@cenet url: http://worldwide.espacenet.com/singleLineSearch?locale=en_EP sourceTypes: Open Access Repository
DeliveryMethod	fulltext_linktorsrc
Discipline	Medicine Chemistry Sciences Physics
ExternalDocumentID	CN103064966A
GroupedDBID	EVB
ID	FETCH-epo_espacenet_CN103064966A3
IEDL.DBID	EVB
IngestDate	Fri Jul 19 12:10:39 EDT 2024
IsOpenAccess	true
IsPeerReviewed	false
IsScholarly	false
Language	Chinese English
LinkModel	DirectLink
MergedId	FETCHMERGED-epo_espacenet_CN103064966A3
Notes	Application Number: CN20121592795
OpenAccessLink	https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20130424&DB=EPODOC&CC=CN&NR=103064966A
ParticipantIDs	epo_espacenet_CN103064966A
PublicationCentury	2000
PublicationDate	20130424
PublicationDateYYYYMMDD	2013-04-24
PublicationDate_xml	– month: 04 year: 2013 text: 20130424 day: 24
PublicationDecade	2010
PublicationYear	2013
RelatedCompanies	INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES
RelatedCompanies_xml	– name: INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES
Score	2.9995768
SourceID	epo
SourceType	Open Access Repository
URI	https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20130424&DB=EPODOC&locale=&CC=CN&NR=103064966A