An efficient content extraction method for webpage based on tag-line-block analysis

Bibliographic Details
Published in Soft computing (Berlin, Germany) Vol. 27; no. 20; pp. 14631-14645
Main Authors Chen, Zeqiu; Zhou, Jianghui; Sun, Ruizhi
Format Journal Article
Language English
Published Berlin/Heidelberg Springer Berlin Heidelberg 01.10.2023
Springer Nature B.V
ISSN 1432-7643
EISSN 1433-7479
DOI 10.1007/s00500-023-09076-x


Abstract The World Wide Web is a vast information resource that can be used in a broad range of applications. Web content extraction is an efficient way to derive valuable information from webpages, and many efforts have been made on this subject. However, as webpage technology grows more complex, existing methods no longer meet the requirements of webpage content extraction well. This paper proposes an improved content extraction method based on Cx-Extractor that can handle different types of webpages. The proposed method makes several improvements: (1) Hyperlink tags are not removed directly, which avoids mistaking dense hyperlink groups for the main content. (2) The starting point of the main content is taken as the line number of the tag-line-block whose size exceeds the threshold, so the first few short texts of the main content are retained. (3) The tag-line-block threshold for the main content is calculated automatically instead of being set manually. These three changes improve the accuracy of the extracted content. Moreover, (4) blank spaces in the original webpage text are retained, which improves the readability of the extracted content by preventing English words from being run together. (5) Multimedia information (e.g., pictures and videos) can be selectively retained by users, which makes the method flexible enough for use in multiple industries. Experiments on real-world webpages show that the proposed method works well for both single-content and multi-content webpages. The method was also compared with Cx-Extractor, a Chinese extraction method, and Readability, an English extraction method. The proposed method outperforms both in precision, recall, and readability. In addition, it extracts content more efficiently than the Readability method.
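
The line-block idea summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the block width of three lines, the rule that sets the threshold to the mean size of the non-empty blocks, and the rule that the content ends at the first empty block are all assumptions made for illustration, and every function name is hypothetical. The sketch only shows how a start line can be chosen from block sizes so that a short leading line (such as a title) is kept, and how replacing tags with spaces keeps English words from being joined.

```python
import re
from statistics import mean

BLOCK_WIDTH = 3  # lines per block; an assumed value, not taken from the paper


def strip_tags(html: str) -> list[str]:
    """Remove script/style bodies and all tags, returning one text line per source line."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    # Replace each tag with a space so adjacent words are not joined together.
    text = re.sub(r"<[^>]*>", " ", html)
    return [re.sub(r"\s+", " ", line).strip() for line in text.splitlines()]


def block_sizes(lines: list[str], width: int = BLOCK_WIDTH) -> list[int]:
    """Size of block i is the character count of lines i .. i + width - 1."""
    return [sum(len(l) for l in lines[i:i + width]) for i in range(len(lines))]


def auto_threshold(sizes: list[int]) -> float:
    """Assumed rule: the mean size of the non-empty blocks."""
    nonzero = [s for s in sizes if s > 0]
    return mean(nonzero) if nonzero else 0.0


def extract_main_content(html: str) -> str:
    lines = strip_tags(html)
    sizes = block_sizes(lines)
    threshold = auto_threshold(sizes)
    # The content starts at the first line of the first block above the threshold,
    # so short lines at the head of that block are retained.
    start = next((i for i, s in enumerate(sizes) if s > threshold), None)
    if start is None:
        return ""
    # Assumed stopping rule: the content ends at the first empty block.
    end = start
    while end < len(lines) and sizes[end] > 0:
        end += 1
    return "\n".join(l for l in lines[start:end] if l)
```

On a page with a short navigation line, a heading, two long paragraphs, and a footer separated by blank lines, this sketch would be expected to return the heading and both paragraphs while skipping the navigation and footer text. Real pages need the hyperlink handling and multimedia options that the abstract describes, which are not modeled here.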
Author_xml – sequence: 1
  givenname: Zeqiu
  surname: Chen
  fullname: Chen, Zeqiu
  organization: College of Information and Electrical Engineering, China Agricultural University
– sequence: 2
  givenname: Jianghui
  surname: Zhou
  fullname: Zhou, Jianghui
  organization: JD Tech
– sequence: 3
  givenname: Ruizhi
  orcidid: 0000-0001-7267-5283
  surname: Sun
  fullname: Sun, Ruizhi
  email: sunruizhi@cau.edu.cn
  organization: College of Information and Electrical Engineering, China Agricultural University, Scientific Research Base for Integrated Technologies of Precision Agriculture (Animal Husbandry), The Ministry of Agriculture
ContentType Journal Article
Copyright The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Discipline Engineering
Computer Science
EISSN 1433-7479
EndPage 14645
ExternalDocumentID 10_1007_s00500_023_09076_x
GrantInformation_xml – fundername: National Key Research and Development Program of China
  grantid: 2021YFD1300101
IsPeerReviewed true
IsScholarly true
Issue 20
Keywords Tag semantic information
Automatic threshold setting
Tag-line-block distribution function
Web content extraction
Language English
PublicationDate 2023-10-01
PublicationPlace Berlin/Heidelberg
PublicationSubtitle A Fusion of Foundations, Methodologies and Applications
PublicationTitle Soft computing (Berlin, Germany)
PublicationTitleAbbrev Soft Comput
PublicationYear 2023
Publisher Springer Berlin Heidelberg
Springer Nature B.V
SubjectTerms Accuracy
Algorithms
Artificial Intelligence
Computational Intelligence
Control
Engineering
Extractors
Information resources
Information retrieval
Information sources
Internet
Mathematical Logic and Foundations
Mathematical Methods in Data Science
Mechatronics
Methods
Multimedia
Natural language processing
Neural networks
Noise
Ontology
Readability
Robotics