An efficient content extraction method for webpage based on tag-line-block analysis

Bibliographic Details
Published in	Soft computing (Berlin, Germany) Vol. 27; no. 20; pp. 14631 - 14645
Main Authors	Chen, Zeqiu; Zhou, Jianghui; Sun, Ruizhi
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg; Springer Nature B.V., 01.10.2023
Subjects	Accuracy; Algorithms; Artificial Intelligence; Computational Intelligence; Control; Engineering; Extractors; Information resources; Information retrieval; Information sources; Internet; Mathematical Logic and Foundations; Mathematical Methods in Data Science; Mechatronics; Methods; Multimedia; Natural language processing; Neural networks; Noise; Ontology; Readability; Robotics; Tag semantic information; Automatic threshold setting; Tag-line-block distribution function; Web content extraction
ISSN	1432-7643 1433-7479
DOI	10.1007/s00500-023-09076-x

Abstract	The World Wide Web is a vast information resource that can be used in a broad range of applications. Web content extraction is an efficient way to derive valuable information from webpages, and many efforts have been made on this subject. However, because webpage technology has grown more complex, existing methods no longer meet the requirements of webpage content extraction well. This paper proposes an improved content extraction method based on Cx-Extractor that can handle different types of webpages. The method makes several improvements: (1) Hyperlink tags are not removed directly, which avoids mistaking dense hyperlink groups for the main content. (2) The starting point of the main content is taken as the line number of the tag-line-block whose size exceeds the threshold, so the first few short texts of the main content are retained. (3) The tag-line-block threshold for the main content is calculated automatically instead of being set manually. These three changes improve the accuracy of the extracted content. Moreover, (4) blank spaces in the original webpage text are retained, which improves the readability of the extracted content by preventing English words from being run together. (5) Multimedia information (e.g., pictures and videos) can be selectively retained by users, allowing flexible use across multiple industries. Experiments on real-world webpages show that the proposed method works well for both single-content and multi-content webpages. The method was also compared with the Chinese extraction method Cx-Extractor and the English extraction method Readability; it outperforms both in precision, recall, and readability, and its extraction efficiency is superior to that of Readability.
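The abstract describes the method only at a high level, so the following is a minimal Python sketch of the general line-block idea, not the authors' implementation. It illustrates three of the listed points: inner spaces are kept when tags are stripped, the main content starts at the first line of the first block whose size exceeds the threshold (so a short leading title survives), and the threshold is derived from the page rather than set by hand. The block width of 3, the mean-of-nonzero-blocks threshold, and every function name are assumptions made for illustration; the special handling of hyperlink tags and the multimedia option are not modeled.

```python
import re
from statistics import mean

# Scripts, styles and comments carry no visible text (assumed preprocessing step).
DROP_RE = re.compile(r"(?is)<(script|style)\b.*?</\1>|<!--.*?-->")
TAG_RE = re.compile(r"<[^>]+>")


def _keep_newlines(match):
    # Replace markup with a space so adjacent words stay separated,
    # and keep its line breaks so line numbers are unchanged.
    return " " + "\n" * match.group(0).count("\n")


def visible_lines(html):
    """Return the visible text of each source line, with inner spaces preserved."""
    html = DROP_RE.sub(_keep_newlines, html)
    html = TAG_RE.sub(_keep_newlines, html)
    return [" ".join(line.split()) for line in html.split("\n")]


def block_sizes(lines, width=3):
    """Size of block i = non-space characters in lines i .. i+width-1."""
    lengths = [len(line.replace(" ", "")) for line in lines]
    return [sum(lengths[i:i + width]) for i in range(len(lengths))]


def auto_threshold(sizes):
    """Hypothetical automatic threshold: mean size of the non-empty blocks."""
    nonzero = [s for s in sizes if s > 0]
    return mean(nonzero) if nonzero else 0


def extract_main_content(html, width=3):
    lines = visible_lines(html)
    sizes = block_sizes(lines, width)
    threshold = auto_threshold(sizes)
    # Start at the first line of the first block above the threshold,
    # so short lines just before the long text are retained.
    start = next((i for i, s in enumerate(sizes) if s > threshold), None)
    if start is None:
        return ""
    end = start
    while end < len(sizes) and sizes[end] > 0:  # stop at the first empty block
        end += 1
    return "\n".join(line for line in lines[start:end] if line)
```

Calling `extract_main_content(page_source)` on a page whose article is separated from navigation links by a few blank lines should return the title and body paragraphs while skipping the isolated link lines; pages with denser boilerplate would need the hyperlink handling that this sketch leaves out.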
Author affiliations	1. Chen, Zeqiu: College of Information and Electrical Engineering, China Agricultural University; 2. Zhou, Jianghui: JD Tech; 3. Sun, Ruizhi (ORCID 0000-0001-7267-5283, sunruizhi@cau.edu.cn): College of Information and Electrical Engineering, China Agricultural University; Scientific Research Base for Integrated Technologies of Precision Agriculture (Animal Husbandry), The Ministry of Agriculture
ContentType	Journal Article
Copyright	The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Discipline	Engineering; Computer Science
EISSN	1433-7479
EndPage	14645
Funding	National Key Research and Development Program of China (grant 2021YFD1300101)
IsPeerReviewed	true
IsScholarly	true
Issue	20
Keywords	Tag semantic information; Automatic threshold setting; Tag-line-block distribution function; Web content extraction
Language	English
ORCID	0000-0001-7267-5283
PageCount	15
PublicationDateYYYYMMDD	2023-10-01
PublicationPlace	Berlin/Heidelberg
PublicationSubtitle	A Fusion of Foundations, Methodologies and Applications
PublicationTitle	Soft computing (Berlin, Germany)
PublicationTitleAbbrev	Soft Comput
PublicationYear	2023
Publisher	Springer Berlin Heidelberg; Springer Nature B.V
StartPage	14631
URI	https://link.springer.com/article/10.1007/s00500-023-09076-x https://www.proquest.com/docview/2918043724
Volume	27