An efficient content extraction method for webpage based on tag-line-block analysis

Bibliographic Details
Published in	Soft computing (Berlin, Germany) Vol. 27; no. 20; pp. 14631 - 14645
Main Authors	Chen, Zeqiu; Zhou, Jianghui; Sun, Ruizhi
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg, 01.10.2023; Springer Nature B.V.
Subjects	Accuracy; Algorithms; Artificial Intelligence; Computational Intelligence; Control; Engineering; Extractors; Information resources; Information retrieval; Information sources; Internet; Mathematical Logic and Foundations; Mathematical Methods in Data Science; Mechatronics; Methods; Multimedia; Natural language processing; Neural networks; Noise; Ontology; Readability; Robotics; Tag semantic information; Automatic threshold setting; Tag-line-block distribution function; Web content extraction
ISSN	1432-7643 1433-7479
DOI	10.1007/s00500-023-09076-x

Summary:	The World Wide Web is a vast information resource that can be used in a broad range of applications. Web content extraction is an efficient way to derive valuable information from webpages, and many efforts have been made on this subject. However, owing to the increasing complexity of webpage technology, existing methods do not adequately meet the requirements of webpage content extraction. This paper proposes an improved content extraction method based on Cx-Extractor that can handle different types of webpages. The proposed method makes several improvements: (1) Hyperlink tags are not removed directly, which avoids mistaking dense hyperlink groups for the main content. (2) The starting point of the main content is taken as the line number of the tag-line-block whose size exceeds the threshold, so the first few short texts of the main content are retained. (3) The tag-line-block threshold for the main content is calculated automatically instead of being set manually. These three changes improve the accuracy of the extracted content. Moreover, (4) the blank spaces in the original webpage text are retained, which improves the readability of the extracted content by preventing English words from being run together. (5) Multimedia information (e.g., pictures and videos) can be selectively retained by users, allowing flexible use across multiple industries. Experiments on real-world webpages show that the proposed method works well for both single-content and multi-content webpages. Its performance was also compared with the Chinese extraction method Cx-Extractor and the English extraction method Readability: the proposed method outperforms both in precision, recall, and readability, and its extraction efficiency is superior to that of Readability.
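
The line-block idea mentioned in the summary can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the window width of 3 lines, and especially the threshold rule (mean of the non-zero block sizes) are assumptions chosen for illustration, since this record does not state the paper's distribution function or threshold formula. The sketch strips tags line by line while keeping line positions, replaces each tag with a space so that adjacent words are not fused, sums text length over a sliding window of lines, and keeps the largest run of lines that starts at a block above the threshold.

```python
import re
from html import unescape


def html_to_lines(html):
    """Strip tags line by line, keeping one output line per input line."""
    # Scripts, styles, and comments carry no readable content.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    lines = []
    for raw in html.split("\n"):
        # Replace each tag with a space so neighbouring words stay separate.
        text = unescape(re.sub(r"<[^>]+>", " ", raw))
        lines.append(re.sub(r"\s+", " ", text).strip())
    return lines


def block_sizes(lines, width=3):
    """Size of block i = non-space characters in lines i .. i+width-1."""
    return [
        sum(len(line.replace(" ", "")) for line in lines[i:i + width])
        for i in range(len(lines))
    ]


def auto_threshold(sizes):
    """ASSUMED heuristic (not from the paper): mean of non-zero block sizes."""
    nonzero = [s for s in sizes if s > 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0


def extract_main_content(html, width=3):
    lines = html_to_lines(html)
    sizes = block_sizes(lines, width)
    threshold = auto_threshold(sizes)
    best, i, n = [], 0, len(lines)
    while i < n:
        if sizes[i] > threshold and lines[i]:
            # A candidate starts at the first non-empty line whose block
            # exceeds the threshold; a short title followed by long
            # paragraphs qualifies because the block spans `width` lines.
            end = i
            while end < n and sizes[end] > 0:
                end += 1
            segment = [line for line in lines[i:end] if line]
            if sum(map(len, segment)) > sum(map(len, best)):
                best = segment
            i = end
        else:
            i += 1
    return "\n".join(best)
```

Calling extract_main_content(page_html) on a page whose markup is spread over several lines returns the densest text region as newline-separated lines. The sketch relies on tag-only lines becoming blank lines, so it will not separate content from navigation in HTML that has been minified onto a single line.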