An efficient content extraction method for webpage based on tag-line-block analysis
Published in | Soft computing (Berlin, Germany) Vol. 27; no. 20; pp. 14631 - 14645 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | Berlin/Heidelberg; Springer Berlin Heidelberg; 01.10.2023; Springer Nature B.V |
ISSN | 1432-7643 1433-7479 |
DOI | 10.1007/s00500-023-09076-x |
Summary: | The World Wide Web is a vast information resource that can be used in a broad range of applications. Web content extraction is an efficient way to derive valuable information from webpages, and many efforts have been made on this subject. However, due to the increasing complexity of webpage technology, the existing methods do not fully meet the requirements for extracting content from webpages. This paper proposes an improved content extraction method for webpages based on Cx-Extractor, which is capable of handling different types of webpages. The proposed method makes several improvements: (1) The hyperlink tags are not removed directly, to avoid mistaking dense hyperlink groups for the main content. (2) The starting point of the main content is taken as the line number of the tag-line-block whose size exceeds the threshold, so that the first few short texts of the main content are retained. (3) The threshold value of the tag-line-block for the main content is calculated automatically instead of being set manually. These three changes improve the accuracy of the extracted content. Moreover, (4) the blank spaces in the original text of the webpage are retained, which increases the readability of the extracted content by preventing English words from being run together. (5) The multimedia information (e.g., pictures and videos) can be selectively retained by users, allowing for maximum flexibility and usage in multiple industries. Experiments conducted on real-world webpages show that the proposed content extraction method works well for both single-content and multi-content webpages. Furthermore, the performance of the proposed method was compared with the Chinese extraction method Cx-Extractor and the English extraction method Readability. The proposed method outperforms both in precision, recall, and readability. In addition, its extraction efficiency is superior to that of the Readability method. |
---|---|
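The line-block idea in the summary can be illustrated with a minimal Python sketch. It is not the paper's implementation: the window size `k=3`, the mean-based threshold, and all function names are assumptions made for illustration, since this record does not give the paper's threshold formula. The sketch covers only points (2) to (4) of the summary; the hyperlink handling in (1) and the multimedia option in (5) are omitted.

```python
import re
from html import unescape
from statistics import mean


def strip_tags(html: str) -> list[str]:
    """Remove markup but keep the page's line structure and inner spaces."""
    # Drop script/style elements entirely, then every remaining tag.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    text = unescape(re.sub(r"<[^>]+>", "", html))
    # Only leading/trailing whitespace is trimmed, so words stay separated.
    return [line.strip() for line in text.splitlines()]


def block_sizes(lines: list[str], k: int = 3) -> list[int]:
    """Size of the block starting at each line: total length of k lines."""
    return [sum(len(l) for l in lines[i:i + k]) for i in range(len(lines))]


def auto_threshold(sizes: list[int]) -> float:
    """Assumed placeholder: mean of the non-empty block sizes."""
    nonzero = [s for s in sizes if s > 0]
    return mean(nonzero) if nonzero else 0.0


def extract_main_content(html: str, k: int = 3) -> str:
    lines = strip_tags(html)
    sizes = block_sizes(lines, k)
    threshold = auto_threshold(sizes)
    # Start at the first block that reaches the threshold. A short opening
    # line is kept because its block also counts the long lines after it.
    start = next((i for i, s in enumerate(sizes) if s > 0 and s >= threshold), None)
    if start is None:
        return ""
    # Stop at the first block that contains no text at all.
    end = start
    while end < len(lines) and sizes[end] > 0:
        end += 1
    return "\n".join(l for l in lines[start:end] if l)


# Hypothetical demo page: navigation, a short lead, two long paragraphs, footer.
DEMO_PAGE = "\n".join([
    "<html><head><title>Demo</title></head><body>",
    "<div><a href='/'>Home</a></div>",
    "", "", "",
    "<p>Short lead.</p>",
    "<p>The main article text starts here and runs long enough to dominate the page.</p>",
    "<p>A second long paragraph keeps the text density high across several lines.</p>",
    "", "", "",
    "<p>Copyright</p>",
    "</body></html>",
])
```

Calling `extract_main_content(DEMO_PAGE)` returns the short lead and the two long paragraphs and leaves out the navigation link and the footer. The short lead survives because the block that starts on its line also counts the two long lines that follow it.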