Statistical rule-based webpage cluster text extraction method and system

Bibliographic Details
Main Authors YANG CHUN, ZHAN YIMING, JI LIPING, CHEN TONG, WANG RUISHUANG, LI XIAO
Format Patent
Language Chinese, English
Published 23.12.2022
Summary: The invention provides a statistical-rule-based method and system for extracting text from a cluster of webpages. The method comprises: obtaining a group of to-be-processed webpages in the form of a webpage cluster, yielding a webpage cluster list; traversing the webpage cluster list and extracting the original HTML code of each webpage to form an HTML code list; traversing the HTML code list, extracting all text content from each webpage, and splitting the long text of each webpage into a list of short text strings according to the HTML structure while preserving the text order, where each short text string list belongs to the text list set of the whole webpage cluster; traversing the text list set and locating a start position and an end position for each short text string list; and selecting the texts from the start position to the end position and outputting a text list. According to the method, webpage texts in different forms can be extracted without manu
Bibliography: Application Number: CN202211200790
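
The steps listed in the summary can be sketched roughly as follows. This is an illustrative sketch, not the patented implementation: the summary does not state which statistical rule locates the start and end positions, so the sketch assumes one plausible rule, namely that short texts recurring on most pages of the cluster (navigation, footers) are boilerplate, and that the text of a page runs from its first to its last non-recurring short text. The names `split_into_short_texts`, `extract_cluster_text`, and the `shared_ratio` parameter are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects non-empty text nodes in document order, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._skip_depth:
            self.chunks.append(text)


def split_into_short_texts(html):
    """Turn one page's HTML into an ordered list of short text strings."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


def extract_cluster_text(html_list, shared_ratio=0.6):
    """Return, for each page, the short texts between its start and end positions.

    ASSUMED rule (not taken from the patent): a short text found on at least
    `shared_ratio` of the pages in the cluster is treated as boilerplate.
    """
    # Text list set for the whole cluster, one ordered list per page.
    text_lists = [split_into_short_texts(html) for html in html_list]

    # Count on how many pages each short text occurs.
    page_counts = Counter(s for texts in text_lists for s in set(texts))
    threshold = shared_ratio * len(text_lists)

    results = []
    for texts in text_lists:
        unique_idx = [i for i, s in enumerate(texts) if page_counts[s] < threshold]
        if not unique_idx:
            results.append([])
            continue
        start, end = unique_idx[0], unique_idx[-1]
        # Everything from start to end is kept, including shared strings inside.
        results.append(texts[start:end + 1])
    return results
```

Calling `extract_cluster_text` on a list of HTML strings from the same site returns one text list per page, with the shared leading and trailing short texts trimmed. The 0.6 threshold is an arbitrary default chosen for illustration and would need tuning on real clusters.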