Statistical rule-based ethnic group webpage text extraction method and system
Main Authors | , , , , , |
---|---|
Format | Patent |
Language | Chinese, English |
Published | 23.12.2022 |
Summary: | The invention provides a method and system for extracting body text from a group of webpages based on statistical rules. The method comprises: obtaining a set of webpages to be processed, organized as a webpage group, to produce a webpage group list; traversing the webpage group list and extracting the original HTML code of each webpage to form an HTML code list; traversing the HTML code list, extracting all text content from each webpage, and splitting the long text of each webpage into a list of short text strings according to the HTML structure while preserving the text order, so that each short text string list belongs to a text list set covering the whole webpage group; traversing the text list set and locating a start position and an end position in each short text string list; and selecting the text from the start position to the end position and outputting a text list. According to the method, webpage text in different forms can be extracted without manu |
---|---|
Bibliography: | Application Number: CN202211200790 |
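
The steps listed in the summary can be sketched roughly as below. This is a minimal illustration and not the patented implementation: the record does not state which statistical rule locates the start and end positions, so the sketch assumes that short strings shared by more than half of the pages in the group are boilerplate (navigation, footers) and that the body text runs from the first to the last string that is not shared. The names `ShortTextCollector`, `to_short_texts`, `extract_group_text`, and `shared_ratio` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class ShortTextCollector(HTMLParser):
    """Collects non-empty text nodes in document order, skipping script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.texts.append(text)


def to_short_texts(html):
    """Split one page's text into an ordered list of short strings."""
    parser = ShortTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


def extract_group_text(html_list, shared_ratio=0.5):
    """Return one text list per page of the webpage group.

    ASSUMPTION (not from the patent record): a string occurring in more
    than `shared_ratio` of the pages is treated as shared boilerplate.
    """
    # Text list set for the whole webpage group.
    text_list_set = [to_short_texts(html) for html in html_list]

    # Count the number of pages in which each short string occurs.
    page_counts = Counter(s for texts in text_list_set for s in set(texts))
    threshold = shared_ratio * len(html_list)

    results = []
    for texts in text_list_set:
        # Indices of strings that are not shared across the group.
        own = [i for i, s in enumerate(texts) if page_counts[s] <= threshold]
        if not own:
            results.append([])
            continue
        start, end = own[0], own[-1]  # start and end positions
        results.append(texts[start:end + 1])
    return results
```

Calling `extract_group_text` on a list of HTML strings from pages that share one template returns, for each page, the strings between its first and last page-specific string. Under this assumed rule the whole group is needed to decide what counts as boilerplate, so a single page on its own yields its full text list.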