HTML Text Extraction Using Tag Path and Text Appearance Frequency

Bibliographic Details
Published in	한국정보통신학회논문지 Vol. 25; no. 12; pp. 1709 - 1715
Main Authors	김진환(Jin-Hwan Kim), 김은경(Eun-Gyung Kim)
Format	Journal Article
Language	Korean
Published	한국정보통신학회 2021
Subjects	Electronics/information and communication engineering; Text frequency analysis; Tag path analysis; Web crawling; Web scraping; Big data collection
ISSN	2234-4772 2288-4165
DOI	10.6109/jkiice.2021.25.12.1709

Summary:	To extract the needed text from a web page accurately, a web crawler can be told which tags and style attributes contain the main content. The drawback of this approach is that the extraction logic must be modified whenever the page configuration changes. A previous study addressed this problem by extracting the main text through analysis of text appearance frequency, but its performance varied widely depending on the collection channel of the web page. This paper therefore proposes a method that analyzes not only the appearance frequency of text but also the parent tag paths of text nodes extracted from the DOM tree of a web page, in order to extract the main text with high accuracy across a variety of collection channels.
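
The idea of grouping text nodes by their parent tag path can be illustrated with a minimal sketch. This is not the authors' algorithm, which the summary does not specify. It assumes a simple stand-in heuristic: record the tag path of every text node, total the characters found under each path, and keep the text under the path with the largest total. The names `TagPathCollector` and `extract_main_text` are hypothetical, and only the Python standard library is used.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags whose text is never body content, and void tags that have no end tag.
SKIP_TAGS = {"script", "style", "noscript"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects (parent tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # open tags from the root down to the current node
        self.nodes = []   # list of (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not any(t in SKIP_TAGS for t in self.stack):
            self.nodes.append(("/".join(self.stack), text))


def extract_main_text(html):
    """Return the text under the tag path that carries the most characters.

    Assumption: character count per path stands in for the paper's
    frequency analysis, which is not described in this record.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()

    chars_per_path = defaultdict(int)
    for path, text in collector.nodes:
        chars_per_path[path] += len(text)
    if not chars_per_path:
        return ""

    best_path = max(chars_per_path, key=chars_per_path.get)
    return "\n".join(text for path, text in collector.nodes if path == best_path)
```

Because the selection depends on the tag path rather than on a site-specific class or id, a sketch like this needs no per-site configuration, although a single-path heuristic would miss body text split across differently nested tags.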
Bibliography:	KISTI1.1003/JNL.JAKO202102661348107 http://jkiice.org