WCTT: Web Crawling System based on HTML Document Formalization (WCTT: HTML 문서 정형화 기반 웹 크롤링 시스템)

Bibliographic Details
Published in	Journal of the Korea Institute of Information and Communication Engineering (한국정보통신학회논문지), Vol. 26, no. 4, pp. 495-502
Main Authors	Jin-Hwan Kim (김진환), Eun-Gyung Kim (김은경)
Format	Journal Article
Language	Korean
Published	Korea Institute of Information and Communication Engineering (한국정보통신학회), 2022
Subjects	Electronic/Information and Communication Engineering; Web Crawling; HTML Document Formalization; Tag Path Analysis; Text Frequency Analysis; HTML
ISSN	2234-4772 2288-4165
DOI	10.6109/jkiice.2022.26.4.495

Abstract	Web crawlers, which are widely used today to collect text from the web, are difficult to maintain and extend because researchers must analyze the tags and styles of HTML documents and then implement separate collection logic for each collection channel. To solve this problem, a web crawler should be able to collect text by formalizing HTML documents with different structures into a single common structure. In this paper, we designed and implemented WCTT (Web Crawling system based on Tag path and Text appearance frequency), a web crawling system that formalizes HTML documents based on tag paths and text appearance frequency and collects body text with a single collection logic. Because WCTT collects text with the same logic for all collection channels, it is easy to maintain and to extend with new collection channels. It also provides a preprocessing function that removes stopwords and extracts only nouns, for uses such as keyword network analysis.
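The abstract does not spell out how WCTT scores tag paths or text frequency, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that "formalizing" a page can be approximated by grouping every text node under the tag path that leads to it and then picking the path that carries the most text. The names `TagPathCollector`, `extract_main_text`, `SKIP_TAGS`, and `VOID_TAGS` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption of this sketch: text inside these tags is never body content.
SKIP_TAGS = {"script", "style"}
# Void elements have no closing tag, so they are not pushed onto the path stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TagPathCollector(HTMLParser):
    """Groups every non-empty text node by the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []                         # currently open tags, outermost first
        self.texts_by_path = defaultdict(list)  # "html/body/div/p" -> [text, ...]

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            self.texts_by_path["/".join(self.stack)].append(text)


def extract_main_text(html):
    """Return (tag_path, joined_text) for the tag path carrying the most text."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    if not collector.texts_by_path:
        return "", ""
    # Stand-in score for "text appearance frequency": total characters per path.
    best_path = max(
        collector.texts_by_path,
        key=lambda p: sum(len(t) for t in collector.texts_by_path[p]),
    )
    return best_path, " ".join(collector.texts_by_path[best_path])


sample = """
<html><head><title>Site</title><style>p{color:red}</style></head>
<body>
<div class="nav"><a href="/">Home</a><a href="/news">News</a></div>
<div class="article"><p>First paragraph of the article body.</p>
<p>Second paragraph with more body text.</p></div>
<div class="footer"><span>Copyright</span></div>
</body></html>
"""
path, text = extract_main_text(sample)
print(path)
print(text)
```

Because the same function is applied to every page regardless of its layout, no per-channel selectors are needed, which is the maintenance benefit the abstract describes. A real system would likely need a more careful score than raw character counts, and the stopword removal and noun extraction step mentioned in the abstract is not covered here.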
KCI Citation Count	0
DEWEY	003.5
DatabaseName	DBPIA - 디비피아 Nurimedia DBPIA Journals KoreaScience Korean Citation Index
Discipline	Applied Sciences Mathematics
DocumentTitleAlternate	WCTT: Web Crawling System based on HTML Document Formalization
EISSN	2288-4165
EndPage	502
ExternalDocumentID	oai_kci_go_kr_ARTI_9973082 JAKO202212462666743 NODE11057220
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Issue	4
Notes	KISTI1.1003/JNL.JAKO202212462666743 http://jkiice.org
OpenAccessLink	http://click.ndsl.kr/servlet/LinkingDetailView?cn=JAKO202212462666743&dbt=JAKO&org_code=O481&site_code=SS1481&service_code=01
PageCount	8
PublicationTitleAlternate	Journal of the Korea Institute of Information and Communication Engineering
StartPage	495
URI	https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE11057220 http://click.ndsl.kr/servlet/LinkingDetailView?cn=JAKO202212462666743&dbt=JAKO&org_code=O481&site_code=SS1481&service_code=01 https://www.kci.go.kr/kciportal/ci/sereArticleSearch/ciSereArtiView.kci?sereArticleSearchBean.artiId=ART002839010
Volume	26