An Approach to Improving Quality of Crawlers Using Naïve Bayes for Classifier and Hyperlink Filter

Nowadays, most of search engines rely on keywords provided by users. However, keywords may not be sufficiently representative for the main topic of a web page. When searching for a topic, users input their desirable topic in terms of keywords. Keyword-based search engines will return pages that cont...

Full description

Saved in:
Bibliographic Details
Published inComputational Collective Intelligence. Technologies and Applications pp. 525 - 535
Main Authors Nguyen, Huu-Thien-Tan, Le, Duy-Khanh
Format Book Chapter
LanguageEnglish
Published Berlin, Heidelberg Springer Berlin Heidelberg
SeriesLecture Notes in Computer Science
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Nowadays, most of search engines rely on keywords provided by users. However, keywords may not be sufficiently representative for the main topic of a web page. When searching for a topic, users input their desirable topic in terms of keywords. Keyword-based search engines will return pages that contain the keywords even though these pages are not about the topic. This limits the efficiency of these engines as they may return undesirable result. In this paper, we present an approach to improve the quality of search engines by focusing on web pages related to specific topics. Our system includes three main components: a crawler for gathering web pages, a classifier for classifying web pages by topics, and a hyperlink filter (or distiller) for filtering hyperlinks. We propose Naïve Bayes algorithms for classifier and distiller to enhance the accuracy of the system. We also implement and examine the efficiency of our system by gathering web pages in two topics: Artificial Intelligence and Motorcycle. The result shows that our crawler achieves performance improvements in efficiency over the ones that search by keywords.
ISBN:9783642346293
3642346294
ISSN:0302-9743
1611-3349
DOI:10.1007/978-3-642-34630-9_54