An improved random forest approach for detection of hidden web search interfaces

Bibliographic Details
Published in 2008 International Conference on Machine Learning and Cybernetics Vol. 3; pp. 1586-1591
Main Authors Xiao-Bai Deng, Yun-Ming Ye, Hong-Bo Li, J.Z. Huang
Format Conference Proceeding
Language English
Published IEEE 01.07.2008
Summary: Search interface detection is an essential technique for extracting information from the hidden Web. The challenge for this task is search interface data that is represented in high dimensional and sparse features with many missing values. This paper presents a new multi-classifier ensemble approach to solving this problem. In this approach, we have extended the random forest algorithm with a weighted feature selection method to build individual classifiers. With this improved random forest algorithm (IRFA), each classifier can be learnt from a weighted subset of the feature space so that the ensemble of decision trees can fully exploit the useful features of search interface patterns. We have compared our ensemble approach with other well-known classification algorithms, such as SVM and C4.5. The experimental results have shown that our method is more effective in detecting search interfaces of the hidden Web.
ISBN: 1424420954, 9781424420957
ISSN: 2160-133X
DOI: 10.1109/ICMLC.2008.4620659