An Efficient Clustering Approach for Large Document Collections

Bibliographic Details
Published in	Advanced Data Mining and Applications, pp. 240-247
Main Authors	Han, Bo; Kang, Lishan; Song, Huazhu
Format	Book Chapter; Conference Proceeding
Language	English
Published	Berlin, Heidelberg: Springer Berlin Heidelberg, 2005
Series	Lecture Notes in Computer Science
Subjects	Applied sciences | Cluster Quality | Computer science; control theory; systems | Data processing. List processing. Character string processing | Document Cluster | Exact sciences and technology | Initial Cluster | Memory organisation. Data processing | Normalize Mutual Information | Software | Vocabulary Size | Cluster analysis | Electronic discussion group | Data analysis | Database query | Permutation | Semantics | Classification | Text | Data mining

More Information
Summary:	A vast amount of unstructured text data, such as scientific publications, commercial reports and webpages, must be quickly categorized into semantic groups to facilitate online information queries. However, state-of-the-art clustering methods suffer from the huge number of documents and their high-dimensional text features. In this paper, we propose an efficient clustering algorithm for large document collections, which performs clustering in three stages: 1) using a permutation test to identify informative topic words so as to reduce the feature dimension; 2) selecting a small number of the most typical documents to perform initial clustering; 3) refining the clustering on all documents. The algorithm was tested on the 20 Newsgroups data, and experimental results showed that, compared with methods that cluster the corpus directly on all document samples and full features, this approach significantly reduced the time cost, by an order of magnitude, while slightly improving the clustering quality.
ISBN:	354027894X 9783540278948
ISSN:	0302-9743 1611-3349
DOI:	10.1007/11527503_29
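
Illustrative sketch

The three stages named in the summary can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record does not say which statistic the permutation test uses, how "typical" documents are chosen, or which base clustering algorithm is applied. The sketch assumes (a) a permutation test on how unevenly a word's occurrences fall across documents, (b) typicality measured as the share of a document's tokens that are selected topic words, and (c) cosine k-means with farthest-first seeding. All function names and parameters (`informative_words`, `vectorize`, `kmeans`, `cluster_corpus`, `n_perm`, `alpha`, `n_typical`) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def informative_words(docs, n_perm=200, alpha=0.05, seed=0):
    """Stage 1 (assumed test): keep words whose occurrences are concentrated in
    few documents more strongly than when tokens are shuffled across documents."""
    rng = random.Random(seed)
    lengths = [len(d) for d in docs]
    tokens = [w for d in docs for w in d]

    def burstiness(stream):
        # Per word: sum over documents of the squared in-document count.
        stat, pos = Counter(), 0
        for n in lengths:
            for w, c in Counter(stream[pos:pos + n]).items():
                stat[w] += c * c
            pos += n
        return stat

    observed = burstiness(tokens)
    exceed = Counter()
    for _ in range(n_perm):
        rng.shuffle(tokens)  # permute tokens, keeping document lengths fixed
        null = burstiness(tokens)
        for w in observed:
            exceed[w] += null[w] >= observed[w]
    # One-sided permutation p-value with the usual +1 correction.
    return {w for w in observed if (exceed[w] + 1) / (n_perm + 1) <= alpha}


def vectorize(doc, vocab):
    """Unit-length count vector restricted to the selected vocabulary."""
    counts = Counter(w for w in doc if w in vocab)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {w: c / norm for w, c in counts.items()}


def cosine(u, v):
    return sum(x * v.get(w, 0.0) for w, x in u.items())


def assign(vectors, centroids):
    return [max(range(len(centroids)), key=lambda j: cosine(v, centroids[j]))
            for v in vectors]


def kmeans(vectors, k, iters=10, centroids=None):
    """Cosine k-means; seeds are chosen farthest-first unless supplied."""
    if centroids is None:
        centroids = [vectors[0]]
        while len(centroids) < k:
            centroids.append(
                min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    for _ in range(iters):
        labels = assign(vectors, centroids)
        updated = []
        for j in range(k):
            total = defaultdict(float)
            for v, lab in zip(vectors, labels):
                if lab == j:
                    for w, x in v.items():
                        total[w] += x
            norm = math.sqrt(sum(x * x for x in total.values()))
            # Keep the previous centroid if the cluster received no documents.
            updated.append({w: x / norm for w, x in total.items()} if norm
                           else centroids[j])
        centroids = updated
    return assign(vectors, centroids), centroids


def cluster_corpus(docs, k, n_typical):
    vocab = informative_words(docs)                       # stage 1
    vectors = [vectorize(d, vocab) for d in docs]

    def share(i):  # assumed typicality score: fraction of topic-word tokens
        return sum(w in vocab for w in docs[i]) / max(len(docs[i]), 1)

    typical = sorted(range(len(docs)), key=share, reverse=True)[:n_typical]
    _, centroids = kmeans([vectors[i] for i in typical], k)  # stage 2
    labels, _ = kmeans(vectors, k, iters=3, centroids=centroids)  # stage 3
    return vocab, labels


# Toy corpus (made up for illustration): two interleaved topics plus filler words.
space = ["rocket"] * 4 + ["orbit"] * 4 + ["the", "of", "and", "to"]
cooking = ["recipe"] * 4 + ["oven"] * 4 + ["the", "of", "and", "to"]
docs = [space if i % 2 == 0 else cooking for i in range(20)]
vocab, labels = cluster_corpus(docs, k=2, n_typical=6)
print(sorted(vocab), labels)
```

In this sketch the expensive clustering iterations run only on the `n_typical` selected documents, and the full collection sees just a few refinement passes over a reduced vocabulary, which is the source of the time saving the summary describes. On a real corpus such as 20 Newsgroups, the tokenization, the significance level, and the number of typical documents would all need tuning.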