Novel similarity measure for document clustering based on topic phrases
Published in | 2009 International Conference on Networking and Media Convergence pp. 92 - 96 |
---|---|
Main Authors | , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE 01.03.2009 |
ISBN | 9781424437764 1424437768 |
DOI | 10.1109/ICNM.2009.4907196 |
Summary: | Document clustering is a subfield of data clustering that organizes a large set of documents into groups of similar, related documents. In the traditional vector space model (VSM), researchers have treated each unique word occurring in the document set as a candidate feature. A recent trend instead treats the phrase as a more informative feature, which helps improve the accuracy and effectiveness of document clustering. This paper proposes a new approach for computing the similarity measure of the traditional VSM: the topic phrases of each document, rather than the traditional "word" terms, serve as the constituent terms of the VSM. The new approach is applied to the Buckshot method, which combines the Hierarchical Agglomerative Clustering (HAC) algorithm with the K-means partitioning algorithm. Such a mechanism may raise the effectiveness of the clustering, as reflected in higher values of the evaluation metrics. |
---|---|
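
The pipeline described in the summary (phrase-based VSM, similarity measure, Buckshot) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not say how topic phrases are extracted or weighted, so the sketch assumes each document already arrives as a list of topic phrases, uses raw phrase frequency as the weight, cosine similarity as the measure, and centroid-based merging for the HAC step. All function names (`phrase_vector`, `cosine`, `centroid`, `hac_seeds`, `kmeans`, `buckshot`) are hypothetical.

```python
import math
import random
from collections import Counter


def phrase_vector(phrases):
    """Frequency vector keyed by topic phrase rather than single word (assumed weighting)."""
    return Counter(phrases)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {t: w / len(vectors) for t, w in total.items()}


def hac_seeds(sample, k):
    """Agglomerate the sample until k clusters remain; return their centroids."""
    clusters = [[v] for v in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        # j > i, so removing cluster j does not shift index i.
        clusters[i].extend(clusters.pop(j))
    return [centroid(c) for c in clusters]


def kmeans(vectors, seeds, iterations=10):
    """K-means with cosine similarity, starting from the given seed centroids."""
    centers = list(seeds)
    labels = []
    for _ in range(iterations):
        labels = [
            max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
            for v in vectors
        ]
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
    return labels


def buckshot(docs, k, seed=0):
    """Buckshot: HAC on a random sample of about sqrt(k*n) documents seeds K-means."""
    vectors = [phrase_vector(d) for d in docs]
    rng = random.Random(seed)
    size = min(len(vectors), max(k, math.ceil(math.sqrt(k * len(vectors)))))
    sample = rng.sample(vectors, size)
    return kmeans(vectors, hac_seeds(sample, k))


# Toy input: each document is a list of (already extracted) topic phrases.
docs = [
    ["document clustering", "vector space model"],
    ["vector space model", "document clustering", "similarity measure"],
    ["similarity measure", "document clustering"],
    ["wireless network", "media convergence"],
    ["media convergence", "wireless network", "routing protocol"],
    ["routing protocol", "wireless network"],
]
labels = buckshot(docs, k=2)
```

Running HAC only on the sample keeps the quadratic merging step cheap, while K-means then assigns every document in roughly linear time per iteration. In this toy input the two groups share no phrases, so the first three documents should receive one label and the last three the other.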