Novel similarity measure for document clustering based on topic phrases

Bibliographic Details
Published in: 2009 International Conference on Networking and Media Convergence, pp. 92-96
Main Authors: ELdesoky, A.E., Saleh, M., Sakr, N.A.
Format: Conference Proceeding
Language: English
Published: IEEE, 01.03.2009
ISBN: 9781424437764, 1424437768
DOI: 10.1109/ICNM.2009.4907196

Summary: Document clustering is a subfield of data clustering that organizes a large set of documents into groups of similar and related documents. In the traditional vector space model (VSM), researchers have treated each unique word occurring in the document set as a candidate feature. A recent trend treats the phrase as a more informative feature, which helps improve the accuracy and effectiveness of document clustering. This paper proposes a new approach to computing the VSM similarity measure that uses the topic phrases of a document, rather than the traditional "word" terms, as the constituent terms of the VSM. The approach is applied to the Buckshot method, a hybrid of the Hierarchical Agglomerative Clustering (HAC) algorithm and the K-means partitioning algorithm. Such a mechanism may raise the effectiveness of clustering, as reflected in higher values of the evaluation metrics.
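
The idea in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the record does not say how topic phrases are extracted or weighted, so the sketch assumes each document already arrives as a list of phrases, uses raw phrase counts as weights, and uses centroid linkage for the HAC step. The function names (`cosine`, `centroid`, `hac`, `buckshot`) and the sample size of roughly sqrt(k * n) documents are illustrative choices for a Buckshot-style procedure.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse phrase-weight vectors."""
    dot = sum(w * b.get(p, 0.0) for p, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean vector of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)  # Counter.update adds weights phrase by phrase
    return {p: w / len(vectors) for p, w in total.items()}


def hac(vectors, k):
    """Agglomerative clustering (centroid linkage, an assumption) down to k clusters."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


def buckshot(docs, k, iterations=10, seed=0):
    """Buckshot-style clustering: HAC on a random sample seeds a K-means pass.

    docs is a list of phrase lists; phrase extraction is assumed to be done elsewhere.
    Returns one cluster label per document.
    """
    vectors = [Counter(phrases) for phrases in docs]  # phrase counts as VSM terms
    rng = random.Random(seed)
    sample_size = min(len(vectors), max(k, math.ceil(math.sqrt(k * len(vectors)))))
    sample = rng.sample(vectors, sample_size)
    centers = [centroid(c) for c in hac(sample, k)]

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to the most similar center, then recompute centers.
        labels = [
            max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
            for v in vectors
        ]
        for c in range(len(centers)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centers[c] = centroid(members)
    return labels
```

Calling `buckshot(docs, k=2)` on a list of phrase lists returns one integer label per document. Swapping `Counter(phrases)` for word counts gives the traditional word-based VSM for comparison.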