An EM Clustering Algorithm which Produces a Dual Representation
Published in | 2011 Tenth International Conference on Machine Learning and Applications Vol. 2; pp. 90 - 95 |
---|---|
Main Authors | , |
Format | Conference Proceeding |
Language | English |
Published | IEEE 01.12.2011 |
Subjects | |
ISBN | 9781457721342 1457721341 |
DOI | 10.1109/ICMLA.2011.29 |
Summary: | Clustering text documents is an important step in mining useful information on the Web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans, since it cannot explain the subject of each cluster. Utilizing semantic information such as an ontology can solve this problem, but it requires a well-defined database or a pre-labeled gold standard set. In this paper, we present a theme-based clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure that documents are assigned to the correct themes, and hence that the algorithm converges to an optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding while remaining efficient in terms of usual clustering performance. The experimental results show that the proposed method provides competitive performance compared to other state-of-the-art approaches. In addition, the extracted themes represent the topics of clusters well on the MEDLINE dataset. |
---|---|