Formal concept analysis for topic detection: A clustering quality experimental analysis

Bibliographic Details
Published in Information systems (Oxford) Vol. 66; pp. 24-42
Main Authors Castellanos, A., Cigarrán, J., García-Serrano, A.
Format Journal Article
Language English
Published Elsevier Ltd 01.06.2017

Summary:
• We propose a novel application of FCA-based methods to Topic Detection, overcoming traditional problems of clustering and classification techniques.
• We achieve state-of-the-art results for the topic detection task at RepLab 2013.
• We propose an evaluation framework to measure the quality of topic detection algorithms, including an external and an internal (quality-based) evaluation methodology.
• We conduct an extensive analysis of the performance of Hierarchical Agglomerative Clustering and Latent Dirichlet Allocation on the topic detection task, in comparison to FCA.
• We prove that the proposed FCA-based approach is better, in terms of clustering quality, than the other two.
The Topic Detection task focuses on discovering the main topics addressed by a series of documents (e.g., news reports, e-mails, tweets). Topics, defined in this way, are expected to be thematically similar, cohesive and self-contained. This task has been broadly studied from the point of view of clustering and probabilistic techniques. In this work, we propose applying Formal Concept Analysis (FCA), an exploratory technique for data analysis and organization, to this task. In particular, we extend the FCA-based methods for topic detection found in the literature by using the stability concept for topic selection. The hypothesis is that FCA will enable a better organization of the data, and stability a better selection of topics based on this organization, thus better fulfilling the task requirements by improving the quality and accuracy of the topic detection process. In addition, the proposed FCA-based methodology is able to cope with some well-known drawbacks of clustering and probabilistic methodologies, such as the need to set a predefined number of clusters or the difficulty of dealing with topics with complex generalization-specialization relationships.
To test this hypothesis, the FCA approach is compared to two other established techniques: Hierarchical Agglomerative Clustering (HAC) and Latent Dirichlet Allocation (LDA). To allow this comparison, the authors have implemented these approaches in a novel experimental framework. The quality of the topics detected by the different approaches, in terms of their suitability for the topic detection task, is evaluated by means of internal clustering validity metrics. This evaluation demonstrates that FCA generates cohesive clusters, which are less subject to changes in cluster granularity. Driven by the quality of the detected topics, FCA achieves the best overall outcome, improving on the experimental results for the Topic Detection Task at the 2013 RepLab Campaign.
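As a rough illustration of the two ideas the summary combines (formal concepts as candidate topics, and stability as a selection criterion), the following minimal Python sketch may help. It is not the authors' implementation: the toy document-term context and all function names are hypothetical, the enumeration is brute force and only suitable for tiny inputs, and it assumes the common intensional definition of stability (the fraction of subsets of a concept's documents that share exactly the concept's terms), which may differ from the variant used in the article.

```python
from itertools import combinations

# Hypothetical toy formal context: each document maps to the set of terms it contains.
CONTEXT = {
    "d1": {"bank", "loan"},
    "d2": {"bank", "loan", "rate"},
    "d3": {"bank", "rate"},
    "d4": {"football", "goal"},
}


def common_terms(docs, context):
    """Terms shared by every document in docs (all terms when docs is empty)."""
    shared = set().union(*context.values())
    for doc in docs:
        shared &= context[doc]
    return frozenset(shared)


def docs_having(terms, context):
    """Documents that contain every term in terms."""
    return frozenset(doc for doc, doc_terms in context.items() if terms <= doc_terms)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every subset of documents."""
    docs = sorted(context)
    concepts = set()
    for size in range(len(docs) + 1):
        for subset in combinations(docs, size):
            intent = common_terms(subset, context)
            extent = docs_having(intent, context)
            concepts.add((extent, intent))
    return concepts


def stability(extent, intent, context):
    """Fraction of subsets of the extent whose shared terms equal the intent."""
    members = sorted(extent)
    hits = sum(
        1
        for size in range(len(members) + 1)
        for subset in combinations(members, size)
        if common_terms(subset, context) == intent
    )
    return hits / 2 ** len(members)


if __name__ == "__main__":
    # Rank candidate topics (concepts) from most to least stable.
    ranked = sorted(
        formal_concepts(CONTEXT),
        key=lambda c: stability(c[0], c[1], CONTEXT),
        reverse=True,
    )
    for extent, intent in ranked:
        print(sorted(extent), sorted(intent), stability(extent, intent, CONTEXT))
```

In this sketch no number of clusters is fixed in advance, and the concept with terms {bank} generalizes the concepts with terms {bank, loan} and {bank, rate}, which mirrors the generalization-specialization relationships mentioned in the summary.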
ISSN: 0306-4379, 1873-6076
DOI: 10.1016/j.is.2017.01.008