Applying a dynamic threshold to improve cluster detection of LSI

Bibliographic Details
Published in: Science of Computer Programming, Vol. 76, no. 12, pp. 1261-1274
Main Authors: van der Spek, Pieter; Klusener, Steven
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.12.2011
Summary: Latent Semantic Indexing (LSI) is a standard approach for extracting and representing the meaning of words in a large set of documents. Recently it has been shown that it is also useful for identifying concerns in source code. The tree cutting strategy plays an important role in obtaining the clusters, which identify the concerns. In this contribution the authors compare two tree cutting strategies: the Dynamic Hybrid cut and the commonly used fixed height threshold. Two case studies have been performed on the source code of Philips Healthcare to compare the results using both approaches. While some of the settings are particular to the Philips case, the results show that applying a dynamic threshold, implemented by the Dynamic Hybrid cut, is an improvement over the fixed height threshold in the detection of clusters representing relevant concerns. This makes the approach as a whole more usable in practice.
► We examine two dendrogram cutting algorithms for Latent Semantic Indexing.
► We discuss the limitations of the most used cutting algorithm, the fixed height cut.
► We present an alternative, the Dynamic Hybrid cut, which cuts at flexible heights.
► We present the results from two case studies performed at Philips Healthcare.
► From these case studies we conclude that the Dynamic Hybrid cut performs better.
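
The limitation of a fixed height cut mentioned in the summary can be illustrated with a minimal sketch. This is not the authors' implementation and it does not implement the Dynamic Hybrid cut; it assumes single-linkage clustering on made-up one-dimensional points (the function names `single_linkage` and `fixed_height_cut` are hypothetical) and only shows why one global threshold may fail when clusters differ in density.

```python
from itertools import combinations


def single_linkage(a, b):
    """Smallest pairwise distance between two clusters of 1-D points."""
    return min(abs(x - y) for x in a for y in b)


def fixed_height_cut(points, height):
    """Agglomerative clustering that stops merging above a fixed height.

    For a monotone linkage such as single linkage, this gives the same
    clusters as building the full dendrogram and cutting it at `height`.
    """
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if single_linkage(clusters[i], clusters[j]) > height:
            break  # every remaining merge would lie above the cut
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(clusters)


# Illustrative data: two tight groups close together and one loose group.
points = [0, 1, 2, 5, 6, 7, 50, 60, 70]
low_cut = fixed_height_cut(points, 2)    # loose group falls apart
high_cut = fixed_height_cut(points, 12)  # tight groups are fused
```

With these assumed points, a low threshold keeps the two tight groups separate but splits the loose group into singletons, while a high threshold recovers the loose group but fuses the two tight ones. No single height yields all three groups, which is the kind of situation a cut at flexible heights is meant to handle.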
ISSN: 0167-6423, 1872-7964
DOI: 10.1016/j.scico.2010.12.004