A Cluster Based Feature Selection Method for Cross-Project Software Defect Prediction

Cross-project defect prediction (CPDP) uses the labeled data from external source software projects to compensate for the shortage of useful data in the target project, in order to build a meaningful classification model. However, the distribution gap between software features extracted from the sourc...

Bibliographic Details
Published in	Journal of computer science and technology Vol. 32; no. 6; pp. 1090 - 1107
Main Authors	Ni, Chao; Liu, Wang-Shu; Chen, Xiang; Gu, Qing; Chen, Dao-Xu; Huang, Qi-Guo
Format	Journal Article
Language	English
Published	New York: Springer US, 01.11.2017; Springer Nature B.V.; School of Computer Science and Technology, Nantong University, Nantong 226019, China; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Subjects	Artificial Intelligence; Classifiers; Clustering; Clusters; Computer Science; Data Structures and Information Theory; Defects; Feature extraction; Information Systems Applications (incl. Internet); Ranking; Regular Paper; Software; Software Engineering; Theory of Computation; feature clustering; density-based clustering; cross-project defect prediction; software defect prediction; feature selection

More Information
Summary:	Cross-project defect prediction (CPDP) uses the labeled data from external source software projects to compensate for the shortage of useful data in the target project, in order to build a meaningful classification model. However, the distribution gap between software features extracted from the source and the target projects may be too large to make the mixed data useful for training. In this paper, we propose a novel cluster-based method, FeSCH (Feature Selection Using Clusters of Hybrid-Data), to alleviate the distribution differences by feature selection. FeSCH includes two phases. The feature clustering phase clusters features using a density-based clustering method, and the feature selection phase selects features from each cluster using a ranking strategy. For CPDP, we design three different heuristic ranking strategies in the second phase. To investigate the prediction performance of FeSCH, we design experiments based on real-world software projects, and study the effects of design options in FeSCH (such as ranking strategy, feature selection ratio, and classifiers). The experimental results demonstrate the effectiveness of FeSCH. Firstly, compared with the state-of-the-art baseline methods, FeSCH achieves better performance and its performance is less affected by the classifiers used. Secondly, FeSCH enhances the performance by effectively selecting features across feature categories, and provides guidelines for selecting useful features for defect prediction.
Bibliography:	11-2296/TP; keywords: software defect prediction, cross-project defect prediction, feature selection, feature clustering, density-based clustering
ISSN:	1000-9000 1860-4749
DOI:	10.1007/s11390-017-1785-0
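
Illustrative sketch (not from the paper): the two phases described in the Summary can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The record does not name the specific density-based clustering algorithm or the three ranking heuristics, so a small DBSCAN-style grouping over a correlation-based feature distance stands in for the first phase, and absolute correlation with the source labels stands in for the ranking strategy. All function names, parameters (`eps`, `min_pts`, `ratio`), and the toy data are hypothetical.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cluster_features(columns, eps=0.3, min_pts=2):
    """Phase 1 (assumed form): DBSCAN-style clustering of feature columns.

    Distance between two features is 1 - |Pearson correlation|.
    Returns a list of clusters (lists of feature indices); features that
    are not density-reachable from any core feature become singletons.
    """
    n = len(columns)
    dist = [[1.0 - abs(pearson(columns[i], columns[j])) for j in range(n)]
            for i in range(n)]
    # Neighborhoods include the feature itself when its self-distance is ~0.
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    label = [None] * n
    clusters = []
    for i in range(n):
        if label[i] is not None or len(neighbors[i]) < min_pts:
            continue
        # i is an unvisited core feature: grow a new cluster from it.
        cid = len(clusters)
        members, stack = [], [i]
        while stack:
            p = stack.pop()
            if label[p] is not None:
                continue
            label[p] = cid
            members.append(p)
            if len(neighbors[p]) >= min_pts:  # only core features expand
                stack.extend(q for q in neighbors[p] if label[q] is None)
        clusters.append(sorted(members))
    # Remaining (noise) features are kept as singleton clusters.
    clusters.extend([i] for i in range(n) if label[i] is None)
    return clusters


def select_features(clusters, scores, ratio=0.5):
    """Phase 2 (assumed form): keep the top-scoring share of each cluster."""
    selected = []
    for members in clusters:
        k = max(1, math.ceil(ratio * len(members)))
        ranked = sorted(members, key=lambda f: scores[f], reverse=True)
        selected.extend(ranked[:k])
    return sorted(selected)


# Toy data (hypothetical): rows are modules, columns are software features.
source_X = [[1, 2, 5, 1], [2, 4, 3, 0], [3, 6, 4, 1], [4, 9, 1, 0]]
source_y = [0, 0, 1, 1]          # defect labels exist only for the source
target_X = [[5, 10, 2, 1], [6, 12, 6, 0]]

# Cluster features on the mixed ("hybrid") source + target data.
hybrid_X = source_X + target_X
columns = [list(col) for col in zip(*hybrid_X)]
clusters = cluster_features(columns)

# Stand-in ranking: absolute correlation of each feature with the source labels.
source_columns = [list(col) for col in zip(*source_X)]
scores = [abs(pearson(col, source_y)) for col in source_columns]

selected = select_features(clusters, scores, ratio=0.5)
print("clusters:", clusters)
print("selected feature indices:", selected)
```

Clustering on the combined source and target rows is one reading of "Hybrid-Data" in the method's name. Selecting from every cluster, rather than taking a global top-k, keeps at least one representative of each group of redundant features, which matches the Summary's description of selecting "from each cluster".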