Software defect number prediction: Unsupervised vs supervised methods

Bibliographic Details
Published in Information and Software Technology Vol. 106; pp. 161-181
Main Authors Chen, Xiang, Zhang, Dun, Zhao, Yingquan, Cui, Zhanqi, Ni, Chao
Format Journal Article
Language English
Published Elsevier B.V. 01.02.2019
ISSN 0950-5849, 1873-6025
DOI 10.1016/j.infsof.2018.10.003

Summary: Context: Software defect number prediction (SDNP) ranks program modules according to the prediction results, which helps optimize the allocation of testing resources. Objective: Previous studies have actively debated supervised versus unsupervised methods for just-in-time defect prediction and file-level defect prediction under effort-aware performance measures. However, this question has not been investigated for SDNP. To the best of our knowledge, we are the first to make a thorough comparison of these two types of methods for SDNP. Method: In our empirical studies, we consider 7 real open-source projects with 24 versions in total, use FPA and Kendall as our effort-aware performance measures, and consider three performance evaluation scenarios (i.e., the within-version, cross-version, and cross-project scenarios). Result: We first identify the two best-performing unsupervised methods, which simply rank modules in descending order of the LOC metric and the RFC metric, respectively. We then compare the unsupervised method based on the LOC metric (i.e., the LOC_D method) with 9 state-of-the-art supervised methods that incorporate SMOTEND to handle the class imbalance problem. The results show that LOC_D performs significantly better than, or the same as, these supervised methods. Next, motivated by a recent study by Agrawal and Menzies, we apply differential evolution (DE) to optimize the parameter values of SMOTEND used by these supervised methods and find that DE can effectively improve their performance for SDNP as well. Finally, we compare LOC_D with these DE-optimized supervised methods, and LOC_D still has a performance advantage, especially in the cross-version and cross-project scenarios.
Conclusion: Based on these results, we suggest that researchers use the unsupervised method LOC_D as a baseline when evaluating newly proposed methods for the SDNP problem.
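
Illustration: the summary describes LOC_D as ranking modules by LOC from large to small and scoring the ranking with FPA. The following is a minimal Python sketch of that idea, not the authors' implementation. The function names, the sample data, and the FPA formula used here (the commonly cited fault-percentile-average: the mean, over every cutoff m, of the fraction of all defects found in the top m ranked modules) are assumptions, since this record does not define them.

```python
from typing import List, Sequence


def loc_d_ranking(loc: Sequence[float]) -> List[int]:
    """Return module indices ordered by LOC, largest first (hypothetical helper)."""
    return sorted(range(len(loc)), key=lambda i: loc[i], reverse=True)


def fpa(ranking: Sequence[int], defects: Sequence[int]) -> float:
    """Fault-percentile-average of a ranking (assumed definition, see lead-in).

    For each cutoff m = 1..K, take the share of all defects contained in the
    top m ranked modules, then average those shares over the K cutoffs.
    """
    total = sum(defects)
    k = len(ranking)
    if k == 0 or total == 0:
        raise ValueError("need at least one module and at least one defect")
    cumulative = 0
    area = 0.0
    for idx in ranking:
        cumulative += defects[idx]   # defects found in the modules ranked so far
        area += cumulative / total   # share of all defects at this cutoff
    return area / k


# Made-up example data: four modules with their LOC and actual defect counts.
example_loc = [120, 40, 300, 75]
example_defects = [2, 0, 3, 1]
example_ranking = loc_d_ranking(example_loc)
example_score = fpa(example_ranking, example_defects)
```

Under this definition a higher FPA means that defects are concentrated near the top of the ranking. No training data is needed, which is what makes such a method unsupervised.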