DSSDPP: Data Selection and Sampling Based Domain Programming Predictor for Cross-Project Defect Prediction

Bibliographic Details
Published in: IEEE Transactions on Software Engineering, Vol. 49, no. 4, pp. 1941-1963
Main Authors: Li, Zhiqiang; Zhang, Hongyu; Jing, Xiao-Yuan; Xie, Juanying; Guo, Min; Ren, Jie
Format: Journal Article
Language: English
Published: New York, IEEE, 01.04.2023
IEEE Computer Society
Summary: Cross-project defect prediction (CPDP) refers to recognizing defective software modules in one project (i.e., the target) using historical data collected from other projects (i.e., the sources), which can help developers find defects and prioritize their testing efforts. Unfortunately, there is often a large distribution difference between the source and target data. Most CPDP methods neglect to select appropriate source data for a given target at the project level. More importantly, existing CPDP models are parametric methods, which usually require intensive parameter selection and tuning to achieve better prediction performance. This hinders the wide applicability of CPDP in practice. Moreover, most CPDP methods do not address the cross-project class imbalance problem. These limitations lead to suboptimal CPDP results. In this paper, we propose a novel data selection and sampling based domain programming predictor (DSSDPP) for CPDP, which addresses the above limitations. DSSDPP is a non-parametric CPDP method that can perform knowledge transfer across projects without the need for parameter selection and tuning. By exploiting the structures of the source and target data, DSSDPP can learn a discriminative transfer classifier for identifying defects in the target project. Extensive experiments on 22 projects from four datasets indicate that DSSDPP achieves better MCC and AUC results than a range of competing methods in both the single-source and multi-source scenarios. Since DSSDPP is easy, effective, extensible, and efficient, we suggest that future work use it with well-chosen source data to conduct CPDP, especially for projects with a limited computational budget.
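The summary reports results in terms of MCC (Matthews correlation coefficient), a metric often preferred for class-imbalanced defect data because it uses all four cells of the confusion matrix. The following is a minimal sketch of the standard MCC formula only; it is not the authors' evaluation code, and the function names `confusion_counts` and `mcc` as well as the toy labels are assumptions made for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = defective, 0 = clean)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient in [-1, 1].

    Returns 0.0 when the denominator is zero (a common convention,
    e.g. when a classifier predicts only one class).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy labels for modules of a target project.
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]
score = mcc(*confusion_counts(y_true, y_pred))
print(round(score, 4))
```

A value of 1 indicates perfect prediction, 0 is no better than chance, and -1 indicates total disagreement between predictions and true labels.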
ISSN: 0098-5589, 1939-3520
DOI: 10.1109/TSE.2022.3204589