A Hybrid Instance Selection Using Nearest-Neighbor for Cross-Project Defect Prediction
Software defect prediction (SDP) is an active research field in software engineering to identify defect-prone modules. Thanks to SDP, limited testing resources can be effectively allocated to defect-prone modules. Although SDP requires sufficient local data within a company, there are cases where lo...
Saved in:
Published in | Journal of computer science and technology Vol. 30; no. 5; pp. 969 - 980 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published |
New York
Springer US
01.09.2015
Springer Nature B.V School of Computing, Korea Advanced Institute of Science and Technology, Yuseong-gu, Daejeon 305-701, Korea |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Software defect prediction (SDP) is an active research field in software engineering to identify defect-prone modules. Thanks to SDP, limited testing resources can be effectively allocated to defect-prone modules. Although SDP requires sufficient local data within a company, there are cases where local data are not available, e.g., pilot projects. Companies without local data can employ cross-project defect prediction (CPDP) using external data to build classifiers. The major challenge of CPDP is different distributions between training and test data. To tackle this, instances of source data similar to target data are selected to build classifiers. Software datasets have a class imbalance problem meaning the ratio of defective class to clean class is far low. It usually lowers the performance of classifiers. We propose a Hybrid Instance Selection Using Nearest-Neighbor (HISNN) method that performs a hybrid classification selectively learning local knowledge (via k-nearest neighbor) and global knowledge (via na/ve Bayes). Instances having strong local knowledge are identified via nearest-neighbors with the same class label. Previous studies showed low PD (probability of detection) or high PF (probability of false alarm) which is impractical to overall performance as well as high PD and low PF. use. The experimental results show that HISNN produces high overall performance as well as high PD and low PF. |
---|---|
Bibliography: | software defect analysis; instance-based learning; nearest-neighbor algorithm; data cleaning 11-2296/TP Duksan Ryu, Jong-In Jang, and Jongmoon Baik( School of Computing, Korea Advanced Institute of Science and Technology, Yuseong-gu, Daejeon 305-V01, Korea) Software defect prediction (SDP) is an active research field in software engineering to identify defect-prone modules. Thanks to SDP, limited testing resources can be effectively allocated to defect-prone modules. Although SDP requires sufficient local data within a company, there are cases where local data are not available, e.g., pilot projects. Companies without local data can employ cross-project defect prediction (CPDP) using external data to build classifiers. The major challenge of CPDP is different distributions between training and test data. To tackle this, instances of source data similar to target data are selected to build classifiers. Software datasets have a class imbalance problem meaning the ratio of defective class to clean class is far low. It usually lowers the performance of classifiers. We propose a Hybrid Instance Selection Using Nearest-Neighbor (HISNN) method that performs a hybrid classification selectively learning local knowledge (via k-nearest neighbor) and global knowledge (via na/ve Bayes). Instances having strong local knowledge are identified via nearest-neighbors with the same class label. Previous studies showed low PD (probability of detection) or high PF (probability of false alarm) which is impractical to overall performance as well as high PD and low PF. use. The experimental results show that HISNN produces high overall performance as well as high PD and low PF. ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 content type line 23 |
ISSN: | 1000-9000 1860-4749 |
DOI: | 10.1007/s11390-015-1575-5 |