Research on Software Fault Detection Based on PU Learning (基于PU学习的软件故障检测研究)

Bibliographic Details
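Published in: 计算机应用研究 (Application Research of Computers), Vol. 32, no. 11, pp. 3324-3327
Main Authors: 张荷 (Zhang He), 李梅 (Li Mei), 张阳 (Zhang Yang), 蔡晓妍 (Cai Xiaoyan)
Format: Journal Article
Language: Chinese
Published: College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100; College of Mechanical and Electronic Engineering, Northwest A&F University, Yangling, Shaanxi 712100, 2015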
ISSN: 1001-3695
DOI: 10.3969/j.issn.1001-3695.2015.11.028

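Summary: Software fault data typically contain relatively few positive examples, and labeling large numbers of samples is difficult, yet the unlabeled samples hold a great deal of information useful for building a fault detection model. For this setting, the paper proposes a semi-supervised learning method that builds a classification model from positive and unlabeled data only, in order to detect faults during software development. First, the synthetic minority oversampling technique (SMOTE) is applied to oversample the positive examples and balance the class distribution of the dataset. On this basis, a positive set and an unlabeled set are constructed, and the POSC 4.5 and Bagging algorithms are used to build a decision-tree ensemble classifier for software faults. Comparative experiments on 12 datasets from the NASA MDP repository show that modeling with positive and unlabeled data alone achieves a software fault detection rate close to that of supervised learning methods, that the ensemble classifier achieves a higher detection rate than a single classifier, and that the size of the unlabeled sample set also affects the fault detection rate.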
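The pipeline outlined in the summary (oversample the positives, then bag classifiers trained on positive and unlabeled data) can be illustrated with a minimal sketch. This is not the authors' implementation. POSC 4.5 is not reproduced here: a one-level decision stump stands in for it as the base learner, and each bagging round treats a bootstrap sample of the unlabeled set as provisional negatives, which is an assumption of this sketch. All function names, parameters, and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=3, rng=None):
    """Return n_new synthetic samples, each interpolated between a minority
    sample and one of its k nearest minority neighbours (needs >= 2 samples)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: sq_dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def fit_stump(X, y):
    """Fit a one-level decision tree by exhaustive search over feature,
    threshold and direction, minimising the number of training errors.
    (Stand-in base learner; the paper uses POSC 4.5 instead.)"""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                errors = sum(
                    (1 if sign * (row[j] - t) > 0 else 0) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, sign)
    _, j, t, sign = best
    return lambda row: 1 if sign * (row[j] - t) > 0 else 0


def fit_pu_bagging(positives, unlabeled, n_rounds=15, rng=None):
    """Train n_rounds stumps, each on all positives plus a bootstrap sample
    of the unlabeled set treated as provisional negatives (an assumption).
    Returns a scorer giving the fraction of stumps that vote 'faulty'."""
    rng = rng or random.Random(0)
    models = []
    for _ in range(n_rounds):
        pseudo_neg = [rng.choice(unlabeled) for _ in range(len(positives))]
        X = positives + pseudo_neg
        y = [1] * len(positives) + [0] * len(pseudo_neg)
        models.append(fit_stump(X, y))
    return lambda row: sum(m(row) for m in models) / len(models)


# Toy data (hypothetical): two numeric metrics per software module.
positives = [[8.0, 9.0], [9.0, 8.0], [10.0, 10.0]]        # known faulty
unlabeled = [[1.0, 2.0], [2.0, 1.0], [1.0, 1.0], [2.0, 3.0], [3.0, 2.0],
             [2.0, 2.0], [1.0, 3.0], [3.0, 1.0], [3.0, 3.0],
             [9.0, 9.0]]                                   # one hidden fault

balanced_pos = positives + smote(positives, n_new=5, k=2, rng=random.Random(1))
score = fit_pu_bagging(balanced_pos, unlabeled, rng=random.Random(2))
print(score([11.0, 11.0]), score([0.5, 0.5]))
```

The scorer returns a vote fraction rather than a hard label, so the decision threshold stays adjustable. A real experiment would replace the stump with a proper decision-tree learner and the toy vectors with module metrics from a fault dataset.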
Bibliography: software fault prediction; PU learning; unbalanced data; decision tree; ensemble classifier
51-1196/TP
Zhang He, Li Mei, Zhang Yang, Cai Xiaoyan (a. College of Information Engineering; b. College of Mechanical & Electronic Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China)
Software fault datasets very often contain only a small set of labeled positive data, while most of the data is hard to label, even though it contains a great deal of useful information for building a prediction model for software fault detection. This paper proposes a semi-supervised classification model that predicts faults during the software development process using only positive and unlabeled data. The proposed method first uses SMOTE (synthetic minority oversampling technique) to balance the class distribution by oversampling the rare positive examples. It then partitions the resulting dataset into a positive subset and an unlabeled subset. Third, it uses the POSC 4.5 algorithm