DP-Share: Privacy-Preserving Software Defect Prediction Model Sharing Through Differential Privacy

Bibliographic Details
Published in	Journal of computer science and technology Vol. 34; no. 5; pp. 1020 - 1038
Main Authors	Chen, Xiang; Zhang, Dun; Cui, Zhan-Qi; Gu, Qing; Ju, Xiao-Lin
Format	Journal Article
Language	English
Published	New York: Springer US; Springer; Springer Nature B.V, 01.09.2019
Affiliations	State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; School of Information Science and Technology, Nantong University, Nantong 226019, China; School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore; Computer School, Beijing Information Science and Technology University, Beijing 100101, China
Subjects	Artificial Intelligence; Budgets; Computer Science; Data Structures and Information Theory; Datasets; Decision trees; Defects; Empirical analysis; Information Systems Applications (incl. Internet); Prediction models; Privacy; Regular Paper; Sampling; Software; Software Engineering; Theory of Computation; model sharing; empirical study; software defect prediction; differential privacy; cross project defect prediction
ISSN	1000-9000 1860-4749
DOI	10.1007/s11390-019-1958-0

Summary:	In current software defect prediction (SDP) research, most previous empirical studies use only datasets provided by the PROMISE repository, which may threaten the external validity of their results. Instead of SDP dataset sharing, SDP model sharing is a potential solution to alleviate this problem, and it can encourage researchers in the research community and practitioners in the industrial community to share more models. However, directly sharing models may result in privacy disclosure, for example through a model inversion attack. Since differential privacy (DP) mechanisms can prevent this attack when the privacy budget is carefully selected, we propose a novel method, DP-Share; to the best of our knowledge, we are the first to apply DP to privacy-preserving SDP model sharing. In particular, DP-Share first preprocesses the dataset, for example by over-sampling minority instances (i.e., defective modules) and discretizing continuous features to optimize privacy budget allocation. Then, it uses a novel sampling strategy to create a set of training sets. Finally, it constructs decision trees based on these training sets, and these decision trees form a random forest (i.e., the model). This last phase of DP-Share uses the Laplace and exponential mechanisms to satisfy the requirements of DP. In our empirical studies, we choose nine experimental subjects from real software projects. We use AUC (area under the ROC curve) as the performance measure and holdout as our model validation technique. After privacy and utility analysis, we find that DP-Share can achieve better performance than a baseline method, DF-Enhance, in most cases when using the same privacy budget. Moreover, we also provide guidelines for using our proposed method effectively.
Our work attempts to fill the research gap in differential privacy for SDP, which can encourage researchers and practitioners to share more SDP models and thereby advance the state of the art of SDP.
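
The summary names two standard DP building blocks, the Laplace mechanism and the exponential mechanism, without showing them. The following is a minimal Python sketch of the textbook forms of both, as they are commonly used when building a differentially private decision tree: Laplace noise on a leaf's class count, and an exponential-mechanism draw among candidate split features. It is not the authors' DP-Share implementation. The function names, the feature names, the scores, and the sensitivity values are all hypothetical and chosen only for illustration.

```python
import math
import random


def laplace_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Return true_count plus Laplace noise with scale sensitivity / epsilon.

    The difference of two independent exponential draws with the same
    rate is Laplace-distributed with location 0 and that scale.
    """
    scale = sensitivity / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


def exponential_choice(scores, epsilon, sensitivity=1.0, rng=random):
    """Pick a key of `scores` with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    names = list(scores)
    best = max(scores.values())
    # Subtracting the best score leaves the probabilities unchanged
    # and keeps the exponentials from overflowing.
    weights = [
        math.exp(epsilon * (scores[name] - best) / (2.0 * sensitivity))
        for name in names
    ]
    return rng.choices(names, weights=weights, k=1)[0]


# Hypothetical split-quality scores for three discretized features.
split_scores = {"loc": 0.42, "cyclomatic_complexity": 0.57, "num_authors": 0.18}

rng = random.Random(0)
chosen_feature = exponential_choice(split_scores, epsilon=0.5, rng=rng)
noisy_defective = laplace_count(true_count=37, epsilon=0.5, rng=rng)
print(chosen_feature, round(noisy_defective, 2))
```

A smaller `epsilon` (a tighter privacy budget) makes the noise larger and the split choice closer to uniform. This is the privacy and utility trade-off the summary refers to when it says the budget must be carefully selected.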