Semi-supervised multitask learning using convolutional autoencoder for faulty code detection with limited data

Bibliographic Details
Published in: Applied Intelligence (Dordrecht, Netherlands), Vol. 53, No. 4, pp. 3877-3888
Main Authors: Phan, Anh Viet; Nguyen, Khanh Duy Tung; Bui, Lam Thu
Format: Journal Article
Language: English
Published: New York: Springer US, 01.02.2023 (Springer Nature B.V.)

Summary: Detecting faults in source code so that they can be fixed is an important task in software quality assurance. Building automated detectors using machine learning faces two major challenges: data imbalance and data shortage. To address these issues, this paper proposes a deep neural network and training procedures that allow learning with limited annotated data. The network is composed of an unsupervised auto-encoder and a supervised classifier. The two components share their first layers, which act as a program feature extractor. Notably, a large amount of unlabeled data from various sources can be leveraged to train the auto-encoder independently before transferring it to the target domain. Additionally, sharing layers and jointly training the reconstruction and classification tasks stimulate the generation of sophisticated features. We conducted experiments on four real datasets with different amounts of labeled data and with additional unlabeled data. The results confirm that the multi-task model outperforms single-task ones and that leveraging unlabeled data is beneficial. Specifically, when the labeled data is reduced from 100% to 75%, 50%, and 25%, the performance of several deep networks drops sharply, while that of our model decreases gradually.
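
The architecture described in the summary lends itself to a compact sketch: a shared convolutional encoder feeds both a reconstruction decoder (unsupervised) and a fault classifier (supervised), and the two losses are minimized jointly. The following PyTorch snippet is a minimal, illustrative version of that idea only; the class name, layer sizes, token-embedding setup, and loss weight alpha are assumptions for illustration, not the paper's actual configuration.

# Minimal sketch of the semi-supervised multitask idea from the abstract.
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskFaultDetector(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Shared first layers: act as the program feature extractor.
        self.encoder = nn.Sequential(
            nn.Conv1d(embed_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, 64, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Unsupervised head: reconstructs the embedded token sequence.
        self.decoder = nn.Sequential(
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(128, embed_dim, kernel_size=3, padding=1),
        )
        # Supervised head: predicts faulty vs. clean code.
        self.classifier = nn.Sequential(
            nn.AdaptiveMaxPool1d(1),
            nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, tokens):
        x = self.embed(tokens).transpose(1, 2)  # (batch, embed_dim, seq_len)
        z = self.encoder(x)                     # shared features
        return self.decoder(z), self.classifier(z), x

model = MultiTaskFaultDetector()
recon_loss, cls_loss = nn.MSELoss(), nn.CrossEntropyLoss()
alpha = 0.5  # assumed weighting between the two tasks

# Joint training step on a labeled batch (dummy data). Unlabeled batches
# would update only the reconstruction term, and the encoder can first be
# pre-trained on unlabeled code from other sources and then transferred,
# as the summary describes.
tokens = torch.randint(0, 5000, (8, 200))  # dummy token ids
labels = torch.randint(0, 2, (8,))         # dummy fault labels
recon, logits, target = model(tokens)
loss = alpha * recon_loss(recon, target.detach()) + (1 - alpha) * cls_loss(logits, labels)
loss.backward()

Because the encoder parameters receive gradients from both heads, the reconstruction objective acts as a regularizer on the features used for classification, which is the mechanism the summary credits for the model's robustness when labeled data is scarce.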
ISSN: 0924-669X; 1573-7497
DOI: 10.1007/s10489-022-03663-5