Graph neural network-based long method and blob code smell detection

Bibliographic Details
Published in Science of computer programming Vol. 243; p. 103284
Main Authors Zhang, Minnan; Jia, Jingdong; Capretz, Luiz Fernando; Hou, Xin; Tan, Huobin
Format Journal Article
Language English
Published Elsevier B.V. 01.07.2025
Summary:
•We propose a graph neural network-based model for long method and blob code smell detection.
•The best strategies for the class imbalance of graph data and graph pooling are determined through experiments in our method.
•During model design for the abstract syntax tree of code, Euclidean space and non-Euclidean space are combined.
•The experiments show that our proposed method outperforms machine learning methods and deep learning methods.

The concept of code smell was first proposed in the late nineties to refer to signals that code may need refactoring. While not necessarily affecting functionality, code smell can hinder the understandability and future scalability of a program. As a result, the precise detection of code smell has become an important topic in coding research. However, current detection methods are limited by imbalanced and industrially irrelevant datasets, a lack of sufficient structural and logical information on the code, and simple model architectures. Given these limitations, this paper used an industry-relevant and sufficient dataset and developed a graph neural network to better detect code smell. First, we identified Long Method and Blob as our research subjects due to their frequent occurrence and their impact on the maintainability of software. We then designed modified fuzzy sampling with focal loss to address the issue of data imbalance. Second, to deal with the large volume of data, we proposed a global and local attention scoring mechanism to extract the key information from the code. Third, in order to design a graph neural network specifically for the abstract syntax tree of code, we combined Euclidean space and non-Euclidean space. Finally, we compared our method with other machine learning methods and deep learning methods. The results demonstrate that our method outperforms the other methods on Long Method and Blob, which indicates the effectiveness of our proposed method.
ISSN:0167-6423
DOI:10.1016/j.scico.2025.103284
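
Note on focal loss: the summary names focal loss as part of the authors' remedy for class imbalance, but this record does not describe their modified fuzzy sampling or how the loss is configured. The sketch below shows only the standard binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), to illustrate why it helps when smelly samples are rare. The function name and the defaults alpha=0.25 and gamma=2.0 are assumptions chosen for illustration and are not taken from the paper.

```python
import math


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Standard binary focal loss for a single example (illustrative sketch).

    p     : predicted probability of the positive class (e.g. "smelly")
    y     : true label, 1 for positive and 0 for negative
    alpha : weight for the positive class (assumed default, not from the paper)
    gamma : focusing parameter; larger values down-weight easy examples more
    """
    # p_t is the probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # Clamp to avoid log(0) for extremely confident wrong predictions.
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # A well-classified positive contributes far less than a misclassified one.
    print(binary_focal_loss(0.9, 1))
    print(binary_focal_loss(0.1, 1))
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is one way to see that the focusing term is what shifts training effort toward hard, often minority-class, examples.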