Improvements to code2vec: Generating path vectors using RNN


Bibliographic Details
Published in: Computers & Security, Vol. 132; p. 103322
Main Authors: Sun, Xuekai; Liu, Chunling; Dong, Weiyu; Liu, Tieming
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.09.2023
Summary: Source code analysis has many applications, such as code plagiarism detection and software vulnerability search. It can benefit from machine learning, but machine learning typically requires a standard vector representation and cannot be applied directly to source code. Source code must therefore be embedded into a vector representation while preserving as much of its semantics as possible. Code2vec proposes a code embedding method that converts source code into a code vector through its Abstract Syntax Tree (AST). However, we found that code2vec uses a hashing algorithm to generate the identifier for the path in each path context, which loses the node information in the path and also makes the number of trainable model parameters very large. We therefore present a new path representation that uses an RNN to generate vectors for paths. We also propose alternative model designs and evaluate their impact on the model experimentally. Results on a challenging source code classification task suggest that, compared to code2vec, the RNN-based path representation can produce a better embedding model with fewer training parameters.
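
The idea in the summary can be illustrated with a minimal sketch. It is not the authors' architecture: it assumes a plain Elman RNN with random, untrained weights, and the node-type vocabulary, the dimensions, and the names `encode_path`, `rnn_param_count`, and `path_table_param_count` are all hypothetical. It shows how a path can be encoded node by node, so node information is retained, and why the parameter count no longer grows with the number of distinct paths.

```python
import math
import random

# Hypothetical AST node-type vocabulary; a real model would derive it from a corpus.
NODE_TYPES = ["NameExpr", "MethodCallExpr", "BlockStmt", "ReturnStmt", "IfStmt"]
EMBED_DIM = 4   # size of each node-type embedding (assumed)
HIDDEN_DIM = 6  # size of the resulting path vector (assumed)

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def rand_matrix(rows, cols):
    """Return a rows x cols matrix of small random weights (untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


node_embedding = {t: [rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)]
                  for t in NODE_TYPES}
W_in = rand_matrix(HIDDEN_DIM, EMBED_DIM)    # input-to-hidden weights
W_rec = rand_matrix(HIDDEN_DIM, HIDDEN_DIM)  # hidden-to-hidden weights
bias = [0.0] * HIDDEN_DIM


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def encode_path(path):
    """Run the RNN over the node types of one AST path.

    The final hidden state serves as the path vector, so every node
    on the path contributes to the representation.
    """
    hidden = [0.0] * HIDDEN_DIM
    for node_type in path:
        from_input = matvec(W_in, node_embedding[node_type])
        from_state = matvec(W_rec, hidden)
        hidden = [math.tanh(i + s + b)
                  for i, s, b in zip(from_input, from_state, bias)]
    return hidden


def rnn_param_count():
    """Parameters of this sketch: independent of how many distinct paths exist."""
    return (len(NODE_TYPES) * EMBED_DIM      # node-type embeddings
            + HIDDEN_DIM * EMBED_DIM         # W_in
            + HIDDEN_DIM * HIDDEN_DIM        # W_rec
            + HIDDEN_DIM)                    # bias


def path_table_param_count(num_distinct_paths, dim):
    """Parameters of a lookup table with one embedding row per hashed path ID."""
    return num_distinct_paths * dim


path_a = ["NameExpr", "MethodCallExpr", "BlockStmt", "ReturnStmt"]
path_b = ["NameExpr", "MethodCallExpr", "BlockStmt", "IfStmt"]
vec_a = encode_path(path_a)
vec_b = encode_path(path_b)
```

Under these assumptions, `path_a` and `path_b` share most of their nodes and pass through the same weights, whereas hashing each whole path would map them to two unrelated identifiers, each needing its own embedding row. Calling `path_table_param_count` with a realistic number of distinct paths makes the contrast with `rnn_param_count()` concrete.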
ISSN: 0167-4048, 1872-6208
DOI: 10.1016/j.cose.2023.103322