Development of a Code Similarity Detection Model Using Abstract Syntax Tree and Graph Embedding

Bibliographic Details
Published in	Journal of the Korea Institute of Information and Communication Engineering, 29(1), pp. 55-60
Main Authors	오은총, 조수호, 채상미
Format	Journal Article
Language	Korean
Published	Korea Institute of Information and Communication Engineering 01.01.2025
Subjects	Electronics/Information and Communication Engineering
ISSN	2234-4772 2288-4165

More Information
Summary:	This study proposes a code similarity measurement model that considers both the structural and the sequential characteristics of code. To overcome the limitations of existing natural language processing-based approaches, it combines graph embedding of the Abstract Syntax Tree (AST) with token sequence embedding. The main components of the model are AST generation and graph conversion, graph embedding with node2vec, weighting that reflects node importance, AST token sequence embedding with Gensim word2vec, integration of the graph-based and sequence-based embeddings, and a final similarity score computed by cosine similarity. For performance evaluation, the mistral:7B model was used to generate four types of similar code, which were then tested. The proposed model yielded high similarity scores in the range of 0.8-0.9 across the code variants and lower scores in the range of 0.5-0.7 between different programs, demonstrating that it can effectively distinguish the two cases.
KCI Citation Count: 0
Bibliography:	http://jkiice.org
ISSN:	2234-4772 2288-4165