A Novel Neural Source Code Representation Based on Abstract Syntax Tree

Bibliographic Details
Published in Proceedings / International Conference on Software Engineering, pp. 783-794
Main Authors Zhang, Jian; Wang, Xu; Zhang, Hongyu; Sun, Hailong; Wang, Kaixuan; Liu, Xudong
Format Conference Proceeding
Language English
Published IEEE 01.05.2019
Abstract Exploiting machine learning techniques for analyzing programs has attracted much attention. One key problem is how to represent code fragments well for follow-up analysis. Traditional information retrieval based methods often treat programs as natural language texts, which could miss important semantic information of source code. Recently, state-of-the-art studies demonstrate that abstract syntax tree (AST) based neural models can better represent source code. However, the sizes of ASTs are usually large and the existing models are prone to the long-term dependency problem. In this paper, we propose a novel AST-based Neural Network (ASTNN) for source code representation. Unlike existing models that work on entire ASTs, ASTNN splits each large AST into a sequence of small statement trees, and encodes the statement trees to vectors by capturing the lexical and syntactical knowledge of statements. Based on the sequence of statement vectors, a bidirectional RNN model is used to leverage the naturalness of statements and finally produce the vector representation of a code fragment. We have applied our neural network based source code representation method to two common program comprehension tasks: source code classification and code clone detection. Experimental results on the two tasks indicate that our model is superior to state-of-the-art approaches.
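The splitting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Python source and the standard-library `ast` module purely for illustration, and the helper names `header_tokens` and `split_statements` are hypothetical. Each statement becomes its own small tree (shown here as a pre-order list of node-type names), and nested statements are split out into separate trees in source order. The encoding of each statement tree into a vector and the bidirectional RNN over the resulting sequence are not shown.

```python
import ast

SOURCE = """
def total(xs):
    s = 0
    for x in xs:
        s = s + x
    return s
"""


def header_tokens(node):
    """Pre-order node-type names of one statement, excluding nested statements."""
    tokens = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.stmt):
            continue  # nested statements become separate statement trees
        if isinstance(child, ast.expr_context):
            continue  # drop Load/Store markers to keep the sketch readable
        tokens.extend(header_tokens(child))
    return tokens


def split_statements(node):
    """Flatten an AST into a sequence of small statement trees, in source order."""
    sequence = []
    for child in ast.iter_child_nodes(node):
        if isinstance(child, ast.stmt):
            sequence.append(header_tokens(child))
        # Recurse so statements nested in bodies (or handlers) are also found.
        sequence.extend(split_statements(child))
    return sequence


sequence = split_statements(ast.parse(SOURCE))
for statement_tree in sequence:
    print(statement_tree)
```

In this sketch the `for` loop contributes only its header (loop target and iterable) as one statement tree, while the assignment in its body appears as the next element of the sequence, so no single tree grows with the size of the whole function.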
Author_xml – sequence: 1
  givenname: Jian
  surname: Zhang
  fullname: Zhang, Jian
  organization: Beihang University, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, China
– sequence: 2
  givenname: Xu
  surname: Wang
  fullname: Wang, Xu
  organization: Beihang University, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, China
– sequence: 3
  givenname: Hongyu
  surname: Zhang
  fullname: Zhang, Hongyu
  organization: The University of Newcastle, Australia
– sequence: 4
  givenname: Hailong
  surname: Sun
  fullname: Sun, Hailong
  organization: Beihang University, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, China
– sequence: 5
  givenname: Kaixuan
  surname: Wang
  fullname: Wang, Kaixuan
  organization: Beihang University, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, China
– sequence: 6
  givenname: Xudong
  surname: Liu
  fullname: Liu, Xudong
  organization: Beihang University, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, China
CODEN IEEPAD
DOI 10.1109/ICSE.2019.00086
Discipline Computer Science
EISBN 9781728108698
1728108691
EISSN 1558-1225
EndPage 794
ExternalDocumentID 8812062
Genre orig-research
IsPeerReviewed false
IsScholarly true
Language English
PageCount 12
PublicationTitleAbbrev ICSE
SubjectTerms Abstract Syntax Tree
Binary trees
Cloning
code classification
code clone detection
Natural languages
neural network
Neural networks
Semantics
source code representation
Syntactics
Task analysis
URI https://ieeexplore.ieee.org/document/8812062