A Novel Neural Source Code Representation Based on Abstract Syntax Tree
Exploiting machine learning techniques for analyzing programs has attracted much attention. One key problem is how to represent code fragments well for follow-up analysis. Traditional information retrieval based methods often treat programs as natural language texts, which could miss important seman...
Published in | Proceedings / International Conference on Software Engineering, pp. 783-794 |
---|---|
Main Authors | Zhang, Jian; Wang, Xu; Zhang, Hongyu; Sun, Hailong; Wang, Kaixuan; Liu, Xudong |
Format | Conference Proceeding |
Language | English |
Published | IEEE 01.05.2019 |
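Subjects | Abstract Syntax Tree; Binary trees; Cloning; code classification; code clone detection; Natural languages; neural network; Neural networks; Semantics; source code representation; Syntactics; Task analysis |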
Online Access | Get full text |
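
The abstract of this record describes splitting a large AST into a sequence of small statement trees before any encoding takes place. The following is a minimal sketch of that splitting step only, under stated assumptions: it uses Python's standard `ast` module on Python source rather than the authors' own tooling, the helper names `statement_trees` and `header_node_types` are hypothetical, and the neural statement encoder and bidirectional RNN mentioned in the abstract are left out entirely.

```python
import ast

# Fields of Python AST nodes that can hold nested statement lists (assumption:
# this covers the common compound statements; `match` cases are ignored).
BODY_FIELDS = ("body", "orelse", "finalbody", "handlers")
STATEMENT_TYPES = (ast.stmt, ast.excepthandler)


def statement_trees(node):
    """Yield statement-level nodes in preorder, descending into nested blocks."""
    for field in BODY_FIELDS:
        children = getattr(node, field, None)
        if not isinstance(children, list):
            continue
        for child in children:
            if isinstance(child, STATEMENT_TYPES):
                yield child
                yield from statement_trees(child)


def header_node_types(node):
    """Node type names of one statement tree, excluding nested statements."""
    names = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        if isinstance(child, STATEMENT_TYPES):
            continue  # nested statements become their own sequence entries
        names.extend(header_node_types(child))
    return names


source = """
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s += x
    return s
"""

tree = ast.parse(source)
sequence = [type(stmt).__name__ for stmt in statement_trees(tree)]
print(sequence)
```

In this sketch a compound statement such as `for` contributes only its header (loop target and iterable), while the statements in its body appear as separate entries later in the sequence. That keeps each tree small, which is the motivation the abstract gives for not working on the entire AST at once.
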
Abstract | Exploiting machine learning techniques for analyzing programs has attracted much attention. One key problem is how to represent code fragments well for follow-up analysis. Traditional information retrieval based methods often treat programs as natural language texts, which could miss important semantic information of source code. Recently, state-of-the-art studies demonstrate that abstract syntax tree (AST) based neural models can better represent source code. However, the sizes of ASTs are usually large and the existing models are prone to the long-term dependency problem. In this paper, we propose a novel AST-based Neural Network (ASTNN) for source code representation. Unlike existing models that work on entire ASTs, ASTNN splits each large AST into a sequence of small statement trees, and encodes the statement trees to vectors by capturing the lexical and syntactical knowledge of statements. Based on the sequence of statement vectors, a bidirectional RNN model is used to leverage the naturalness of statements and finally produce the vector representation of a code fragment. We have applied our neural network based source code representation method to two common program comprehension tasks: source code classification and code clone detection. Experimental results on the two tasks indicate that our model is superior to state-of-the-art approaches. |
---|---|
Author | Sun, Hailong; Wang, Xu; Wang, Kaixuan; Zhang, Jian; Liu, Xudong; Zhang, Hongyu |
Author_xml | – sequence: 1 givenname: Jian surname: Zhang fullname: Zhang, Jian organization: Beihang University, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, China – sequence: 2 givenname: Xu surname: Wang fullname: Wang, Xu organization: Beihang University, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, China – sequence: 3 givenname: Hongyu surname: Zhang fullname: Zhang, Hongyu organization: The University of Newcastle, Australia – sequence: 4 givenname: Hailong surname: Sun fullname: Sun, Hailong organization: Beihang University, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, China – sequence: 5 givenname: Kaixuan surname: Wang fullname: Wang, Kaixuan organization: Beihang University, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, China – sequence: 6 givenname: Xudong surname: Liu fullname: Liu, Xudong organization: Beihang University, China; Beijing Advanced Innovation Center for Big Data and Brain Computing, China |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DBID | 6IE 6IH CBEJK RIE RIO |
DOI | 10.1109/ICSE.2019.00086 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP) 1998-present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
EISBN | 9781728108698 1728108691 |
EISSN | 1558-1225 |
EndPage | 794 |
ExternalDocumentID | 8812062 |
Genre | orig-research |
GroupedDBID | -~X .4S .DC 123 23M 29O 5VS 6IE 6IF 6IH 6IK 6IL 6IM 6IN 8US AAJGR AAWTH ABLEC ADZIZ AFFNX ALMA_UNASSIGNED_HOLDINGS APO ARCSS AVWKF BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO EDO FEDTE I-F I07 IEGSK IJVOP IPLJI M43 OCL RIE RIL RIO RNS XOL |
ID | FETCH-LOGICAL-a247t-ca549d83c9e7d1fdda33fadee406fcdae0ef91253393938ee21502c49b9968553 |
IEDL.DBID | RIE |
IngestDate | Wed Aug 27 02:46:33 EDT 2025 |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-a247t-ca549d83c9e7d1fdda33fadee406fcdae0ef91253393938ee21502c49b9968553 |
PageCount | 12 |
ParticipantIDs | ieee_primary_8812062 |
PublicationCentury | 2000 |
PublicationDate | 2019-May |
PublicationDateYYYYMMDD | 2019-05-01 |
PublicationDate_xml | – month: 05 year: 2019 text: 2019-May |
PublicationDecade | 2010 |
PublicationTitle | Proceedings / International Conference on Software Engineering |
PublicationTitleAbbrev | ICSE |
PublicationYear | 2019 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssj0006499 |
Score | 2.6005974 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 783 |
SubjectTerms | Abstract Syntax Tree Binary trees Cloning code classification code clone detection Natural languages neural network Neural networks Semantics source code representation Syntactics Task analysis |
Title | A Novel Neural Source Code Representation Based on Abstract Syntax Tree |
URI | https://ieeexplore.ieee.org/document/8812062 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
link | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV3PT8IwFG6QkydUMP5ODx4djLXb2iMSEE0gRiDhRrr29SLZCBlG_et93QYa48Hs0mxJt7Rpv7637_seIbexVSqSgnuR0BigJCrylNLGC7ix2vJY8CKZM55Eozl_WoSLGrnba2EAoCCfQds1i3_5JtNblyrrCEQj3224Bxi4lVqt_a4b4dG9su7p-rLz2J8OHHHLuVH6Tij9o3ZKAR3DBhnvXloyRl7b2zxp689ffoz__aoj0voW6dHnPfwckxqkJ6Sxq9JAq0XbJA89OsneYEWdD4da0WmRrqf9zAB9KXiwlfwopfcIaYZio5e4DIjO6fQDn73T2QagRebDwaw_8qryCZ4KeJx7WmHsZwTTEmLTtcYoxqwyAIjhVhsFPliJ5xvGJF4CANHfDzSXCcZAIgzZKamnWQpnhIrAJsqGUhmluNUgYxaxKDba7wqL_Z-TphuX5bp0yFhWQ3Lx9-1LcuhmpqQNXpF6vtnCNUJ7ntwUc_oFK16lhA |
linkProvider | IEEE |
linkToHtml | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV3PT8IwGG0IHvSECsbf9uDRwbZ2W3tEAoICMQIJN9K1Xy-QzZBh1L_edhtojAezS7Ml3dKmff2-vfc-hG4jLUTIGXVCJk2AEovQEUIqx6dKS00jRvNkzmgc9mf0cR7MK-hup4UBgJx8Bk3bzP_lq1RubKqsxQwauXbD3TO4H3iFWmu374bm8F6a93gubw06k66lblk_StdKpX9UT8nBo1dDo-1rC87IsrnJ4qb8_OXI-N_vOkSNb5keft4B0BGqQHKMats6DbhctnX00Mbj9A1W2DpxiBWe5Al73EkV4JecCVsKkBJ8b0BNYdNoxzYHIjM8-TDP3vF0DdBAs1532uk7ZQEFR_g0yhwpTPSnGJEcIuVppQQhWigAg-JaKgEuaG5OOIRwczEAg_-uLymPTRTEgoCcoGqSJnCKMPN1LHTAhRKCagk8IiEJIyVdj2nT_xmq23FZvBYeGYtySM7_vn2D9vvT0XAxHIyfLtCBnaWCRHiJqtl6A1cG6LP4Op_fL_2sqM0 |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+%2F+International+Conference+on+Software+Engineering&rft.atitle=A+Novel+Neural+Source+Code+Representation+Based+on+Abstract+Syntax+Tree&rft.au=Zhang%2C+Jian&rft.au=Wang%2C+Xu&rft.au=Zhang%2C+Hongyu&rft.au=Sun%2C+Hailong&rft.date=2019-05-01&rft.pub=IEEE&rft.eissn=1558-1225&rft.spage=783&rft.epage=794&rft_id=info:doi/10.1109%2FICSE.2019.00086&rft.externalDocID=8812062 |