Semi-Supervised Code Translation Overcoming the Scarcity of Parallel Code Data
Published in | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] pp. 1545 - 1556 |
---|---|
Main Authors | Zhu, Ming; Karim, Mohimenul; Lourentzou, Ismini; Yao, Danfeng Daphne |
Format | Conference Proceeding |
Language | English |
Published | ACM, 27.10.2024 |
ISSN | 2643-1572 |
DOI | 10.1145/3691620.3695524 |
Abstract | Neural code translation is the task of converting source code from one programming language to another. One of the main challenges is the scarcity of parallel code data, which hinders the ability of translation models to learn accurate cross-language alignments. In this paper, we introduce MIRACLE, a semi-supervised approach that improves code translation through synthesizing high-quality parallel code data and curriculum learning on code data with ascending alignment levels. MIRACLE leverages static analysis and compilation to generate synthetic parallel code datasets with enhanced quality and alignment to address the challenge of data scarcity. We evaluate the proposed method along with strong baselines including instruction-tuned Large Language Models (LLMs) for code. Our analysis reveals that LLMs pre-trained on open-source code data, regardless of their size, suffer from the "shallow translation" problem. This issue arises when translated code copies keywords, statements, and even code blocks from the source language, leading to compilation and runtime errors. Extensive experiments demonstrate that our method significantly mitigates this issue, enhancing code translation performance across multiple models in C++, Java, Python, and C. Remarkably, MIRACLE outperforms code LLMs that are ten times larger in size. MIRACLE also achieves up to a 43% improvement in C code translation with fewer than 150 annotated examples. CCS Concepts: Computing methodologies → Machine learning; Software and its engineering. |
---|---|
Author | Zhu, Ming; Karim, Mohimenul; Lourentzou, Ismini; Yao, Danfeng Daphne |
Author_xml | 1. Zhu, Ming (mingzhu@vt.edu), Virginia Tech, Sanghani Center for AI and Data Analytics, Blacksburg, VA, USA; 2. Karim, Mohimenul (mohimenul@vt.edu), Virginia Tech, Sanghani Center for AI and Data Analytics, Blacksburg, VA, USA; 3. Lourentzou, Ismini (lourent2@illinois.edu), Virginia Tech, Sanghani Center for AI and Data Analytics, Blacksburg, VA, USA; 4. Yao, Danfeng Daphne (danfeng@vt.edu), Virginia Tech, Sanghani Center for AI and Data Analytics, Blacksburg, VA, USA |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
Discipline | Computer Science |
EISBN | 9798400712487 |
EISSN | 2643-1572 |
EndPage | 1556 |
ExternalDocumentID | 10764976 |
Genre | orig-research |
GrantInformation_xml | Office of Naval Research (funder ID 10.13039/100000006); National Science Foundation (funder ID 10.13039/100000001) |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
OpenAccessLink | https://doi.org/10.1145/3691620.3695524 |
PageCount | 12 |
PublicationCentury | 2000 |
PublicationDate | 2024-Oct.-27 |
PublicationDateYYYYMMDD | 2024-10-27 |
PublicationDecade | 2020 |
PublicationTitle | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] |
PublicationTitleAbbrev | ASE |
PublicationYear | 2024 |
Publisher | ACM |
StartPage | 1545 |
SubjectTerms | Accuracy; Codes; Computational modeling; Cross-Language Code Alignment; Curriculum Learning; Machine learning; Neural Code Translation; Python; Runtime; Semi-Supervised Learning; Software; Source coding; Static analysis; Training |
Title | Semi-Supervised Code Translation Overcoming the Scarcity of Parallel Code Data |
URI | https://ieeexplore.ieee.org/document/10764976 |
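
The abstract describes two ideas that a small example may make more concrete: the "shallow translation" problem, where output copies keywords or statements from the source language, and the use of compilation to screen synthetic parallel data. The following is a minimal Python sketch under stated assumptions, not MIRACLE's actual pipeline, which this record does not detail. It assumes a Java-to-Python direction, uses Python's built-in `compile()` as a stand-in for a compilation check, and relies on a hypothetical, hand-picked token list (`JAVA_ONLY_TOKENS`); all function names are illustrative.

```python
import re

# Hypothetical list of tokens that are keywords in Java but not in Python.
# Chosen for illustration only; a real filter would need a fuller list.
JAVA_ONLY_TOKENS = {"public", "static", "void", "boolean", "final"}


def compiles_as_python(code):
    """Return True if the candidate byte-compiles as Python source."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except (SyntaxError, ValueError):
        return False


def leaked_tokens(candidate, source_only_tokens):
    """Return source-language-only tokens that appear in the candidate.

    This is a crude word-level heuristic: it also matches tokens inside
    strings and comments, so it can over-report.
    """
    words = set(re.findall(r"[A-Za-z_]\w*", candidate))
    return words & set(source_only_tokens)


def keep_pair(candidate, source_only_tokens=JAVA_ONLY_TOKENS):
    """Keep a synthetic translation only if it compiles and shows no leakage."""
    return compiles_as_python(candidate) and not leaked_tokens(
        candidate, source_only_tokens
    )


good = "def add(a, b):\n    return a + b\n"
shallow = "public static int add(int a, int b) { return a + b; }"

print(keep_pair(good))
print(keep_pair(shallow))
print(sorted(leaked_tokens(shallow, JAVA_ONLY_TOKENS)))
```

Here `shallow` is Java code passed off as a Python translation: it fails the compilation check and is also flagged by the token heuristic. A filter of this kind only catches surface-level copying; it says nothing about whether a compiling candidate is semantically aligned with its source.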