Semi-Supervised Code Translation Overcoming the Scarcity of Parallel Code Data

Bibliographic Details
Published in IEEE/ACM International Conference on Automated Software Engineering : [proceedings] pp. 1545-1556
Main Authors Zhu, Ming; Karim, Mohimenul; Lourentzou, Ismini; Yao, Danfeng Daphne
Format Conference Proceeding
Language English
Published ACM 27.10.2024
ISSN 2643-1572
DOI 10.1145/3691620.3695524

Abstract Neural code translation is the task of converting source code from one programming language to another. One of the main challenges is the scarcity of parallel code data, which hinders the ability of translation models to learn accurate cross-language alignments. In this paper, we introduce MIRACLE, a semi-supervised approach that improves code translation through synthesizing high-quality parallel code data and curriculum learning on code data with ascending alignment levels. MIRACLE leverages static analysis and compilation to generate synthetic parallel code datasets with enhanced quality and alignment to address the challenge of data scarcity. We evaluate the proposed method along with strong baselines including instruction-tuned Large Language Models (LLMs) for code. Our analysis reveals that LLMs pre-trained on open-source code data, regardless of their size, suffer from the "shallow translation" problem. This issue arises when translated code copies keywords, statements, and even code blocks from the source language, leading to compilation and runtime errors. Extensive experiments demonstrate that our method significantly mitigates this issue, enhancing code translation performance across multiple models in C++, Java, Python, and C. Remarkably, MIRACLE outperforms code LLMs that are ten times larger in size. MIRACLE also achieves up to a 43% improvement in C code translation with fewer than 150 annotated examples.
CCS Concepts: Computing methodologies → Machine learning; Software and its engineering.
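Illustration (not taken from the paper): the "shallow translation" problem described in the abstract can be pictured with a small Python sketch. It assumes a Java-to-Python translation setting; the token list `JAVA_ONLY_TOKENS` and the helper names `leaked_source_tokens` and `compiles_as_python` are hypothetical and chosen only for this example. It is a rough heuristic, not the MIRACLE method or the authors' evaluation procedure.

```python
import re

# Hypothetical, hand-picked sample of Java keywords that have no meaning in Python.
JAVA_ONLY_TOKENS = frozenset(
    {"public", "private", "static", "void", "new", "final", "throws"}
)


def leaked_source_tokens(candidate: str) -> set:
    """Return Java-only keywords that appear as words in a candidate Python translation."""
    words = set(re.findall(r"[A-Za-z_]\w*", candidate))
    return words & JAVA_ONLY_TOKENS


def compiles_as_python(candidate: str) -> bool:
    """Check whether the candidate is at least syntactically valid Python."""
    try:
        compile(candidate, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


# A "shallow" output keeps Java keywords; a faithful output does not.
shallow = "public static int add(a, b):\n    return a + b\n"
faithful = "def add(a, b):\n    return a + b\n"

for name, code in [("shallow", shallow), ("faithful", faithful)]:
    print(name, sorted(leaked_source_tokens(code)), compiles_as_python(code))
```

The word-level match is deliberately crude: it would also flag these words inside strings or comments, so it only signals that a translation deserves a closer look.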
Author_xml – sequence: 1
  givenname: Ming
  surname: Zhu
  fullname: Zhu, Ming
  email: mingzhu@vt.edu
  organization: Virginia Tech, Sanghani Center for AI and Data Analytics, Blacksburg, VA, USA
– sequence: 2
  givenname: Mohimenul
  surname: Karim
  fullname: Karim, Mohimenul
  email: mohimenul@vt.edu
  organization: Virginia Tech, Sanghani Center for AI and Data Analytics, Blacksburg, VA, USA
– sequence: 3
  givenname: Ismini
  surname: Lourentzou
  fullname: Lourentzou, Ismini
  email: lourent2@illinois.edu
  organization: Virginia Tech, Sanghani Center for AI and Data Analytics, Blacksburg, VA, USA
– sequence: 4
  givenname: Danfeng Daphne
  surname: Yao
  fullname: Yao, Danfeng Daphne
  email: danfeng@vt.edu
  organization: Virginia Tech, Sanghani Center for AI and Data Analytics, Blacksburg, VA, USA
CODEN IEEPAD
ContentType Conference Proceeding
Discipline Computer Science
EISBN 9798400712487
EISSN 2643-1572
EndPage 1556
ExternalDocumentID 10764976
Genre orig-research
GrantInformation_xml – fundername: Office of Naval Research
  funderid: 10.13039/100000006
– fundername: National Science Foundation
  funderid: 10.13039/100000001
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
OpenAccessLink https://doi.org/10.1145/3691620.3695524
PageCount 12
ParticipantIDs ieee_primary_10764976
PublicationCentury 2000
PublicationDate 2024-Oct.-27
PublicationDateYYYYMMDD 2024-10-27
PublicationDate_xml – month: 10
  year: 2024
  text: 2024-Oct.-27
  day: 27
PublicationDecade 2020
PublicationTitle IEEE/ACM International Conference on Automated Software Engineering : [proceedings]
PublicationTitleAbbrev ASE
PublicationYear 2024
Publisher ACM
Publisher_xml – name: ACM
StartPage 1545
SubjectTerms Accuracy
Codes
Computational modeling
Cross-Language Code Alignment
Curriculum Learning
Machine learning
Neural Code Translation
Python
Runtime
Semi-Supervised Learning
Software
Source coding
Static analysis
Training
Title Semi-Supervised Code Translation Overcoming the Scarcity of Parallel Code Data
URI https://ieeexplore.ieee.org/document/10764976