Learning to mine aligned code and natural language pairs from Stack Overflow

Bibliographic Details
Published in 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), pp. 476-486
Main Authors Yin, Pengcheng; Deng, Bowen; Chen, Edgar; Vasilescu, Bogdan; Neubig, Graham
Format Conference Proceeding
Language English
Published New York, NY, USA ACM 28.05.2018
Series ACM Conferences
ISBN 9781450357166
1450357164
ISSN 2574-3864
DOI 10.1145/3196398.3196408

Abstract For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
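The pipeline outlined in the abstract (a title-plus-accepted-answer baseline, then a classifier over structural and correspondence features) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the `Candidate` fields, the three structural features, the token-overlap stand-in for the neural correspondence model, and the weights are all assumptions made so the example runs without training data.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate pair: a question title and one code snippet."""
    title: str
    snippet: str
    in_accepted_answer: bool
    is_full_block: bool  # snippet is a whole code block, not a fragment


def baseline_pairs(candidates):
    """Heuristic baseline: pair the title with full blocks from the accepted answer."""
    return [(c.title, c.snippet) for c in candidates
            if c.in_accepted_answer and c.is_full_block]


def structural_features(c):
    """Toy hand-crafted features (illustrative, not the paper's feature set)."""
    n_lines = c.snippet.count("\n") + 1
    return [
        1.0 if c.in_accepted_answer else 0.0,
        1.0 if c.is_full_block else 0.0,
        1.0 if n_lines == 1 else 0.0,
    ]


def correspondence_feature(c):
    """Stand-in for a learned NL-code correspondence score.

    The abstract describes a neural probabilistic model here; plain token
    overlap is swapped in only to keep the sketch self-contained.
    """
    nl = set(re.findall(r"[a-z]+", c.title.lower()))
    code = set(re.findall(r"[a-z]+", c.snippet.lower()))
    return len(nl & code) / max(len(nl), 1)


def score(c, weights, bias):
    """Logistic-style quality score in (0, 1) over all features."""
    feats = structural_features(c) + [correspondence_feature(c)]
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))


cands = [
    Candidate("reverse sort a list", "xs.sort(reverse=True)", False, False),
    Candidate("reverse sort a list", "import os", True, True),
]
weights, bias = [0.5, 0.5, 0.5, 4.0], -2.0  # made-up values, not learned
ranked = sorted(cands, key=lambda c: score(c, weights, bias), reverse=True)
```

In this toy setup the baseline keeps only the irrelevant `import os` block from the accepted answer, while the combined score ranks the relevant one-liner first, which mirrors the coverage and correctness limitation of the heuristic that the abstract points out.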
Author Affiliations
Yin, Pengcheng (pcyin@cs.cmu.edu), Carnegie Mellon University
Deng, Bowen (bdeng1@cs.cmu.edu), Carnegie Mellon University
Chen, Edgar (edgarc@cs.cmu.edu), Carnegie Mellon University
Vasilescu, Bogdan (bogdanv@cs.cmu.edu), Carnegie Mellon University
Neubig, Graham (gneubig@cs.cmu.edu), Carnegie Mellon University
CODEN IEEPAD
ContentType Conference Proceeding
Copyright 2018 ACM
Discipline Computer Science
Genre orig-research
License Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org
MeetingName ICSE '18: 40th International Conference on Software Engineering
Subject Terms Code Mining
Data mining
Feature extraction
Java
Natural languages
Neural Networks
Python
Stack Overflow
Training
URI https://ieeexplore.ieee.org/document/8595231