Learning to mine aligned code and natural language pairs from Stack Overflow
For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising sourc...
Published in | 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), pp. 476-486 |
---|---|
Main Authors | Yin, Pengcheng; Deng, Bowen; Chen, Edgar; Vasilescu, Bogdan; Neubig, Graham |
Format | Conference Proceeding |
Language | English |
Published | New York, NY, USA: ACM, 28.05.2018 |
Series | ACM Conferences |
Subjects | |
ISBN | 9781450357166 1450357164 |
ISSN | 2574-3864 |
DOI | 10.1145/3196398.3196408 |
Abstract | For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data. |
---|---|
Author | Yin, Pengcheng; Deng, Bowen; Chen, Edgar; Vasilescu, Bogdan; Neubig, Graham |
Author_xml | – sequence: 1 givenname: Pengcheng surname: Yin fullname: Yin, Pengcheng email: pcyin@cs.cmu.edu organization: Carnegie Mellon University – sequence: 2 givenname: Bowen surname: Deng fullname: Deng, Bowen email: bdeng1@cs.cmu.edu organization: Carnegie Mellon University – sequence: 3 givenname: Edgar surname: Chen fullname: Chen, Edgar email: edgarc@cs.cmu.edu organization: Carnegie Mellon University – sequence: 4 givenname: Bogdan surname: Vasilescu fullname: Vasilescu, Bogdan email: bogdanv@cs.cmu.edu organization: Carnegie Mellon University – sequence: 5 givenname: Graham surname: Neubig fullname: Neubig, Graham email: gneubig@cs.cmu.edu organization: Carnegie Mellon University |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
Copyright | 2018 ACM |
Copyright_xml | – notice: 2018 ACM |
DOI | 10.1145/3196398.3196408 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
Discipline | Computer Science |
EISBN | 9781450357166 1450357164 |
EISSN | 2574-3864 |
EndPage | 486 |
ExternalDocumentID | 8595231 |
Genre | orig-research |
ISBN | 9781450357166 1450357164 |
IngestDate | Wed Aug 27 02:59:18 EDT 2025 Fri Sep 13 11:04:49 EDT 2024 |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
License | Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org |
MeetingName | ICSE '18: 40th International Conference on Software Engineering |
PageCount | 11 |
PublicationCentury | 2000 |
PublicationDate | 2018-05-28 |
PublicationDateYYYYMMDD | 2018-05-28 |
PublicationDate_xml | – month: 05 year: 2018 text: 2018-05-28 day: 28 |
PublicationDecade | 2010 |
PublicationPlace | New York, NY, USA |
PublicationPlace_xml | – name: New York, NY, USA |
PublicationSeriesTitle | ACM Conferences |
PublicationTitle | 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR) |
PublicationTitleAbbrev | MSR |
PublicationYear | 2018 |
Publisher | ACM |
Publisher_xml | – name: ACM |
SourceID | ieee acm |
SourceType | Publisher |
StartPage | 476 |
SubjectTerms | Code Mining; Data mining; Feature extraction; Java; Natural languages; Neural Networks; Python; Stack Overflow; Training |
Title | Learning to mine aligned code and natural language pairs from Stack Overflow |
URI | https://ieeexplore.ieee.org/document/8595231 |
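
The abstract describes scoring candidate NL-code pairs with a classifier that combines hand-crafted structural features and a learned correspondence feature. The following is only a minimal sketch of that overall shape, not the authors' implementation: the three structural features, the toy training pairs, and all function names are hypothetical, and a simple token-overlap ratio stands in for the neural correspondence model, which this record does not specify.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def structural_features(snippet, block):
    """Hypothetical hand-crafted features relating a snippet to its code block."""
    lines = [ln for ln in snippet.splitlines() if ln.strip()]
    first = lines[0].lstrip() if lines else ""
    return [
        1.0 if snippet.strip() == block.strip() else 0.0,  # snippet is the whole block
        1.0 if len(lines) == 1 else 0.0,                    # snippet is a single line
        1.0 if first.startswith(("import ", "from ")) else 0.0,  # starts with an import
    ]


def correspondence_feature(intent, snippet):
    """Stand-in (assumption) for a learned NL-code model: share of intent tokens in the code."""
    intent_tokens = set(tokenize(intent))
    if not intent_tokens:
        return 0.0
    return len(intent_tokens & set(tokenize(snippet))) / len(intent_tokens)


def featurize(intent, snippet, block):
    """Concatenate structural features with the correspondence feature."""
    return structural_features(snippet, block) + [correspondence_feature(intent, snippet)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=500, lr=0.5):
    """Fit a logistic-regression classifier with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def score(model, intent, snippet, block):
    """Estimated probability that (intent, snippet) is a good NL-code pair."""
    weights, bias = model
    x = featurize(intent, snippet, block)
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Toy labeled pairs (invented for illustration): (intent, snippet, enclosing block, label).
SORT_BLOCK = "items = [3, 1, 2]\nitems.sort(reverse=True)"
JSON_BLOCK = "import json\ndata = json.load(open(path))"
TRAIN = [
    ("sort a list in reverse order", "items.sort(reverse=True)", SORT_BLOCK, 1),
    ("sort a list in reverse order", "items = [3, 1, 2]", SORT_BLOCK, 0),
    ("read a json file", "data = json.load(open(path))", JSON_BLOCK, 1),
    ("read a json file", "import json", JSON_BLOCK, 0),
]
rows = [featurize(intent, snippet, block) for intent, snippet, block, _ in TRAIN]
labels = [label for *_, label in TRAIN]
model = train_logistic(rows, labels)
```

Calling `score(model, intent, snippet, block)` on each candidate snippet of an unseen answer and keeping the highest-scoring candidates mirrors the mining step the abstract outlines; a real system would replace the overlap ratio with a trained correspondence model and use far more labeled examples.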