Learning to mine aligned code and natural language pairs from Stack Overflow

Bibliographic Details
Published in	2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR) pp. 476 - 486
Main Authors	Yin, Pengcheng; Deng, Bowen; Chen, Edgar; Vasilescu, Bogdan; Neubig, Graham
Format	Conference Proceeding
Language	English
Published	New York, NY, USA ACM 28.05.2018
Series	ACM Conferences
Subjects	Code Mining; Data mining; Feature extraction; Java; Natural languages; Neural Networks; Python; Stack Overflow; Training
ISBN	9781450357166 1450357164
ISSN	2574-3864
DOI	10.1145/3196398.3196408

Abstract	For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source to create such a data set: the questions are diverse and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and in the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
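The pipeline the abstract describes (structural features plus a correspondence feature, both fed to a classifier that scores candidate NL-code pairs) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the three structural features are hypothetical, a simple token-overlap score stands in for the neural correspondence model, the classifier is a hand-written logistic regression, and the training pairs are toy data.

```python
import math
import re


def tokens(text):
    """Lowercased word tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z_]+", text.lower()))


def structural_features(snippet, full_block):
    """Hypothetical hand-crafted features about the snippet's structure."""
    lines = snippet.strip().splitlines()
    first = lines[0].lstrip() if lines else ""
    return [
        1.0 if snippet.strip() == full_block.strip() else 0.0,  # whole code block?
        1.0 if len(lines) == 1 else 0.0,                        # single line?
        1.0 if first.startswith(("import ", "from ")) else 0.0,  # import statement?
    ]


def correspondence_feature(intent, snippet):
    """Stand-in for a learned NL-code correspondence score (Jaccard overlap)."""
    a, b = tokens(intent), tokens(snippet)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def featurize(intent, snippet, full_block):
    return structural_features(snippet, full_block) + [
        correspondence_feature(intent, snippet)
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, labels, epochs=500, lr=0.5):
    """Logistic regression fitted by stochastic gradient descent."""
    w, b = [0.0] * len(examples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def score(model, x):
    """Estimated probability that an (intent, snippet) pair is a good pair."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy labeled pairs (invented): (intent, candidate snippet, full block, label).
sort_block = "xs = [3, 1, 2]\nxs.sort(reverse=True)"
read_block = "import io\ntext = open(name).read()"
labeled = [
    ("sort a list in reverse order", "xs.sort(reverse=True)", sort_block, 1),
    ("sort a list in reverse order", "xs = [3, 1, 2]", sort_block, 0),
    ("read a file into a string", "import io", read_block, 0),
    ("read a file into a string", "text = open(name).read()", read_block, 1),
]
model = train([featurize(i, s, blk) for i, s, blk, _ in labeled],
              [y for *_, y in labeled])

# Score candidate snippets from a new post against its title.
intent = "list files in a directory"
block = "import os\nfiles = os.listdir(path)"
candidates = ["import os", "files = os.listdir(path)"]
scores = {c: score(model, featurize(intent, c, block)) for c in candidates}
best = max(scores, key=scores.get)
```

In this sketch the correspondence feature is the only signal that depends on the question text; the structural features depend on the snippet alone, which is one plausible reading of why a classifier trained on one programming language might transfer to another.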
Author	1. Yin, Pengcheng (pcyin@cs.cmu.edu), Carnegie Mellon University
Author	2. Deng, Bowen (bdeng1@cs.cmu.edu), Carnegie Mellon University
Author	3. Chen, Edgar (edgarc@cs.cmu.edu), Carnegie Mellon University
Author	4. Vasilescu, Bogdan (bogdanv@cs.cmu.edu), Carnegie Mellon University
Author	5. Neubig, Graham (gneubig@cs.cmu.edu), Carnegie Mellon University
CODEN	IEEPAD
ContentType	Conference Proceeding
Copyright	2018 ACM
Copyright_xml	– notice: 2018 ACM
Discipline	Computer Science
EISBN	9781450357166 1450357164
EISSN	2574-3864
EndPage	486
ExternalDocumentID	8595231
Genre	orig-research
IsPeerReviewed	false
IsScholarly	true
Language	English
License	Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org
LinkModel	DirectLink
MeetingName	ICSE '18: 40th International Conference on Software Engineering
PageCount	11
PublicationCentury	2000
PublicationDate	2018-05-28
PublicationDateYYYYMMDD	2018-05-28
PublicationDate_xml	– month: 05 year: 2018 text: 2018-05-28 day: 28
PublicationDecade	2010
PublicationPlace	New York, NY, USA
PublicationPlace_xml	– name: New York, NY, USA
PublicationSeriesTitle	ACM Conferences
PublicationTitle	2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)
PublicationTitleAbbrev	MSR
PublicationYear	2018
Publisher	ACM
Publisher_xml	– name: ACM
StartPage	476
URI	https://ieeexplore.ieee.org/document/8595231