Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow
Published in | 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), pp. 476-486
---|---
Main Authors | Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, Graham Neubig
Format | Conference Proceeding
Language | English
Published | New York, NY, USA: ACM, 28.05.2018
Series | ACM Conferences
ISBN | 9781450357166; 1450357164
ISSN | 2574-3864
DOI | 10.1145/3196398.3196408
Summary: For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source for such a data set: the questions are diverse, and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and in the correctness of the NL-code pairs they obtain. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features that consider the structure of the extracted snippets, and correspondence features obtained by training a probabilistic neural model to capture the correlation between NL and code. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
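The pipeline described in the summary, hand-crafted structural features combined with a learned NL-code correspondence score and fed to a quality classifier, can be sketched minimally in Python. Everything below is illustrative only: the feature functions are placeholders, the token-overlap score stands in for the paper's neural correspondence model, and the toy labeled pairs are invented, so this is a sketch of the general idea rather than the authors' implementation.

```python
# Hypothetical sketch of mining aligned NL-code pairs from SO posts:
# structural features + a correspondence score feed a binary quality classifier.
from dataclasses import dataclass
from sklearn.linear_model import LogisticRegression


@dataclass
class Candidate:
    nl: str    # e.g., the Stack Overflow question title
    code: str  # a code snippet extracted from an answer to that question


def handcrafted_features(c: Candidate) -> list[float]:
    # Structural cues about the snippet (placeholders, not the paper's features).
    lines = c.code.splitlines()
    return [
        float(len(lines)),            # snippet length
        float(c.code.count("=")),     # assignment-like statements
        float("import " in c.code),   # hint: whole program vs. short fragment
    ]


def correspondence_score(c: Candidate) -> float:
    # Crude token-overlap proxy standing in for the paper's neural
    # probabilistic model of NL-code correspondence, so the sketch runs
    # without a trained network.
    nl_tokens = set(c.nl.lower().split())
    code_tokens = set(c.code.lower().replace("(", " ").replace(")", " ").split())
    return len(nl_tokens & code_tokens) / max(len(nl_tokens), 1)


def featurize(c: Candidate) -> list[float]:
    return handcrafted_features(c) + [correspondence_score(c)]


# Train the quality classifier on a small number of labeled examples
# (invented toy data; the paper uses annotated SO posts).
labeled = [
    (Candidate("how to reverse a list in python", "list(reversed(xs))"), 1),
    (Candidate("how to reverse a list in python", "print('hello world')"), 0),
]
X = [featurize(c) for c, _ in labeled]
y = [label for _, label in labeled]
clf = LogisticRegression().fit(X, y)

# Keep only candidates the classifier judges to be well-aligned pairs.
new = Candidate("sort a python list", "sorted(list(xs))")
print(clf.predict([featurize(new)]))  # e.g., [1] for a well-aligned pair
```

The design mirrors the two feature sets named in the summary: the hand-crafted features look only at the snippet's structure, while the correspondence feature ties the NL side to the code side, and the classifier learns to weight the two from a handful of labels.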