MulCS: Towards a Unified Deep Representation for Multilingual Code Search

Bibliographic Details
Published in: 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 120-131
Main Authors: Ma, Yingwei; Yu, Yue; Li, Shanshan; Jia, Zhouyang; Ma, Jun; Xu, Rulin; Dong, Wei; Liao, Xiangke
Format: Conference Proceeding
Language: English
Published: IEEE, 01.03.2023
More Information
Summary: Code search aims to retrieve relevant code snippets for a natural-language query and has become an essential aid to programmers in software development. With large and rapidly growing source code repositories covering various languages, multilingual code search can leverage more training data to learn complementary information across languages. Contrastive learning naturally captures the similarity between functionally equivalent code in different languages by pulling objects with the same functionality closer together while pushing dissimilar objects further apart. Some existing works apply contrastive learning to monolingual code search; however, they mainly exploit each specific programming language's textual semantics or syntactic structures for code representation. Because languages differ widely in syntax, format, and structure, these methods limit the performance of contrastive learning in multilingual training. To bridge this gap, we propose MulCS, a unified semantic graph representation approach for multilingual code search. Specifically, we first design a general semantic graph construction strategy that works across languages by building on the Intermediate Representation (IR). We then introduce a contrastive learning module integrated with a gated graph neural network (GGNN) to enhance query-to-multilingual-code matching. Extensive experiments on three representative languages show that our method outperforms state-of-the-art models by 10.7% to 77.5% on average in terms of MRR.
ISSN:2640-7574
DOI:10.1109/SANER56733.2023.00021
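Below is a minimal sketch of the in-batch contrastive query-code matching objective described in the summary above, written in PyTorch-style Python. The function name, the temperature value, and the stand-in embeddings are illustrative assumptions; MulCS itself pairs a query encoder with a GGNN over IR-based semantic graphs, which is not reproduced here.

import torch
import torch.nn.functional as F

def contrastive_matching_loss(query_emb, code_emb, temperature=0.07):
    # In-batch contrastive (InfoNCE-style) loss: the i-th query should match
    # the i-th code snippet; all other snippets in the batch act as negatives,
    # pulling functionally matching pairs together and pushing others apart.
    q = F.normalize(query_emb, dim=-1)   # (B, D) query embeddings
    c = F.normalize(code_emb, dim=-1)    # (B, D) code embeddings
    logits = q @ c.t() / temperature     # (B, B) scaled cosine similarities
    targets = torch.arange(q.size(0), device=q.device)
    # Symmetric loss over query-to-code and code-to-query retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Hypothetical usage with random stand-in embeddings; in MulCS these would come
# from the query encoder and the GGNN over the language-agnostic IR graph.
queries = torch.randn(8, 128)
codes = torch.randn(8, 128)
loss = contrastive_matching_loss(queries, codes)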