MulCS: Towards a Unified Deep Representation for Multilingual Code Search

Bibliographic Details
Published in: 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 120-131
Main Authors: Ma, Yingwei; Yu, Yue; Li, Shanshan; Jia, Zhouyang; Ma, Jun; Xu, Rulin; Dong, Wei; Liao, Xiangke
Format: Conference Proceeding
Language: English
Published: IEEE, 01.03.2023
More Information
Summary: Code search aims to retrieve relevant code snippets for a natural-language query and has become an essential aid to programmers in software development. With large and rapidly growing source code repositories covering various languages, multilingual code search can leverage more training data to learn complementary information across languages. Contrastive learning naturally captures the similarity between functionally equivalent code in different languages by pulling objects with the same functionality closer together while pushing dissimilar objects further apart. Some existing works apply contrastive learning to monolingual code search; however, they mainly exploit each specific programming language's textual semantics or syntactic structures for code representation. Because languages differ widely in syntax, format, and structure, these methods limit the performance of contrastive learning in multilingual training. To bridge this gap, we propose MulCS, a unified semantic graph representation approach for multilingual code search. Specifically, we first design a general semantic graph construction strategy that works across languages by building on the Intermediate Representation (IR). We then introduce a contrastive learning module integrated with a gated graph neural network (GGNN) to enhance query-to-multilingual-code matching. Extensive experiments on three representative languages show that our method outperforms state-of-the-art models by 10.7% to 77.5% on average in terms of MRR.
ISSN:2640-7574
DOI:10.1109/SANER56733.2023.00021
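Below is a minimal sketch of the in-batch contrastive query-code matching objective described in the summary above, written in PyTorch-style Python. The function name, the temperature value, and the stand-in embeddings are illustrative assumptions; MulCS itself pairs a query encoder with a GGNN over IR-based semantic graphs, which is not reproduced here.

import torch
import torch.nn.functional as F

def contrastive_matching_loss(query_emb, code_emb, temperature=0.07):
    # In-batch contrastive (InfoNCE-style) loss: the i-th query should match
    # the i-th code snippet; all other snippets in the batch act as negatives,
    # pulling functionally matching pairs together and pushing others apart.
    q = F.normalize(query_emb, dim=-1)   # (B, D) query embeddings
    c = F.normalize(code_emb, dim=-1)    # (B, D) code embeddings
    logits = q @ c.t() / temperature     # (B, B) scaled cosine similarities
    targets = torch.arange(q.size(0), device=q.device)
    # Symmetric loss over query-to-code and code-to-query retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Hypothetical usage with random stand-in embeddings; in MulCS these would come
# from the query encoder and the GGNN over the language-agnostic IR graph.
queries = torch.randn(8, 128)
codes = torch.randn(8, 128)
loss = contrastive_matching_loss(queries, codes)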