MulCS: Towards a Unified Deep Representation for Multilingual Code Search
Published in: 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 120-131
Format: Conference Proceeding
Language: English
Published: IEEE, 01.03.2023
Summary: Code search aims to retrieve relevant code snippets for a given query and has become an essential aid to programmers during software development. With large and rapidly growing source code repositories covering various languages, multilingual code search can leverage more training data to learn complementary information across languages. Contrastive learning naturally captures the similarity between functionally equivalent code in different languages by pulling objects with the same function closer together while pushing dissimilar objects further apart. Some existing works address monolingual code search with contrastive learning; however, they mainly exploit each specific programming language's textual semantics or syntactic structures for code representation. Because languages differ widely in syntax, format, and structure, these methods limit the performance of contrastive learning in multilingual training. To bridge this gap, we propose MulCS, a unified semantic graph representation approach for multilingual code search. Specifically, we first design a general semantic graph construction strategy that works across languages via an Intermediate Representation (IR). Furthermore, we integrate a contrastive learning module into a gated graph neural network (GGNN) to enhance query-multilingual code matching. Extensive experiments on three representative languages show that our method outperforms state-of-the-art models by 10.7% to 77.5% on average in terms of MRR.
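The abstract describes an in-batch contrastive objective (pulling a query toward its matching code snippet and away from non-matching ones) and reports results in MRR. The paper's actual model and loss are not reproduced here; the following is a minimal NumPy sketch of the generic InfoNCE-style contrastive loss and the MRR metric, assuming precomputed query and code embeddings (all function names and the temperature value are illustrative, not from the paper):

```python
import numpy as np

def info_nce_loss(query_emb, code_emb, temperature=0.07):
    """In-batch contrastive (InfoNCE) loss: each query's positive is the
    code snippet at the same batch index; all other snippets are negatives."""
    # Normalize so dot products become cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = code_emb / np.linalg.norm(code_emb, axis=1, keepdims=True)
    sim = q @ c.T / temperature                    # (batch, batch) similarities
    # Cross-entropy with the diagonal entries as the target classes.
    logits = sim - sim.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def mean_reciprocal_rank(sim):
    """MRR over a similarity matrix whose row i's relevant item is column i."""
    reciprocal_ranks = []
    for i, row in enumerate(sim):
        order = np.argsort(-row)                   # indices, most similar first
        rank = int(np.where(order == i)[0][0]) + 1 # 1-based rank of the positive
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))
```

Minimizing the loss drives matched query-code pairs together and mismatched pairs apart, which is exactly the geometry MRR rewards: a perfectly aligned batch ranks every positive first and yields an MRR of 1.0.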
ISSN: 2640-7574
DOI: 10.1109/SANER56733.2023.00021