Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs
Format | Journal Article |
---|---|
Language | English |
Published | 26.06.2024 |
Summary: | Some recently developed code large language models (Code LLMs) have been
pre-trained on repository-level code data (Repo-Code LLMs), enabling these
models to recognize repository structures and utilize cross-file information
for code completion. However, in real-world development scenarios, simply
concatenating the entire code repository often exceeds the context window
limits of these Repo-Code LLMs, leading to significant performance degradation.
In this study, we conducted extensive preliminary experiments and analyses on
six Repo-Code LLMs. The results indicate that maintaining the topological
dependencies of files and increasing the amount of code file content in the
completion prompts can improve completion accuracy, while pruning the specific
implementations of functions in all dependent files does not significantly
reduce completion accuracy. Based on these findings, we propose a strategy
named Hierarchical Context Pruning (HCP) to construct completion prompts with
high informational code content. HCP models the code repository at the function
level, maintaining the topological dependencies between code files while
removing a large amount of irrelevant code content, significantly reducing the
input length for repository-level code completion. We applied the HCP strategy
in experiments with six Repo-Code LLMs, and the results demonstrate that our
proposed method can significantly enhance completion accuracy while
substantially reducing input length. Our code and data are available at
https://github.com/Hambaobao/HCP-Coder. |
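The pruning behavior described in the summary (keep function signatures and file structure, drop function implementations in dependent files) can be sketched as follows. This is a minimal illustration assuming Python dependency files, not the authors' implementation; `prune_function_bodies` is a hypothetical helper name, and real HCP additionally ranks and prunes at the repository level.

```python
import ast


def prune_function_bodies(source: str) -> str:
    """Sketch of function-level pruning for a dependency file:
    keep module structure, signatures, and docstrings as high-level
    context, but replace each function's implementation with `...`.
    (Assumption: Python source; HCP itself is model/language-specific.)
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            new_body = []
            # Keep the docstring statement, if any, as cheap context.
            if ast.get_docstring(node) is not None:
                new_body.append(node.body[0])
            # Drop the actual implementation.
            new_body.append(ast.Expr(value=ast.Constant(value=...)))
            node.body = new_body
    return ast.unparse(tree)
```

A prompt builder could apply this to every file in the dependency closure of the file being completed, preserving the files' topological order, so the prompt stays under the context window while still exposing the repository's interfaces.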
DOI: | 10.48550/arxiv.2406.18294 |