How Effectively Do Code Language Models Understand Poor-Readability Code?
Published in: IEEE/ACM International Conference on Automated Software Engineering (ASE) proceedings, pp. 795-806
Format: Conference Proceeding
Language: English
Published: ACM, 27.10.2024
ISSN: 2643-1572
DOI: 10.1145/3691620.3695072
Summary: Code language models such as CodeT5 and CodeLlama have achieved substantial success in code comprehension. While most research efforts have focused on improving model architectures and training processes, we find that the benchmarks currently used to evaluate code comprehension models are confined to high-readability code, despite the prevalence of low-readability code in practice. As such, they are inadequate for demonstrating the full spectrum of a model's ability, particularly its robustness to varying degrees of readability. In this paper, we analyze the robustness of code summarization models on code of varying readability, using seven obfuscated datasets derived from existing benchmarks. Our findings indicate that current code summarization models are vulnerable to code with poor readability. In particular, their performance depends predominantly on semantic cues within the code, often neglecting syntactic aspects. Existing benchmarks are biased toward evaluating semantic features, thereby overlooking the models' ability to understand non-sensitive syntactic features. Based on these findings, we present PoorCodeSumEval, a new evaluation benchmark for code summarization tasks. PoorCodeSumEval introduces readability into the testing process, covering semantic obfuscation, syntactic obfuscation, and their combination, thereby providing a more comprehensive and rigorous evaluation of code summarization models. Our study also yields suggestions for future research, such as constructing multi-readability benchmarks to evaluate model robustness on poor-readability code, proposing readability-aware metrics, and developing automatic methods for code data cleaning and normalization.
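To make the idea of "semantic obfuscation" concrete, the sketch below shows one way a benchmark example could be degraded in readability: identifiers are replaced with opaque names while the syntax tree is left untouched. This is a minimal illustration of the general technique, not the paper's actual tooling; the names `IdentifierObfuscator` and `obfuscate_identifiers` are hypothetical.

```python
# Illustrative sketch of identifier-renaming ("semantic") obfuscation.
# Not the tooling from PoorCodeSumEval; names below are hypothetical.
import ast


class IdentifierObfuscator(ast.NodeTransformer):
    """Rename function names, arguments, and local variables to v0, v1, ..."""

    def __init__(self):
        self.mapping = {}

    def _rename(self, name: str) -> str:
        if name not in self.mapping:
            self.mapping[name] = f"v{len(self.mapping)}"
        return self.mapping[name]

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = self._rename(node.name)
        for arg in node.args.args:
            arg.arg = self._rename(arg.arg)
        self.generic_visit(node)
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Only rewrite names that are bound locally, so calls to builtins
        # such as sum() and len() stay intact.
        if node.id in self.mapping or isinstance(node.ctx, ast.Store):
            node.id = self._rename(node.id)
        return node


def obfuscate_identifiers(source: str) -> str:
    """Return a semantically obfuscated, syntax-preserving variant of `source`."""
    tree = IdentifierObfuscator().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


if __name__ == "__main__":
    readable = (
        "def average(values):\n"
        "    total = sum(values)\n"
        "    return total / len(values)\n"
    )
    print(obfuscate_identifiers(readable))
    # def v0(v1):
    #     v2 = sum(v1)
    #     return v2 / len(v1)
```

A syntactic obfuscation pass would instead restructure control flow or expression shape while keeping identifier names, and a cross-obfuscation would combine both, which is the distinction the benchmark's semantic/syntactic/cross settings rely on.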