Evaluating Fault Localization and Program Repair Capabilities of Existing Closed-Source General-Purpose LLMs
| Published in | 2024 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), pp. 75-78 |
|---|---|
| Format | Conference Proceeding |
| Language | English |
| Publisher | ACM |
| Published | 20.04.2024 |
| DOI | 10.1145/3643795.3648390 |
Summary: Automated debugging is an emerging research field that aims to automatically find and repair bugs. Within this field, Fault Localization (FL) and Automated Program Repair (APR) attract the most research effort. Recently, researchers have adopted pre-trained Large Language Models (LLMs) to facilitate FL and APR, with promising results. However, the LLMs they used have either been discontinued (such as Codex) or are outdated (such as early versions of GPT). In this paper, we evaluate the performance of three recent commercial closed-source general-purpose LLMs on FL and APR: ChatGPT 3.5, ERNIE Bot 3.5, and iFlytek Spark 2.0. We evaluate them on 120 real-world Java bugs from the Defects4J benchmark. For both FL and APR, we designed three kinds of prompts, each incorporating different kinds of information. The results show that these LLMs could successfully locate 53.3% and correctly fix 12.5% of the bugs.

CCS Concepts: Software and its engineering → Search-based software engineering; Software testing and debugging.
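The summary mentions that the authors designed prompts carrying different kinds of information for fault localization. As a rough illustration only, the sketch below shows how such a prompt might be assembled for a buggy Java method; `build_fl_prompt` and all its fields are assumptions for illustration, not the paper's actual prompt templates.

```python
# Hypothetical sketch of a prompt-based fault-localization setup.
# The function name, wording, and optional fields are assumptions,
# not the authors' exact prompts.

def build_fl_prompt(buggy_code: str,
                    failing_test: str = "",
                    error_message: str = "") -> str:
    """Assemble a fault-localization prompt. The optional fields mirror
    the idea of prompt variants carrying different kinds of information
    (code only, code + failing test, code + error message)."""
    parts = [
        "You are a debugging assistant. Identify the most suspicious "
        "line(s) in the following Java method and explain why.",
        "```java\n" + buggy_code + "\n```",
    ]
    if failing_test:
        parts.append("Failing test:\n```java\n" + failing_test + "\n```")
    if error_message:
        parts.append("Error message:\n" + error_message)
    return "\n\n".join(parts)


# Example with a classic off-by-one bug (illustrative, not from Defects4J):
method = (
    "int sum(int[] a) {\n"
    "  int s = 0;\n"
    "  for (int i = 0; i <= a.length; i++) s += a[i];\n"
    "  return s;\n"
    "}"
)
prompt = build_fl_prompt(method,
                         error_message="ArrayIndexOutOfBoundsException")
print(prompt)
```

The resulting string would then be sent to a chat-style LLM; richer prompt variants simply append more context before querying.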