Evaluating Fault Localization and Program Repair Capabilities of Existing Closed-Source General-Purpose LLMs

Bibliographic Details
Published in: 2024 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), pp. 75-78
Main Authors: Jiang, Shengbei; Zhang, Jiabao; Chen, Wei; Wang, Bo; Zhou, Jianyi; Zhang, Jie M.
Format: Conference Proceeding
Language: English
Published: ACM, 20.04.2024
DOI: 10.1145/3643795.3648390

Summary: Automated debugging is an emerging research field that aims to find and repair bugs automatically. Within this field, Fault Localization (FL) and Automated Program Repair (APR) attract the most research effort. Recently, researchers have adopted pre-trained Large Language Models (LLMs) to facilitate FL and APR, with promising results. However, the LLMs they used have either been discontinued (such as Codex) or become outdated (such as early versions of GPT). In this paper, we evaluate the performance of three recent commercial closed-source general-purpose LLMs on FL and APR: ChatGPT 3.5, ERNIE Bot 3.5, and IFlytek Spark 2.0. We evaluate them on 120 real-world Java bugs from the Defects4J benchmark. For both FL and APR, we design three kinds of prompts, each incorporating different kinds of information. The results show that these LLMs successfully locate 53.3% and correctly fix 12.5% of the bugs.

CCS Concepts: • Software and its engineering → Search-based software engineering; Software testing and debugging.
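The summary describes prompt variants that carry different kinds of information for each task. A minimal sketch of how such context-graded fault-localization prompts could be assembled is shown below; the template wording, function name, and the toy Java snippet are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical sketch of assembling FL prompt variants with increasing
# context (code only; code + failing test; code + test + error message),
# in the spirit of the evaluation setup described in the abstract.
# All names and template text here are assumptions for illustration.

def build_fl_prompt(code, failing_test=None, error_message=None):
    """Build a fault-localization prompt; the optional fields mimic
    prompt variants that carry different kinds of information."""
    parts = [
        "Identify the buggy line(s) in the following Java method:",
        code,
    ]
    if failing_test:
        parts.append("Failing test:\n" + failing_test)
    if error_message:
        parts.append("Error message:\n" + error_message)
    return "\n\n".join(parts)

BUGGY_METHOD = """public int add(int a, int b) {
    return a - b;  // bug: should be a + b
}"""

# Variant 1: code only; variant 2: code plus a failing test.
prompt_basic = build_fl_prompt(BUGGY_METHOD)
prompt_with_test = build_fl_prompt(
    BUGGY_METHOD, failing_test="assertEquals(3, add(1, 2));")
```

The same pattern extends naturally to APR prompts by swapping the instruction line for a repair request (e.g., "Provide a fixed version of the method").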