Calibration of Large Language Models on Code Summarization
Format | Journal Article |
---|---|
Language | English |
Published | 30.04.2024 |
DOI | 10.48550/arxiv.2404.19318 |
Summary: A brief, fluent, and relevant summary can be helpful during program comprehension; however, producing such a summary requires significant human effort. Good summaries are often unavailable in software projects, which makes maintenance more difficult. There has been a considerable body of research into automated AI-based methods that use Large Language Models (LLMs) to generate summaries of code; there has also been quite a bit of work on ways to measure the performance of such summarization methods, with special attention paid to how closely these AI-generated summaries resemble a summary a human might have produced. Measures such as BERTScore and BLEU have been suggested and evaluated with human-subject studies.

However, LLM-generated summaries can be inaccurate, incomplete, or otherwise too dissimilar to one that a good developer might write. Given an LLM-generated code summary, how can a user rationally judge whether the summary is sufficiently good and reliable? Given just some input source code and an LLM-generated summary, existing approaches can help judge the brevity, fluency, and relevance of the summary; however, it is difficult to gauge whether an LLM-generated summary sufficiently resembles what a human might produce without a "golden" human-produced summary to compare against. We study this resemblance question as a calibration problem: given just the code and the summary from an LLM, can we compute a confidence measure that provides a reliable indication of whether the summary sufficiently resembles what a human would have produced in this situation? We examine this question using several LLMs, for several languages, and in several different settings. Our investigation suggests approaches that provide reliable predictions of the likelihood that an LLM-generated summary would sufficiently resemble a summary a human might write for the same code.
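The record above only sketches the problem setup, so the following is a minimal illustrative sketch (not the paper's actual method) of how such a calibration check could be framed: label each LLM-generated summary as "sufficiently resembling" a human reference using a similarity measure such as BLEU, pair each label with a confidence score obtained from the model, and evaluate calibration with Expected Calibration Error. The similarity threshold, the source of the confidence scores, the bin count, and all function names here are illustrative assumptions.

```python
# Illustrative sketch under stated assumptions: BLEU-vs-threshold as a proxy for
# "sufficient resemblance" to a human summary, plus Expected Calibration Error (ECE)
# over (confidence, label) pairs. Not the paper's actual method.
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def resembles_human(llm_summary: str, human_summary: str, threshold: float = 0.3) -> bool:
    """Proxy label: sentence-level BLEU against one human reference summary,
    compared to an assumed threshold (0.3 is an arbitrary illustrative choice)."""
    smooth = SmoothingFunction().method1
    score = sentence_bleu([human_summary.split()], llm_summary.split(),
                          smoothing_function=smooth)
    return score >= threshold


def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Standard ECE: bin predictions by confidence, then average the gap between
    mean confidence and empirical accuracy in each bin, weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece


if __name__ == "__main__":
    # Toy usage: in a real study, confidences would come from the LLM (e.g., derived
    # from token probabilities) and labels from comparing its summaries against
    # held-out human-written summaries.
    rng = np.random.default_rng(0)
    confidences = rng.uniform(0.0, 1.0, size=200)
    correct = (rng.uniform(0.0, 1.0, size=200) < confidences).astype(float)
    print(f"ECE: {expected_calibration_error(confidences, correct):.3f}")
```

In this framing, the reference summaries are only needed offline to construct labels for a calibration study; the goal described in the abstract is a confidence measure that remains informative when no golden summary is available at prediction time.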