DafnyBench: A Benchmark for Formal Software Verification

Bibliographic Details
Published in: arXiv.org
Main Authors: Loughridge, Chloe; Sun, Qinyi; Ahrenbach, Seth; Cassano, Federico; Sun, Chuyue; Sheng, Ying; Mudide, Anish; Misu, Md Rakib Hossain; Amin, Nada; Tegmark, Max
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 12.06.2024
Summary: We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved a 68% success rate, and we quantify how this rate improves when retrying with error-message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.
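For readers unfamiliar with Dafny, the "hints" mentioned in the summary are annotations such as loop invariants and assertions. The method below is a minimal hypothetical sketch, not a program drawn from the benchmark itself: with the three invariant lines removed, the Dafny verifier cannot establish the two ensures clauses, and restoring such annotations is the kind of task the benchmark poses to a model.

    // Hypothetical example of Dafny "hints": the loop invariants below
    // must be supplied for the postconditions to verify.
    method MaxElement(a: array<int>) returns (m: int)
      requires a.Length > 0
      ensures forall k :: 0 <= k < a.Length ==> a[k] <= m
      ensures exists k :: 0 <= k < a.Length && a[k] == m
    {
      m := a[0];
      var i := 1;
      while i < a.Length
        invariant 1 <= i <= a.Length                     // hint: loop bounds
        invariant forall k :: 0 <= k < i ==> a[k] <= m   // hint: m bounds the prefix seen so far
        invariant exists k :: 0 <= k < i && a[k] == m    // hint: m occurs in that prefix
      {
        if a[i] > m { m := a[i]; }
        i := i + 1;
      }
    }

In the evaluation described above, a model is given such a program with its annotations stripped and must regenerate enough of them for the verifier to accept the stated pre- and postconditions.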
ISSN: 2331-8422