HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics
Format | Journal Article |
---|---|
Language | English |
Published | 13.10.2024 |
Summary | Advanced applied mathematics problems are underrepresented in existing Large Language Model (LLM) benchmark datasets. To address this, we introduce HARDMath, a dataset inspired by a graduate course on asymptotic methods, featuring challenging applied mathematics problems that require analytical approximation techniques. These problems demand a combination of mathematical reasoning, computational tools, and subjective judgment, making them difficult for LLMs. Our framework auto-generates a large number of problems with solutions validated against numerical ground truths. We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models perform significantly worse than they do on existing mathematics benchmarks. We additionally conduct a detailed error analysis to gain insight into LLM failure cases. These results demonstrate the limitations of current LLMs on advanced, graduate-level applied mathematics problems and underscore the importance of datasets like HARDMath for advancing the mathematical abilities of LLMs. |
DOI | 10.48550/arxiv.2410.09988 |
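
The abstract describes a framework that auto-generates asymptotics problems and validates solutions against numerical ground truths. As a minimal sketch of that idea (not the HARDMath implementation; the example problem, function names, and tolerance below are assumptions chosen for illustration), consider a classic singularly perturbed root-finding problem of the kind covered in asymptotic-methods courses, where a first-order analytical approximation is checked against a numerical root finder:

```python
# Sketch: validate an asymptotic approximation against a numerical ground
# truth, in the spirit of the validation step described in the abstract.
#
# Toy problem (assumed for illustration): roots of the singularly perturbed
# polynomial  eps*x**2 + x - 1 = 0  with  0 < eps << 1.
# First-order asymptotic expansions give:
#   regular root:   x ~ 1 - eps
#   singular root:  x ~ -1/eps - 1
import numpy as np


def asymptotic_roots(eps: float) -> tuple[float, float]:
    """First-order asymptotic approximations for eps*x^2 + x - 1 = 0."""
    return 1.0 - eps, -1.0 / eps - 1.0


def numerical_roots(eps: float) -> np.ndarray:
    """Numerical ground truth via numpy's polynomial root finder."""
    # Coefficients of eps*x^2 + 1*x - 1, highest degree first.
    return np.roots([eps, 1.0, -1.0])


def validate(eps: float, rel_tol: float = 0.05) -> bool:
    """Accept the analytical solution if each asymptotic root lies within
    rel_tol relative error of some numerically computed root."""
    exact = numerical_roots(eps)
    return all(
        min(abs(a - e) / abs(e) for e in exact) < rel_tol
        for a in asymptotic_roots(eps)
    )


if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3):
        print(f"eps={eps:g}: validated={validate(eps)}")
```

Running the sketch shows the relative error of the asymptotic roots shrinking as eps decreases, which is the sense in which a numerical ground truth can validate an analytical approximation.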