BLADE: Benchmarking Language Model Agents for Data-Driven Science
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 18.08.2024 |
Subjects | |
Summary: | Data-driven scientific discovery requires the iterative integration of
scientific domain knowledge, statistical expertise, and an understanding of
data semantics to make nuanced analytical decisions, e.g., about which
variables, transformations, and statistical models to consider. LM-based agents
equipped with planning, memory, and code execution capabilities have the
potential to support data-driven science. However, evaluating agents on such
open-ended tasks is challenging due to multiple valid approaches, partially
correct steps, and different ways to express the same decisions. To address
these challenges, we present BLADE, a benchmark to automatically evaluate
agents' multifaceted approaches to open-ended research questions. BLADE
consists of 12 datasets and research questions drawn from existing scientific
literature, with ground truth collected from independent analyses by expert
data scientists and researchers. To automatically evaluate agent responses, we
developed corresponding computational methods to match different
representations of analyses to this ground truth. Though language models
possess considerable world knowledge, our evaluation shows that they are often
limited to basic analyses. However, agents capable of interacting with the
underlying data demonstrate improved, but still non-optimal, diversity in their
analytical decision making. Our work enables the evaluation of agents for
data-driven science and provides researchers deeper insights into agents'
analysis approaches. |
---|---|
DOI: | 10.48550/arxiv.2408.09667 |