MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains
Format: Journal Article
Language: English
Published: 17.07.2024
Summary: Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and irreproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains: Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance. Datasets and evaluation scripts of MMAU are released at https://github.com/apple/axlearn/tree/main/docs/research/mmau.
DOI: 10.48550/arxiv.2407.18961
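The abstract notes that MMAU's datasets and evaluation scripts are released at the linked repository. As a rough illustration of how an offline benchmark of this kind can be consumed, the sketch below scores per-domain exact-match accuracy over a JSONL dump of prompts. The file name, the field names ("domain", "prompt", "answer"), and the model_answer() helper are assumptions for illustration only, not the released MMAU data format or the axlearn API.

```python
# Illustrative sketch only: the JSONL path, field names, and model_answer()
# are hypothetical, not the actual MMAU release format or axlearn API.
import json
from collections import defaultdict


def model_answer(prompt: str) -> str:
    """Stand-in for a call to the LLM under evaluation (assumed interface)."""
    raise NotImplementedError


def evaluate(path: str = "mmau_prompts.jsonl") -> dict:
    """Compute exact-match accuracy per domain over offline, MMAU-style prompts."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            example = json.loads(line)  # assumed fields: domain, prompt, answer
            prediction = model_answer(example["prompt"])
            total[example["domain"]] += 1
            if prediction.strip() == example["answer"].strip():
                correct[example["domain"]] += 1
    return {domain: correct[domain] / total[domain] for domain in total}
```

Exact-match scoring is a simplification: domains such as Data Science and Machine Learning coding or Tool-use would require execution-based or structured checks rather than string comparison, as described in the paper's released evaluation scripts.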