SPORTU: A Comprehensive Sports Understanding Benchmark for Multimodal Large Language Models
Format: Journal Article
Language: English
Published: 10.10.2024
Summary: Multimodal Large Language Models (MLLMs) are advancing the ability to reason about complex sports scenarios by integrating textual and visual information. To comprehensively evaluate their capabilities, we introduce SPORTU, a benchmark designed to assess MLLMs across multi-level sports reasoning tasks. SPORTU comprises two key components: SPORTU-text, featuring 900 multiple-choice questions with human-annotated explanations for rule comprehension and strategy understanding, which tests models' ability to reason about sports through question answering (QA) alone, without visual inputs; and SPORTU-video, consisting of 1,701 slow-motion video clips across 7 different sports and 12,048 QA pairs, designed to assess multi-level reasoning, from simple sport recognition to complex tasks such as foul detection and rule application. On SPORTU-text, we evaluate four prevalent LLMs using few-shot learning supplemented by chain-of-thought (CoT) prompting. GPT-4o achieves the highest accuracy at 71%, but still falls short of human-level performance, highlighting room for improvement in rule comprehension and reasoning. On SPORTU-video, we evaluate 7 proprietary and 6 open-source MLLMs. Experiments show that models fall short on hard tasks that require deep reasoning and rule-based understanding. Claude-3.5-Sonnet performs best with only 52.6% accuracy on the hard task, showing large room for improvement. We hope that SPORTU will serve as a critical step toward evaluating models' capabilities in sports understanding and reasoning.
DOI: 10.48550/arxiv.2410.08474
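
The most concrete methodological detail in the abstract is the few-shot chain-of-thought prompting used to evaluate LLMs on SPORTU-text. Below is a minimal sketch of how such a prompt might be assembled; the record fields, the example questions, and the prompt wording are illustrative assumptions, not the authors' released format.

```python
# Minimal sketch of few-shot chain-of-thought (CoT) prompting for
# multiple-choice sports-rule questions, in the style the abstract describes
# for SPORTU-text. The schema ("question", "choices", "explanation", "answer")
# and the examples below are hypothetical, not from the SPORTU release.

FEW_SHOT_EXAMPLES = [
    {
        "question": (
            "In basketball, a defender slides into the shooter's path after "
            "the shooter has left the floor, and contact occurs. What is the call?"
        ),
        "choices": ["A. Charge", "B. Blocking foul", "C. No call", "D. Traveling"],
        "explanation": (
            "A defender must establish legal guarding position before the "
            "shooter becomes airborne; arriving late makes this a blocking foul."
        ),
        "answer": "B",
    },
]


def build_cot_prompt(question: str, choices: list[str]) -> str:
    """Assemble a few-shot prompt that shows worked reasoning before each
    answer letter, then asks the model to reason about the new question."""
    parts = []
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(
            f"Question: {ex['question']}\n"
            + "\n".join(ex["choices"])
            + f"\nReasoning: {ex['explanation']}\nAnswer: {ex['answer']}\n"
        )
    # The target question ends with an open reasoning cue instead of an answer.
    parts.append(
        f"Question: {question}\n"
        + "\n".join(choices)
        + "\nReasoning: Let's think step by step."
    )
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_cot_prompt(
        "In soccer, a pass deflects off a defender to an attacker standing in "
        "an offside position, who then scores. Is the goal valid?",
        [
            "A. Yes, any deflection resets offside",
            "B. No, the attacker is penalized for offside",
            "C. Yes, offside never applies to deflections",
            "D. The referee orders a drop ball",
        ],
    ))
```

Under this setup, scoring reduces to extracting the model's final answer letter and comparing it against the gold label, which is how a multiple-choice accuracy figure such as the reported 71% would be computed.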