Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

As Large Language Models (LLMs) continue to evolve, the search for efficient and meaningful evaluation methods is ongoing. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, using a single LLM judge is prone to int...
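To illustrate the council idea the abstract describes, here is a minimal sketch (not the paper's implementation) of scoring a response with several LLM judges and aggregating their votes rather than relying on a single judge; the judge functions, model names, and rating scale are assumptions for demonstration only.

```python
# Minimal sketch, assuming a 1-10 rating scale and stubbed judge calls.
# None of the names below come from the paper; real judges would wrap API calls.
from statistics import mean, median
from typing import Callable, Dict


def council_score(
    prompt: str,
    response: str,
    judges: Dict[str, Callable[[str, str], float]],
) -> dict:
    """Ask every judge model to rate a response, then aggregate the votes."""
    per_judge = {name: judge(prompt, response) for name, judge in judges.items()}
    return {
        "per_judge": per_judge,                 # keep individual votes for auditing
        "mean": mean(per_judge.values()),       # simple average across the council
        "median": median(per_judge.values()),   # robust to a single outlier judge
    }


# Usage with stub judges standing in for real model calls:
judges = {
    "judge-a": lambda p, r: 7.0,
    "judge-b": lambda p, r: 8.0,
    "judge-c": lambda p, r: 6.5,
}
print(council_score("Give advice to a friend.", "Maybe talk it through...", judges))
```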

Bibliographic Details
Published in: arXiv.org
Main Authors: Zhao, Justin; Plaza-del-Arco, Flor Miriam; Genchel, Benjie; Curry, Amanda Cercas
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 21.10.2024
