Multi-Model Consistency for LLMs' Evaluation

Bibliographic Details
Published in: 2024 International Joint Conference on Neural Networks (IJCNN), pp. 1-8
Main Authors: Zhu, Qinrui; Lyu, Derui; Fan, Xi; Wang, Xiangyu; Tu, Qiang; Zhan, Yibin; Chen, Huanhuan
Format: Conference Proceeding
Language: English
Published: IEEE, 30.06.2024

Summary: This paper introduces an evaluation method for large language models (LLMs) based on multi-model factual cognition consistency. Traditional evaluation methods, especially factuality assessments, face challenges in constructing extensive domain-specific question sets and in relying on the answers of specific models, and they fall short in the face of dynamic and diverse model development. To overcome these limitations, the proposed approach does not depend on a fixed set of standard answers; instead, it utilizes the responses of multiple models to construct a dynamic, relative evaluation benchmark. We first developed a framework to capture and compare the cognitive consistency of different models when addressing specific questions, and subsequently designed a dynamic iterative algorithm to evaluate models based on these sets of answers. Experiments across multiple domains demonstrated the effectiveness of this method. This evaluation strategy not only provides a more comprehensive and flexible approach to understanding and assessing the performance of LLMs in various scenarios, but also offers practical guidance for future model development and improvement.
ISSN: 2161-4407
DOI: 10.1109/IJCNN60899.2024.10651158
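
The summary above describes the approach only at a high level, and the paper's actual algorithm is not reproduced in this record. As a rough illustration of the general idea, the following is a minimal sketch of one way an iterative multi-model consistency score could be computed: each model is scored by its agreement with the other models' answers, and the scores are refined over several iterations so that agreement with already-high-scoring models counts for more. The `agreement` function (exact string match), the uniform initialization, and the linear update rule are all assumptions for demonstration, not details taken from the paper.

```python
# Illustrative sketch only: a generic iterative peer-consistency scorer.
# The agreement measure, weighting scheme, and convergence rule below are
# assumptions for demonstration and are not taken from the paper.
import numpy as np


def agreement(ans_a: str, ans_b: str) -> float:
    """Toy agreement signal: exact match. A real system might instead use
    semantic similarity between the two model answers."""
    return 1.0 if ans_a.strip().lower() == ans_b.strip().lower() else 0.0


def consistency_scores(answers: list[list[str]], iters: int = 20, tol: float = 1e-6) -> np.ndarray:
    """answers[m][q] = answer of model m to question q.

    Returns one score per model: its weighted agreement with the other
    models, where the weights are refined iteratively."""
    n_models = len(answers)
    n_questions = len(answers[0])

    # Pairwise agreement between models, averaged over questions.
    pair = np.zeros((n_models, n_models))
    for i in range(n_models):
        for j in range(n_models):
            if i != j:
                pair[i, j] = np.mean(
                    [agreement(answers[i][q], answers[j][q]) for q in range(n_questions)]
                )

    scores = np.full(n_models, 1.0 / n_models)  # start from uniform weights
    for _ in range(iters):
        # A model scores highly if it agrees with models that already score highly.
        new = pair @ scores
        new = new / new.sum() if new.sum() > 0 else np.full(n_models, 1.0 / n_models)
        if np.abs(new - scores).max() < tol:
            scores = new
            break
        scores = new
    return scores


if __name__ == "__main__":
    # Three models answering two factual questions; the third disagrees on one.
    responses = [
        ["Paris", "1969"],   # model A
        ["Paris", "1969"],   # model B
        ["Lyon",  "1969"],   # model C
    ]
    print(consistency_scores(responses))  # A and B receive higher scores than C
```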