Enhancing QA System Evaluation: An In-Depth Analysis of Metrics and Model-Specific Behaviors

Bibliographic Details
Published in: Journal of Information Science Theory and Practice, Vol. 13, No. 1, pp. 85-98
Main Authors: Kim, Heesop; Ademola, Aluko
Format: Journal Article
Language: English
Published: Daejeon: Korea Institute of Science and Technology Information, 01.03.2025
Online Access: https://accesson.kr/jistap/v.13/1/85/55253
ISSN: 2287-9099, 2287-4577
DOI: 10.1633/JISTaP.2025.13.1.6

More Information
Summary: The purpose of this study is to examine how evaluation metrics influence the perception and performance of question answering (QA) systems, particularly focusing on their effectiveness in QA tasks. We compare four models: BERT, BioBERT, Bio-ClinicalBERT, and RoBERTa, using ten EPIC-QA questions to assess each model's answer extraction performance. The analysis employs both semantic and lexical metrics. The outcomes reveal clear model-specific behaviors: Bio-ClinicalBERT initially identifies irrelevant phrases before homing in on relevant information, whereas BERT and BioBERT consistently converge on similar answers, exhibiting a high degree of similarity. RoBERTa, on the other hand, demonstrates effective use of long-range dependencies in text. Semantic metrics outperform lexical metrics, with BERTScore achieving the highest accuracy (0.97), highlighting the significance of semantic evaluation. Our findings indicate that the choice of evaluation metrics significantly influences the perceived efficacy of models, suggesting that semantic metrics offer more nuanced and insightful assessments of QA system performance. This study contributes to the field of natural language processing and machine learning by providing guidelines for selecting evaluation metrics that align with the strengths and weaknesses of various QA approaches.
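
To make the lexical-versus-semantic contrast concrete, the sketch below scores a hypothetical predicted answer against a reference answer with SQuAD-style token-overlap F1 and with BERTScore via the open-source bert-score package. This is an illustration of the two metric families discussed in the abstract, not the authors' evaluation pipeline; the example answer strings and the choice of the bert-score package are assumptions.

```python
# Illustrative only: contrasts a lexical QA metric (SQuAD-style token F1)
# with a semantic one (BERTScore). The answer strings below are invented
# and are not data from the study.
from collections import Counter

def lexical_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # min count per token
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

prediction = "masks reduce the spread of respiratory droplets"
reference = "wearing a mask limits the transmission of respiratory droplets"

print(f"lexical F1:   {lexical_f1(prediction, reference):.2f}")

# Semantic similarity via the bert-score package (pip install bert-score);
# score() returns per-pair precision/recall/F1 tensors.
from bert_score import score
P, R, F1 = score([prediction], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.2f}")
```

On paraphrased answers like these, the token-overlap score stays low because "masks"/"mask" and "reduce"/"limits" never match lexically, while BERTScore remains high; this mirrors the abstract's point that semantic metrics capture meaning that lexical overlap misses.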