Enhancing QA System Evaluation: An In-Depth Analysis of Metrics and Model-Specific Behaviors

Bibliographic Details
Published in: Journal of Information Science Theory and Practice, Vol. 13, No. 1, pp. 85-98
Main Authors: Kim, Heesop; Ademola, Aluko
Format: Journal Article
Language: English
Published: Daejeon: Korea Institute of Science and Technology Information, 01.03.2025
Online Access: https://accesson.kr/jistap/v.13/1/85/55253
ISSN: 2287-9099, 2287-4577
DOI: 10.1633/JISTaP.2025.13.1.6

More Information
Summary: The purpose of this study is to examine how evaluation metrics influence the perception and performance of question answering (QA) systems, particularly focusing on their effectiveness in QA tasks. We compare four models: BERT, BioBERT, Bio-ClinicalBERT, and RoBERTa, using ten EPIC-QA questions to assess each model's answer extraction performance. The analysis employs both semantic and lexical metrics. The outcomes reveal clear model-specific behaviors: Bio-ClinicalBERT initially identifies irrelevant phrases before homing in on relevant information, whereas BERT and BioBERT consistently converge on similar answers, exhibiting a high degree of similarity. RoBERTa, on the other hand, demonstrates effective use of long-range dependencies in text. Semantic metrics outperform lexical metrics, with BERTScore achieving the highest accuracy (0.97), highlighting the significance of semantic evaluation. Our findings indicate that the choice of evaluation metrics significantly influences the perceived efficacy of models, suggesting that semantic metrics offer more nuanced and insightful assessments of QA system performance. This study contributes to the field of natural language processing and machine learning by providing guidelines for selecting evaluation metrics that align with the strengths and weaknesses of various QA approaches.
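
To make the lexical-versus-semantic contrast concrete, the sketch below scores a hypothetical predicted answer against a reference answer with SQuAD-style token-overlap F1 and with BERTScore via the open-source bert-score package. This is an illustration of the two metric families discussed in the abstract, not the authors' evaluation pipeline; the example answer strings and the choice of the bert-score package are assumptions.

```python
# Illustrative only: contrasts a lexical QA metric (SQuAD-style token F1)
# with a semantic one (BERTScore). The answer strings below are invented
# and are not data from the study.
from collections import Counter

def lexical_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # min count per token
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

prediction = "masks reduce the spread of respiratory droplets"
reference = "wearing a mask limits the transmission of respiratory droplets"

print(f"lexical F1:   {lexical_f1(prediction, reference):.2f}")

# Semantic similarity via the bert-score package (pip install bert-score);
# score() returns per-pair precision/recall/F1 tensors.
from bert_score import score
P, R, F1 = score([prediction], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.2f}")
```

On paraphrased answers like these, the token-overlap score stays low because "masks"/"mask" and "reduce"/"limits" never match lexically, while BERTScore remains high; this mirrors the abstract's point that semantic metrics capture meaning that lexical overlap misses.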