Enhancing QA System Evaluation: An In-Depth Analysis of Metrics and Model-Specific Behaviors
Published in | Journal of Information Science Theory and Practice, Vol. 13, No. 1, pp. 85-98
Format | Journal Article
Language | English
Published | Daejeon: Korea Institute of Science and Technology Information, 01.03.2025
ISSN | 2287-9099; 2287-4577
DOI | 10.1633/JISTaP.2025.13.1.6
Summary | The purpose of this study is to examine how evaluation metrics influence the perception and performance of question answering (QA) systems, focusing on their effectiveness in QA tasks. We compare four models: BERT, BioBERT, Bio-ClinicalBERT, and RoBERTa, using ten EPIC-QA questions to assess each model's answer extraction performance. The analysis employs both semantic and lexical metrics. The outcomes reveal clear model-specific behaviors: Bio-ClinicalBERT initially identified irrelevant phrases before focusing on relevant information, whereas BERT and BioBERT consistently converged on similar answers, exhibiting a high degree of similarity. RoBERTa, on the other hand, demonstrated effective use of long-range dependencies in text. Semantic metrics outperformed lexical metrics, with BERTScore achieving the highest accuracy (0.97), highlighting the importance of semantic evaluation. Our findings indicate that the choice of evaluation metric significantly influences the perceived efficacy of a model, suggesting that semantic metrics offer more nuanced and insightful assessments of QA system performance. This study contributes to natural language processing and machine learning by providing guidelines for selecting evaluation metrics that align with the strengths and weaknesses of various QA approaches.
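The summary contrasts lexical and semantic evaluation metrics. As a minimal illustration of the distinction (not the paper's implementation, whose exact lexical metrics are not named here), the sketch below computes a SQuAD-style token-overlap F1, a common lexical metric for extractive QA; it shows how a semantically correct paraphrase can still score zero lexically, which is the gap semantic metrics such as BERTScore are meant to close.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style lexical F1: harmonic mean of token-overlap
    precision and recall between a predicted and a gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; otherwise no overlap is possible.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A paraphrased answer with no shared surface tokens scores 0.0
# lexically even when it is semantically equivalent (example
# strings are hypothetical, not from the study's EPIC-QA data):
print(token_f1("droplet transmission", "spread via respiratory droplets"))
print(token_f1("respiratory droplets", "spread via respiratory droplets"))
```

A semantic metric would instead compare contextual embeddings of the two strings, so the first pair above would receive a high score despite the zero token overlap.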
Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 https://accesson.kr/jistap/v.13/1/85/55253 |