Evaluating Natural Language Inference Models: A Metamorphic Testing Approach

Bibliographic Details
Published in: Proceedings - International Symposium on Software Reliability Engineering, pp. 220-230
Main Authors: Jiang, Mingyue; Bao, Houzhen; Tu, Kaiyi; Zhang, Xiao-Yi; Ding, Zuohua
Format: Conference Proceeding
Language: English; Japanese
Published: IEEE, 01.10.2021
ISSN: 2332-6549
DOI: 10.1109/ISSRE52982.2021.00033

Summary: Natural language inference (NLI) is a fundamental NLP task that forms the cornerstone of deep natural language understanding. Unfortunately, evaluating NLI models is challenging. On the one hand, due to the lack of test oracles, it is difficult to automatically judge the correctness of an NLI model's predictions. On the other hand, beyond knowing how well a model performs, there is a further need to understand the capabilities and characteristics of different NLI models. To mitigate these issues, we propose to apply metamorphic testing (MT) to NLI. We identify six categories of metamorphic relations, covering a wide range of properties that the NLI task is expected to possess. On this basis, MT can be conducted on NLI models without test oracles, and the MT results can be used to interpret NLI models' capabilities from various aspects. We further demonstrate the validity and effectiveness of our approach through experiments on five NLI models. Our experiments expose a large number of prediction failures in the subject NLI models and also yield interpretations of common characteristics of NLI models.
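
As a rough illustration of the idea described in the summary (not taken from the paper itself), the sketch below applies a single metamorphic relation to an NLI model: a meaning-preserving rewording of the hypothesis should leave the predicted label unchanged. The names `toy_nli_model` and `reword_hypothesis`, and the specific relation used, are assumptions for illustration only; the paper's six relation categories are not reproduced here.

```python
import string

# Everything in this sketch is illustrative: `toy_nli_model` stands in for a
# real NLI model, and the relation tested (label invariance under a
# meaning-preserving rewording of the hypothesis) is only one example of a
# metamorphic relation, not necessarily one of the paper's six categories.


def toy_nli_model(premise: str, hypothesis: str) -> str:
    """Stand-in for the subject NLI model: a crude word-overlap heuristic."""
    words = lambda s: {w.strip(string.punctuation) for w in s.lower().split()}
    return "entailment" if words(hypothesis) <= words(premise) else "neutral"


def reword_hypothesis(hypothesis: str) -> str:
    """Toy meaning-preserving transformation (synonym substitution)."""
    return hypothesis.replace("purchased", "bought")


def run_metamorphic_test(predict, premise: str, hypothesis: str) -> bool:
    """Check that rewording the hypothesis does not change the predicted label.

    No test oracle is needed: we never ask whether either prediction is
    correct, only whether the source and follow-up predictions agree.
    """
    source_label = predict(premise, hypothesis)
    followup_label = predict(premise, reword_hypothesis(hypothesis))
    return source_label == followup_label


if __name__ == "__main__":
    premise = "A woman purchased two tickets at the box office."
    hypothesis = "A woman purchased tickets."
    ok = run_metamorphic_test(toy_nli_model, premise, hypothesis)
    print("consistent" if ok else "prediction failure exposed")
```

Because the toy model relies on surface word overlap, the synonym substitution flips its prediction and the test reports a failure, even though no ground-truth label was ever consulted; this is the sense in which MT sidesteps the oracle problem described above.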