The Impact of Unstated Norms in Bias Analysis of Language Models
Format | Journal Article
Language | English
Published | 04.04.2024
Summary: Bias in large language models (LLMs) has many forms, from overt discrimination to implicit stereotypes. Counterfactual bias evaluation is a widely used approach to quantifying bias; it often relies on template-based probes that explicitly state group membership and measures whether the outcome of a task performed by an LLM is invariant under a change of group membership. In this work, we find that template-based probes can lead to unrealistic bias measurements. For example, LLMs appear to mistakenly cast text associated with the White race as negative at higher rates than for other groups. We hypothesize that this arises artificially from a mismatch between commonly unstated norms, in the form of markedness, in the pretraining text of LLMs (e.g., "Black president" vs. "president") and the templates used for bias measurement (e.g., "Black president" vs. "White president"). These findings highlight the potentially misleading impact of varying group membership through explicit mention in counterfactual bias quantification.
DOI: 10.48550/arxiv.2404.03471
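
To make the probing setup concrete, below is a minimal sketch of template-based counterfactual bias measurement as the summary describes it: a template's group term is swapped while everything else is held fixed, and the task outcome is compared across groups. The templates, group terms, and the off-the-shelf sentiment classifier are illustrative assumptions, not the paper's actual benchmark.

```python
# Minimal sketch of template-based counterfactual bias probing.
# Assumptions: the templates, group terms, and the generic sentiment
# model below are hypothetical stand-ins, not the paper's setup.
from transformers import pipeline

# Any text classifier can stand in for "a task performed by an LLM".
classifier = pipeline("sentiment-analysis")

templates = [
    "The {group} president gave a speech about the economy.",
    "My {group} neighbor organized the block party.",
]
groups = ["Black", "White", "Asian"]

for template in templates:
    scores = {}
    for group in groups:
        text = template.format(group=group)
        result = classifier(text)[0]
        # Signed score: positive sentiment > 0, negative < 0.
        sign = 1.0 if result["label"] == "POSITIVE" else -1.0
        scores[group] = sign * result["score"]
    # Counterfactual invariance: a bias-free task would yield the
    # same outcome regardless of the group term in the template.
    spread = max(scores.values()) - min(scores.values())
    print(f"{template!r}: {scores} spread={spread:.3f}")
```

Note that the paper's hypothesis bears directly on this setup: the "White" variant of such templates is explicitly marked, whereas pretraining text typically leaves the majority group unstated ("president" rather than "White president"), so a nonzero spread here may reflect that markedness mismatch rather than genuine bias.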