Counterfactual Fairness in Text Classification through Robustness
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 27.09.2018 |
Subjects | |
Summary: | In this paper, we study counterfactual fairness in text classification, which asks the question: How would the prediction change if the sensitive attribute referenced in the example were different? Toxicity classifiers demonstrate a counterfactual fairness issue by predicting that "Some people are gay" is toxic while "Some people are straight" is nontoxic. We offer a metric, counterfactual token fairness (CTF), for measuring this particular form of fairness in text classifiers, and describe its relationship with group fairness. Further, we offer three approaches, blindness, counterfactual augmentation, and counterfactual logit pairing (CLP), for optimizing counterfactual token fairness during training, bridging the robustness and fairness literature. Empirically, we find that blindness and CLP address counterfactual token fairness. The methods do not harm classifier performance, and have varying tradeoffs with group fairness. These approaches, both for measurement and optimization, provide a new path forward for addressing fairness concerns in text classification. |
---|---|
DOI: | 10.48550/arxiv.1809.10610 |
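The summary above names both a measurement (the CTF gap between a sentence and its identity-token counterfactuals) and a training-time remedy (CLP, which pairs logits across such counterfactuals). Below is a minimal, illustrative sketch of both ideas, assuming a classifier exposed as a plain `predict(text) -> score` callable; the identity-token list, the simple word-level substitution, and names such as `make_counterfactuals`, `ctf_gap`, and `clp_penalty` are assumptions for illustration, not the authors' released implementation.

```python
from typing import Callable, List

# Small example list of identity terms; the paper evaluates a broader curated set.
IDENTITY_TOKENS = ["gay", "straight", "lesbian", "bisexual", "muslim", "jewish"]


def make_counterfactuals(text: str, tokens: List[str] = IDENTITY_TOKENS) -> List[str]:
    """Build counterfactual sentences by swapping one identity token for another."""
    words = text.split()
    counterfactuals = []
    for i, word in enumerate(words):
        if word.lower() in tokens:
            for substitute in tokens:
                if substitute != word.lower():
                    counterfactuals.append(" ".join(words[:i] + [substitute] + words[i + 1:]))
    return counterfactuals


def ctf_gap(text: str, predict: Callable[[str], float]) -> float:
    """CTF-style gap: largest change in the predicted score under identity substitution."""
    base_score = predict(text)
    return max(
        (abs(predict(cf) - base_score) for cf in make_counterfactuals(text)),
        default=0.0,
    )


def clp_penalty(logit: float, counterfactual_logits: List[float], weight: float = 1.0) -> float:
    """CLP-style training term: penalize logit gaps between an example and its counterfactuals."""
    if not counterfactual_logits:
        return 0.0
    return weight * sum(abs(cf - logit) for cf in counterfactual_logits) / len(counterfactual_logits)
```

In a training loop, a CLP-style penalty of this kind would be added to the usual classification loss, with `weight` controlling the tradeoff between counterfactual token fairness and raw accuracy; at evaluation time, `ctf_gap` can be averaged over a held-out set to quantify the fairness issue described in the summary.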