Making clusterings fairer by post-processing: algorithms, complexity results and experiments
Published in | Data mining and knowledge discovery Vol. 37; no. 4; pp. 1404 - 1440 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | New York, Springer US, 01.07.2023; Springer Nature B.V |
Summary: | While existing fairness work typically focuses on fair-by-design algorithms, here we consider making a fairness-unaware algorithm’s output fairer. Specifically, we explore the area of fairness in clustering by modifying clusterings produced by existing algorithms to make them fairer whilst retaining their quality. We formulate the minimal cluster modification for fairness (MCMF) problem, where the input is a given partitional clustering and the goal is to minimally change it so that the clustering is still of good quality but fairer. We show that for a single binary protected status variable, the problem is efficiently solvable (i.e., in the class P) by proving that the constraint matrix for an integer linear programming formulation is totally unimodular. Interestingly, we show that even for a single protected variable, the addition of simple pairwise guidance for clustering (say, to ensure individual-level fairness) makes the MCMF problem computationally intractable (i.e., NP-hard). Experimental results using Twitter, Census and NYT data sets show that our methods can modify existing clusterings for data sets in excess of 100,000 instances within minutes on laptops and find clusterings that are as fair but are of higher quality than those produced by fair-by-design clustering algorithms. Finally, we explore a challenging practical problem of making a historical clustering (i.e., zipcodes clustered into California’s congressional districts) fairer using a new multi-faceted benchmark data set. |
---|---|
ISSN: | 1384-5810 1573-756X |
DOI: | 10.1007/s10618-022-00893-6 |
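
The tractability claim in the summary rests on total unimodularity: a matrix is totally unimodular when every square submatrix has determinant -1, 0 or 1, and an integer program with such a constraint matrix and integer right-hand sides can be solved through its linear programming relaxation. The sketch below is only an illustration of that property. The summary does not give the paper's formulation, so `assignment_matrix` builds a hypothetical toy matrix (each point is assigned to exactly one cluster, and each cluster has a count row), and the brute-force check is practical only for very small matrices.

```python
from itertools import combinations


def det(m):
    """Integer determinant by Laplace expansion (fine for tiny matrices)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for c in range(n):
        minor = [row[:c] + row[c + 1:] for row in m[1:]]
        total += (-1) ** c * m[0][c] * det(minor)
    return total


def is_totally_unimodular(a):
    """Brute force: every square submatrix must have determinant in {-1, 0, 1}."""
    rows, cols = len(a), len(a[0])
    for k in range(1, min(rows, cols) + 1):
        for rs in combinations(range(rows), k):
            for cs in combinations(range(cols), k):
                sub = [[a[r][c] for c in cs] for r in rs]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


def assignment_matrix(n_points, n_clusters):
    """Hypothetical toy constraint matrix (an assumption, not the paper's ILP).

    Column p * n_clusters + j stands for the variable x[p][j]
    ("point p is placed in cluster j"). The first n_points rows sum the
    variables of one point; the last n_clusters rows sum the variables
    of one cluster.
    """
    a = []
    for p in range(n_points):
        a.append([1 if q == p else 0
                  for q in range(n_points) for _ in range(n_clusters)])
    for j in range(n_clusters):
        a.append([1 if i == j else 0
                  for _ in range(n_points) for i in range(n_clusters)])
    return a


if __name__ == "__main__":
    toy = assignment_matrix(3, 2)
    print(is_totally_unimodular(toy))
    # An odd cycle's incidence matrix is a standard non-example.
    print(is_totally_unimodular([[1, 1, 0], [0, 1, 1], [1, 0, 1]]))
```

The toy matrix is the incidence matrix of a bipartite graph (points on one side, clusters on the other), a standard totally unimodular family, whereas the 3-by-3 odd-cycle matrix has determinant 2 and fails the check.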