Making clusterings fairer by post-processing: algorithms, complexity results and experiments

Bibliographic Details
Published in: Data mining and knowledge discovery, Vol. 37, no. 4, pp. 1404-1440
Main Authors: Davidson, Ian; Bai, Zilong; Tran, Cindy Mylinh; Ravi, S. S.
Format: Journal Article
Language: English
Published: New York, Springer US, 01.07.2023
Springer Nature B.V.

Summary: While existing fairness work typically focuses on fair-by-design algorithms, here we consider making a fairness-unaware algorithm’s output fairer. Specifically, we explore fairness in clustering by modifying clusterings produced by existing algorithms to make them fairer whilst retaining their quality. We formulate the minimal cluster modification for fairness (MCMF) problem, where the input is a given partitional clustering and the goal is to change it minimally so that the clustering remains of good quality but becomes fairer. We show that for a single binary protected status variable, the problem is efficiently solvable (i.e., in the class P) by proving that the constraint matrix of an integer linear programming formulation is totally unimodular. Interestingly, we show that even for a single protected variable, adding simple pairwise guidance for clustering (say, to ensure individual-level fairness) makes the MCMF problem computationally intractable (i.e., NP-hard). Experimental results on Twitter, Census and NYT data sets show that our methods can modify existing clusterings of data sets with more than 100,000 instances within minutes on laptops, and find clusterings that are as fair as, but of higher quality than, those produced by fair-by-design clustering algorithms. Finally, we explore the challenging practical problem of making a historical clustering (i.e., zipcodes clustered into California’s congressional districts) fairer using a new multi-faceted benchmark data set.
ISSN: 1384-5810, 1573-756X
DOI: 10.1007/s10618-022-00893-6
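
Illustrative note: the summary describes the MCMF problem as minimally changing a given clustering so that it becomes fairer. The sketch below is only a toy illustration of that problem statement, not the authors' method. It assumes a hypothetical fairness test (each cluster's fraction of protected instances must lie within `lower` and `upper`), counts reassigned instances as a stand-in for modification cost, ignores cluster quality, and uses exhaustive search, which is exponential and suitable only for tiny inputs. The function names and the data are invented for illustration.

```python
from itertools import product


def is_fair(assignment, protected, k, lower, upper):
    """Hypothetical fairness test (an assumption, not the paper's definition):
    every cluster is non-empty and its protected fraction is in [lower, upper]."""
    sizes = [0] * k
    counts = [0] * k
    for cluster, is_protected in zip(assignment, protected):
        sizes[cluster] += 1
        if is_protected:
            counts[cluster] += 1
    return all(
        sizes[c] > 0 and lower <= counts[c] / sizes[c] <= upper
        for c in range(k)
    )


def minimal_modification(initial, protected, k, lower, upper):
    """Brute-force search over all k**n assignments for a fair one that
    reassigns the fewest instances. Returns (assignment, moves) or (None, None)."""
    best, best_moves = None, None
    for candidate in product(range(k), repeat=len(initial)):
        if not is_fair(candidate, protected, k, lower, upper):
            continue
        # Modification cost: number of instances whose cluster changed.
        moves = sum(old != new for old, new in zip(initial, candidate))
        if best_moves is None or moves < best_moves:
            best, best_moves = candidate, moves
    return best, best_moves


# Toy data: 8 instances in 2 clusters; True marks a protected instance.
initial = [0, 0, 0, 0, 1, 1, 1, 1]
protected = [True, True, True, False, False, False, False, True]

best, moves = minimal_modification(initial, protected, k=2, lower=0.4, upper=0.6)
print("modified clustering:", best, "instances moved:", moves)
```

In this toy instance the initial clusters have protected fractions of 0.75 and 0.25, and no single reassignment brings both into the assumed range, so at least two instances must move. The polynomial-time result stated in the summary relies on an integer linear programming formulation with a totally unimodular constraint matrix, which this enumeration does not attempt to reproduce.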