Adaptive Sampling Strategies to Construct Equitable Training Datasets
Main Authors | , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 31.01.2022 |
Subjects | |
Summary: | In domains ranging from computer vision to natural language processing,
machine learning models have been shown to exhibit stark disparities, often
performing worse for members of traditionally underserved groups. One factor
contributing to these performance gaps is a lack of representation in the data
the models are trained on. It is often unclear, however, how to operationalize
representativeness in specific applications. Here we formalize the problem of
creating equitable training datasets, and propose a statistical framework for
addressing this problem. We consider a setting where a model builder must
decide how to allocate a fixed data collection budget to gather training data
from different subgroups. We then frame dataset creation as a constrained
optimization problem, in which one maximizes a function of group-specific
performance metrics based on (estimated) group-specific learning rates and
costs per sample. This flexible approach incorporates preferences of
model-builders and other stakeholders, as well as the statistical properties of
the learning task. When data collection decisions are made sequentially, we
show that under certain conditions this optimization problem can be efficiently
solved even without prior knowledge of the learning rates. To illustrate our
approach, we conduct a simulation study of polygenic risk scores on synthetic
genomic data, an application domain that often suffers from
non-representative data collection. We find that our adaptive sampling strategy
outperforms several common data collection heuristics, including equal and
proportional sampling, demonstrating the value of strategic dataset design for
building equitable models. |
---|---|
DOI: | 10.48550/arxiv.2202.01327 |
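
The budget allocation problem described in the summary can be illustrated with a small sketch. This is not the authors' method: it assumes each group's error follows a power-law learning curve with known parameters, and it uses a simple greedy rule that repeatedly buys a sample from the group offering the largest weighted error reduction per unit cost. The names `predicted_error` and `allocate_budget`, and the `scale`, `rate`, `cost`, and `weight` fields, are hypothetical and chosen only for illustration. The summary's sequential setting, where learning rates are unknown and must be estimated, is not covered here.

```python
def predicted_error(n, scale, rate):
    """Assumed power-law learning curve: group error after n samples."""
    return scale * (n + 1) ** (-rate)


def allocate_budget(groups, budget, batch=1):
    """Greedily allocate a fixed budget across groups.

    groups: dict mapping group name -> dict with hypothetical keys
            'scale', 'rate' (learning curve), 'cost' (per sample),
            and 'weight' (stakeholder preference for that group).
    Returns a dict mapping group name -> number of samples to collect.
    """
    counts = {name: 0 for name in groups}
    remaining = budget
    while True:
        best_name, best_gain = None, 0.0
        for name, g in groups.items():
            price = g["cost"] * batch
            if price > remaining:
                continue  # cannot afford another batch from this group
            n = counts[name]
            drop = (predicted_error(n, g["scale"], g["rate"])
                    - predicted_error(n + batch, g["scale"], g["rate"]))
            gain = g["weight"] * drop / price  # weighted error reduction per unit cost
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            break  # nothing affordable remains
        counts[best_name] += batch
        remaining -= groups[best_name]["cost"] * batch
    return counts


# Illustrative inputs only; these numbers are made up.
groups = {
    "A": {"scale": 1.0, "rate": 0.5, "cost": 1.0, "weight": 1.0},
    "B": {"scale": 1.0, "rate": 0.25, "cost": 2.0, "weight": 1.0},
}
allocation = allocate_budget(groups, budget=100)
spent = sum(groups[name]["cost"] * k for name, k in allocation.items())
```

Changing a group's `weight` or `cost` shifts the allocation, which is one way to see how stakeholder preferences and per-sample costs enter the trade-off alongside the learning curves.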