Finding High-Value Training Data Subset Through Differentiable Convex Programming

Bibliographic Details
Published in: Machine Learning and Knowledge Discovery in Databases. Research Track, Vol. 12976, pp. 666-681
Main Authors: Das, Soumi; Singh, Arshdeep; Chatterjee, Saptarshi; Bhattacharya, Suparna; Bhattacharya, Sourangshu
Format: Book Chapter
Language: English
Published: Switzerland: Springer International Publishing AG, 2021
Series: Lecture Notes in Computer Science

Summary: Finding valuable training data points for deep neural networks has been a core research challenge with many applications. In recent years, various techniques for calculating the “value” of individual training datapoints have been proposed for explaining trained models. However, the value of a training datapoint also depends on the other datapoints selected alongside it, a notion not explicitly captured by existing methods. In this paper, we study the problem of selecting high-value subsets of training data. The key idea is to design a learnable framework for online subset selection that can be trained using mini-batches of training data, making our method scalable. This yields a parameterised convex subset selection problem amenable to a differentiable convex programming paradigm, which allows us to learn the parameters of the selection model end-to-end. Using this framework, we design an online alternating minimization based algorithm for jointly learning the parameters of the selection model and the ML model. Extensive evaluation on a synthetic dataset and three standard datasets shows that our algorithm consistently finds higher-value subsets of training data than recent state-of-the-art methods, sometimes ∼20% higher value than existing methods. The subsets are also useful for finding mislabelled training data. Our algorithm's running time is comparable to that of existing valuation functions.
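To make the mechanics of the abstract concrete, below is a minimal sketch of the two ideas it names: a parameterised convex subset-selection problem embedded as a differentiable layer, trained jointly with the ML model by online alternating minimization. This is not the authors' released code; it assumes the cvxpylayers library for differentiable convex programming, and the budget K, the scoring network score_net, and the per-example value proxy are all illustrative placeholders, not the paper's exact objective.

```python
import torch
import cvxpy as cp
from cvxpylayers.torch import CvxpyLayer

torch.manual_seed(0)
B, K, D = 32, 8, 10   # mini-batch size, subset budget, feature dim (assumed)

# Convex relaxation of per-batch subset selection: soft indicator weights
# w in [0,1]^B with sum(w) = K, scored by a learned parameter s. The small
# quadratic term makes the optimum unique and smoothly differentiable in s.
w = cp.Variable(B)
s = cp.Parameter(B)
problem = cp.Problem(
    cp.Maximize(s @ w - 0.5 * cp.sum_squares(w)),
    [w >= 0, w <= 1, cp.sum(w) == K],
)
select_layer = CvxpyLayer(problem, parameters=[s], variables=[w])

# Assumed components: a scoring network as the selection model, and the
# ML model being trained; both shapes are illustrative.
score_net = torch.nn.Sequential(
    torch.nn.Linear(D, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
model = torch.nn.Linear(D, 2)
opt_sel = torch.optim.Adam(score_net.parameters(), lr=1e-3)
opt_model = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = torch.nn.CrossEntropyLoss(reduction="none")

for step in range(100):                 # online: one mini-batch at a time
    x = torch.randn(B, D)               # synthetic stand-in data
    y = (x[:, 0] > 0).long()

    # (a) ML-model step on the selected (soft-weighted) batch, selection frozen.
    with torch.no_grad():
        (w_sel,) = select_layer(score_net(x).squeeze(-1))
    loss_model = (w_sel * loss_fn(model(x), y)).sum() / K
    opt_model.zero_grad(); loss_model.backward(); opt_model.step()

    # (b) Selection-model step: gradients flow from the objective through the
    # convex program's solution map back into score_net. `value` is an assumed
    # per-example data-value proxy (here, negated current loss); the paper
    # defines its own value criterion.
    (w_sel,) = select_layer(score_net(x).squeeze(-1))
    value = -loss_fn(model(x), y).detach()
    loss_sel = -(w_sel * value).sum()
    opt_sel.zero_grad(); loss_sel.backward(); opt_sel.step()
```

The quadratic regulariser in the selection objective is the key design choice for differentiability: without it the problem is a linear program whose solution jumps discontinuously as the scores change, so gradients of the selected weights with respect to the learned scores would not be well defined.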
ISBN: 3030865193, 9783030865191
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/978-3-030-86520-7_41