Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments

  • Anqi Zhang,
  • Jinkun Lin,
  • Aurojit Panda,
  • Jinyang Li,
  • Mathias Lécuyer

ICML 2022

We develop a new, principled algorithm for estimating the contribution of training data points to the behavior of a deep learning model, such as a specific prediction it makes or its accuracy on a test set. We define a new quantity, the AME, which measures the expected (average) marginal effect of adding a data point to a subset of the training data, sampled from a given distribution. When the subsets are sampled from the uniform distribution, the AME reduces to the well-known Shapley Value. Our approach is inspired by causal inference and randomized experiments: we sample different subsets of the training data, train a separate submodel on each subset, and then evaluate the behavior of each submodel. We then use a LASSO regression to estimate the AME of each data point based on the composition of the subsets. Under sparsity assumptions, our estimator requires only O(k log n) randomized submodel trainings, where n is the number of data points and k of them have a large AME, making it the most scalable approach to date. In many settings, this also yields a more efficient estimator for the Shapley Value than was previously known. We extend our estimator to support control over its false positive rate using the Knockoffs method; we also extend it to support hierarchical data. We demonstrate the practicality of our estimator by applying it to several data poisoning and model explanation tasks, across a variety of datasets.
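
As a rough illustration of the subset-sampling and LASSO-regression step described above, here is a minimal Python sketch. The `utility` function is a hypothetical placeholder for "train a submodel on the given subset and measure the behavior of interest"; the 0/1 membership encoding, the independent-inclusion subset sampling, and the use of cross-validated LASSO are simplifying assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n_points = 100      # number of training data points
n_subsets = 400     # number of randomized submodel trainings

def utility(subset_mask):
    # Placeholder: in practice, train a submodel on the points where
    # subset_mask == 1 and return the measured behavior
    # (e.g., a prediction's confidence or test accuracy).
    # Here we simulate a sparse ground truth where only 5 points matter.
    true_ame = np.zeros(n_points)
    true_ame[:5] = 1.0
    return subset_mask @ true_ame + rng.normal(scale=0.1)

# Sample random subsets: each point included independently with probability 0.5.
X = rng.integers(0, 2, size=(n_subsets, n_points)).astype(float)
y = np.array([utility(mask) for mask in X])

# LASSO regression on subset membership: the fitted coefficients serve as
# estimates of each data point's AME.
ame_estimates = LassoCV(cv=5).fit(X, y).coef_
print("Top contributing points:", np.argsort(-ame_estimates)[:5])
```

The sparsity assumption is what makes this practical: when only a few points have a large AME, the LASSO can recover their coefficients from far fewer submodel trainings than data points.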