A Sequential Non-Parametric Multivariate Two-Sample Test

Bibliographic Details
Published in	IEEE Transactions on Information Theory, Vol. 64, no. 5, pp. 3361-3370
Main Authors	Lheritier, Alix; Cazals, Frederic
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.05.2018
Subjects	Bayes factor; Bayesian analysis; Bayesian mixtures; Computation; Computational Geometry; Computer Science; Datasets; Hypothesis testing; Information theory; Multiplexing; non-parametric two-sample test; Nonparametric statistics; Probabilistic logic; Random variables; regression; sequential prediction; Sociology; State of the art; Statistical analysis; Statistical methods; Statistics; switch distributions; Switches; Testing; universal distributions
Summary:	Given samples from two distributions, a non-parametric two-sample test aims to determine whether the two distributions are equal, based on a test statistic. Classically, this statistic is computed on the whole data set, or on a subset of the data set by a function trained on its complement. We consider methods in a third tier, so as to deal with large (possibly infinite) data sets and to automatically determine the most relevant scales to work at, and make two contributions. First, we develop a generic sequential non-parametric testing framework in which the sample size need not be fixed in advance. This makes our test a truly sequential non-parametric multivariate two-sample test. Under information-theoretic conditions qualifying the difference between the tested distributions, consistency of the two-sample test is established. Second, we instantiate our framework using nearest-neighbor regressors, and show how the power of the resulting two-sample test can be improved using Bayesian mixtures and switch distributions. This combination of techniques yields automatic scale selection, and experiments on challenging data sets show that our sequential tests perform comparably to state-of-the-art non-sequential tests.
ISSN:	0018-9448, 1557-9654
DOI:	10.1109/TIT.2018.2800658
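
Illustration:	The summary describes a sequential test built from sequential prediction, a Bayes factor, nearest-neighbor regressors, and Bayesian mixtures. The sketch below is one possible reading of that idea, not the authors' implementation. It assumes that each observation arrives as a pair (x, label), that the label says which of the two samples x came from, and that labels are drawn with a known probability (here 1/2). Several k-nearest-neighbor predictors guess each label from past data, their likelihoods are combined in a uniform Bayesian mixture, and the test stops as soon as the ratio of the mixture likelihood to the null likelihood reaches 1/alpha. The function names, the add-one smoothing, the choice of k values, and the toy data generator are all hypothetical, and switch distributions are not covered.

```python
import math
import random


def knn_label_prob(history, x, k):
    """Estimate P(label = 1 | x) from the k nearest past points.

    Add-one smoothing keeps the estimate strictly inside (0, 1),
    so the log-likelihood below is always finite.
    """
    if not history:
        return 0.5
    nearest = sorted(history, key=lambda p: math.dist(p[0], x))[:k]
    ones = sum(label for _, label in nearest)
    return (ones + 1.0) / (len(nearest) + 2.0)


def sequential_two_sample_test(stream, ks=(1, 3, 9), alpha=0.05, prior_one=0.5):
    """Sequential test sketch. `stream` yields (x, label), label in {0, 1}.

    Under the null (equal distributions) labels are independent of x, so the
    likelihood ratio is a non-negative martingale and crossing 1/alpha at any
    time has probability at most alpha (Ville's inequality).
    Returns (rejected, n_used, log_bayes_factor).
    """
    history = []
    # log(prior weight * cumulative likelihood) for each k-NN predictor
    log_w = [math.log(1.0 / len(ks))] * len(ks)
    log_null = 0.0
    log_bf = 0.0
    threshold = math.log(1.0 / alpha)
    for n, (x, label) in enumerate(stream, start=1):
        # Predict the label BEFORE adding the point to the history.
        for i, k in enumerate(ks):
            p1 = knn_label_prob(history, x, k)
            log_w[i] += math.log(p1 if label == 1 else 1.0 - p1)
        log_null += math.log(prior_one if label == 1 else 1.0 - prior_one)
        history.append((x, label))
        # Log of the Bayesian mixture likelihood (log-sum-exp for stability).
        m = max(log_w)
        log_mix = m + math.log(sum(math.exp(v - m) for v in log_w))
        log_bf = log_mix - log_null
        if log_bf >= threshold:
            return True, n, log_bf
    return False, len(history), log_bf


def make_stream(n, shift, seed=0):
    """Hypothetical toy data: 2-D Gaussians, shifted by `shift` when label is 1."""
    rng = random.Random(seed)
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        mu = shift if label == 1 else 0.0
        yield (rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)), label


if __name__ == "__main__":
    print(sequential_two_sample_test(make_stream(400, 3.0, seed=1)))
    print(sequential_two_sample_test(make_stream(200, 0.0, seed=2)))
```

Mixing over several values of k is a simple stand-in for the automatic scale selection mentioned in the summary: the mixture likelihood is dominated by whichever neighborhood size predicts best on the data seen so far. The brute-force neighbor search is quadratic in the number of observations and is only meant for small examples.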