TY - GEN

T1 - A fast, consistent kernel two-sample test

AU - Gretton, Arthur

AU - Fukumizu, Kenji

AU - Harchaoui, Zaid

AU - Sriperumbudur, Bharath K.

PY - 2009/12/1

Y1 - 2009/12/1

N2 - A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an infinite weighted sum of χ² random variables. Prior finite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and fitting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high dimensional multivariate data, and text.

AB - A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an infinite weighted sum of χ² random variables. Prior finite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and fitting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high dimensional multivariate data, and text.

UR - http://www.scopus.com/inward/record.url?scp=80053164096&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053164096&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:80053164096

SN - 9781615679119

T3 - Advances in Neural Information Processing Systems 22 - Proceedings of the 2009 Conference

SP - 673

EP - 681

BT - Advances in Neural Information Processing Systems 22 - Proceedings of the 2009 Conference

T2 - 23rd Annual Conference on Neural Information Processing Systems, NIPS 2009

Y2 - 7 December 2009 through 10 December 2009

ER -