The statistical estimation and inference questions considered in this project arise in many areas of science and engineering that rely on statistical research. Areas of high societal impact include social and behavioral research, forensic science, early detection of covert communications and breaches in data security, early-warning systems for identifying outbreaks of infectious diseases, cognitive development studies in children, and drug discovery, among many others. In many of these areas, commonly used statistical methods make strong assumptions about the data-generating distribution. Such simplistic approaches may be unjustified; moreover, even when these assumptions are plausible, they need to be tested. This research project investigates a powerful alternative to these existing methods and aims to develop a fundamental theoretical understanding of it, leading to novel statistical applications. Code for the algorithms that result from this project will be made publicly available for ready use.
The kernel method is a class of statistical methodology that has gained popularity in statistical learning due to its ability to handle both high-dimensional and non-Euclidean data. The core idea of the method is to map observed data into a function space, called the reproducing kernel Hilbert space (RKHS), which allows the capture of non-linear relationships in the data. This project concerns theoretical and methodological research on a generalization of this method in which probability measures, rather than individual data points, are embedded in an RKHS. This generalization has wide applicability in statistical learning problems such as nonparametric hypothesis testing, density estimation, and regression on distributions, all of which will be explored in this project.

On the theoretical front, the characterization of injectivity of the kernel embedding will be considered. While such a characterization is well understood for kernels defined on locally compact Abelian groups and compact non-Abelian groups, this project will investigate the injectivity of the kernel embedding for non-standard spaces such as nuclear spaces, the space of graphs, and the positive definite cone. The injectivity of the embedding is known to be related to the richness of the RKHS in approximating a certain class of functions. The research will investigate the rate of this approximation, which turns out to be critical in analyzing the convergence rates of kernel-based regression and density estimators and the separation rates in hypothesis testing. An injective embedding induces a metric on the space of probability measures, called the kernel distance, defined as the RKHS distance between the kernel embeddings of two probability measures. The investigator will study the relation of the kernel distance to other probability metrics, such as the energy distance, distance covariance, f-divergences, and integral probability metrics, in order to understand the statistical and computational (dis)advantages associated with these distances.
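To make the kernel distance concrete, the sketch below estimates it from two samples using a Gaussian kernel. The squared distance between the embeddings of distributions P and Q expands into averages of pairwise kernel evaluations, which is what the estimator computes; the kernel choice, bandwidth, and sample sizes here are illustrative assumptions, not specifics from the project.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Y."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))

def kernel_distance_sq(X, Y, bandwidth=1.0):
    """Plug-in estimate of the squared kernel distance between the
    distributions generating samples X and Y: the squared RKHS norm
    ||mu_P - mu_Q||^2 expanded into pairwise kernel averages."""
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))   # sample from P
Y = rng.normal(0.5, 1.0, size=(200, 2))   # sample from Q (shifted mean)
Z = rng.normal(0.0, 1.0, size=(200, 2))   # independent second sample from P

# The estimated distance between different distributions should exceed
# the same-distribution baseline, which is the basis of kernel two-sample tests.
print(kernel_distance_sq(X, Y), kernel_distance_sq(X, Z))
```

Because the Gaussian kernel yields an injective embedding on Euclidean space, the population distance is zero exactly when P = Q, which is what makes this statistic usable for nonparametric two-sample testing.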
These theoretical studies have an applied counterpart, wherein the RKHS embedding plays a critical role in the problems of regression on probability measures and density estimation in infinite dimensional exponential families. For these problems, the investigator plans to develop computationally efficient estimators with theoretical guarantees. Overall, the project aims to develop a comprehensive theory of RKHS embedding of probability measures with applications to problems in statistical learning.
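One way to see how the RKHS embedding enables regression on probability measures is to represent each observed distribution (a bag of samples) by an approximate kernel mean embedding and then regress the response on that feature vector. The sketch below uses random Fourier features to approximate a Gaussian-kernel embedding and ridge regression as the second stage; the synthetic task (predicting the mean of each sampled distribution), the feature count, and the regularization constant are all illustrative assumptions, not the project's method.

```python
import numpy as np

rng = np.random.default_rng(1)

def rff_mean_embedding(bag, W, b):
    """Approximate the kernel mean embedding of a bag of points by
    averaging random Fourier features of a Gaussian kernel."""
    phi = np.sqrt(2.0 / W.shape[0]) * np.cos(bag @ W.T + b)
    return phi.mean(axis=0)

D, d = 100, 1                       # number of random features, input dimension
W = rng.normal(size=(D, d))         # frequencies for a unit-bandwidth Gaussian kernel
b = rng.uniform(0.0, 2.0 * np.pi, D)

# Hypothetical training data: each observation is a bag of 100 draws
# from N(mu, 1), and the response is the (unknown) mean mu.
mus = rng.uniform(-2.0, 2.0, size=50)
Phi = np.vstack([rff_mean_embedding(rng.normal(mu, 1.0, size=(100, d)), W, b)
                 for mu in mus])

# Ridge regression from mean-embedding features to responses.
lam = 1e-2
beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ mus)

# Predict the mean of a fresh, unseen distribution from a new bag.
test_bag = rng.normal(1.0, 1.0, size=(100, d))
pred = float(rff_mean_embedding(test_bag, W, b) @ beta)
print(pred)
```

The design choice to illustrate here is that the regression never sees the distributions themselves, only their finite-sample embeddings; the approximation rates studied in this project govern how the error of such two-stage estimators behaves.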
Effective start/end date: 7/1/17 → 6/30/21
- National Science Foundation: $178,251.00