Efficient distribution estimation for data with unobserved sub-population identifiers

Yanyuan Ma, Yuanjia Wang

Research output: Contribution to journal › Article

6 Citations (Scopus)

Abstract

We study efficient nonparametric estimation of the distribution functions of several scientifically meaningful sub-populations from data consisting of mixed samples in which the sub-population identifiers are missing. Only the probabilities of each observation belonging to each sub-population are available. The problem arises in several biomedical studies, such as quantitative trait locus (QTL) analysis and genetic studies with ungenotyped relatives, where the scientific interest lies in estimating the cumulative distribution function of a trait given a specific genotype. In these studies, however, subjects' genotypes may not be directly observed, so the distribution of the trait outcome is a mixture of several genotype-specific distributions. We characterize the complete class of consistent estimators, which includes members such as one type of nonparametric maximum likelihood estimator (NPMLE) and least squares or weighted least squares estimators. We identify the efficient estimator in the class, which reaches the semiparametric efficiency bound, and we implement it using a simple procedure that remains consistent even if several components of the estimator are mis-specified. In addition, close inspection of two commonly used NPMLEs in these problems shows, surprisingly, that the NPMLE in one form is highly inefficient, while in the other form it is inconsistent. We provide simulation studies to illustrate the theoretical results and demonstrate the proposed methods through two real data examples.
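To make the setting concrete, here is a minimal sketch of the ordinary least squares member of the class mentioned in the abstract, restricted to two sub-populations. It relies on the mixture identity P(Y_i <= t) = q_i1 F_1(t) + q_i2 F_2(t), where the membership probabilities q_i are known, and regresses the indicator 1(Y_i <= t) on q_i. The function names, the simulated design, and the normal sub-populations are assumptions made for illustration; this is not the paper's efficient estimator or its implementation.

```python
import random


def ols_mixture_cdf(y_obs, q_obs, t):
    """Pointwise OLS estimate of (F_1(t), F_2(t)) for a two-component mixture.

    y_obs: observed outcomes (sub-population labels are NOT observed).
    q_obs: known (q1, q2) membership probabilities for each observation.
    t:     point at which both distribution functions are estimated.
    """
    # Accumulate the 2x2 normal equations: (sum q q^T) F = sum q * 1(y <= t).
    a11 = a12 = a22 = b1 = b2 = 0.0
    for y, (q1, q2) in zip(y_obs, q_obs):
        ind = 1.0 if y <= t else 0.0
        a11 += q1 * q1
        a12 += q1 * q2
        a22 += q2 * q2
        b1 += q1 * ind
        b2 += q2 * ind
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        # Happens when every observation has the same q, so F_1, F_2 are not identified.
        raise ValueError("membership probabilities do not vary enough")
    # Solve the symmetric 2x2 system in closed form.
    f1 = (a22 * b1 - a12 * b2) / det
    f2 = (a11 * b2 - a12 * b1) / det
    return f1, f2


def simulate(n, seed=0):
    """Hypothetical data: sub-population 1 is N(0, 1), sub-population 2 is N(2, 1)."""
    rng = random.Random(seed)
    y_obs, q_obs = [], []
    for _ in range(n):
        q1 = rng.choice([0.25, 0.5, 0.75])      # known membership probability
        in_first = rng.random() < q1            # latent label, discarded below
        y_obs.append(rng.gauss(0.0 if in_first else 2.0, 1.0))
        q_obs.append((q1, 1.0 - q1))
    return y_obs, q_obs


if __name__ == "__main__":
    y, q = simulate(20000, seed=1)
    # True values at t = 1 are about 0.841 and 0.159 for this simulated design.
    print(ols_mixture_cdf(y, q, 1.0))
```

Because the fit is done separately at each `t` and without constraints, the estimates from this sketch need not be monotone in `t` or lie in [0, 1]. When every `q` is (1, 0) or (0, 1), the estimate reduces to the two empirical distribution functions.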

Original language: English (US)
Pages (from-to): 710-737
Number of pages: 28
Journal: Electronic Journal of Statistics
Volume: 6
DOI: 10.1214/12-EJS690
State: Published - Dec 1 2012

Fingerprint

Genotype
Nonparametric Maximum Likelihood Estimator
Semiparametric Efficiency
Weighted Least Squares Estimator
Quantitative Trait Loci
Efficient Estimator
Efficient Estimation
Consistent Estimator
Cumulative distribution function
Nonparametric Estimation
Inconsistent
Least Squares
Inspection
Distribution Function
Estimator
Demonstrate
Simulation
Class
Form
Maximum likelihood estimator

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Statistics, Probability and Uncertainty

Cite this

@article{75a960a4ff8f4c8fbefcd5f97cb6609d,
title = "Efficient distribution estimation for data with unobserved sub-population identifiers",
abstract = "We study efficient nonparametric estimation of distribution functions of several scientifically meaningful sub-populations from data consisting of mixed samples where the sub-population identifiers are missing. Only probabilities of each observation belonging to a sub-population are available. The problem arises from several biomedical studies such as quantitative trait locus (QTL) analysis and genetic studies with ungenotyped relatives where the scientific interest lies in estimating the cumulative distribution function of a trait given a specific genotype. However, in these studies subjects' genotypes may not be directly observed. The distribution of the trait outcome is therefore a mixture of several genotype-specific distributions. We characterize the complete class of consistent estimators which includes members such as one type of nonparametric maximum likelihood estimator (NPMLE) and least squares or weighted least squares estimators. We identify the efficient estimator in the class that reaches the semiparametric efficiency bound, and we implement it using a simple procedure that remains consistent even if several components of the estimator are mis-specified. In addition, our close inspections on two commonly used NPMLEs in these problems show the surprising results that the NPMLE in one form is highly inefficient, while in the other form is inconsistent. We provide simulation procedures to illustrate the theoretical results and demonstrate the proposed methods through two real data examples.",
author = "Yanyuan Ma and Yuanjia Wang",
year = "2012",
month = "12",
day = "1",
doi = "10.1214/12-EJS690",
language = "English (US)",
volume = "6",
pages = "710--737",
journal = "Electronic Journal of Statistics",
issn = "1935-7524",
publisher = "Institute of Mathematical Statistics",
}

Efficient distribution estimation for data with unobserved sub-population identifiers. / Ma, Yanyuan; Wang, Yuanjia.

In: Electronic Journal of Statistics, Vol. 6, 01.12.2012, p. 710-737.

TY - JOUR

T1 - Efficient distribution estimation for data with unobserved sub-population identifiers

AU - Ma, Yanyuan

AU - Wang, Yuanjia

PY - 2012/12/1

Y1 - 2012/12/1

N2 - We study efficient nonparametric estimation of distribution functions of several scientifically meaningful sub-populations from data consisting of mixed samples where the sub-population identifiers are missing. Only probabilities of each observation belonging to a sub-population are available. The problem arises from several biomedical studies such as quantitative trait locus (QTL) analysis and genetic studies with ungenotyped relatives where the scientific interest lies in estimating the cumulative distribution function of a trait given a specific genotype. However, in these studies subjects' genotypes may not be directly observed. The distribution of the trait outcome is therefore a mixture of several genotype-specific distributions. We characterize the complete class of consistent estimators which includes members such as one type of nonparametric maximum likelihood estimator (NPMLE) and least squares or weighted least squares estimators. We identify the efficient estimator in the class that reaches the semiparametric efficiency bound, and we implement it using a simple procedure that remains consistent even if several components of the estimator are mis-specified. In addition, our close inspections on two commonly used NPMLEs in these problems show the surprising results that the NPMLE in one form is highly inefficient, while in the other form is inconsistent. We provide simulation procedures to illustrate the theoretical results and demonstrate the proposed methods through two real data examples.

UR - http://www.scopus.com/inward/record.url?scp=84871980607&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84871980607&partnerID=8YFLogxK

U2 - 10.1214/12-EJS690

DO - 10.1214/12-EJS690

M3 - Article

AN - SCOPUS:84871980607

VL - 6

SP - 710

EP - 737

JO - Electronic Journal of Statistics

JF - Electronic Journal of Statistics

SN - 1935-7524

ER -