Computing confidence intervals from massive data via penalized quantile smoothing splines

Likun Zhang, Enrique del Castillo, Andrew J. Berglund, Martin P. Tingley, Nirmal Govind

Research output: Contribution to journal › Article

Abstract

New methodology is presented for the computation of pointwise confidence intervals from massive response data sets in one or two covariates using robust and flexible quantile regression splines. Novel aspects of the method include a new cross-validation procedure for selecting the penalization coefficient and a reformulation of the quantile smoothing problem based on a weighted data representation. These innovations permit uncertainty quantification and fast parameter selection in very large data sets via a distributed “bag of little bootstraps”. Experiments with synthetic data demonstrate that the computed confidence intervals feature empirical coverage rates that are generally within 2% of the nominal rates. The approach is broadly applicable to the analysis of large data sets in one or two dimensions. Comparative (or “A/B”) experiments conducted at Netflix aimed at optimizing the quality of streaming video originally motivated this work, but the proposed methods have general applicability. The methodology is illustrated using an open source application: the comparison of geo-spatial climate model scenarios from NASA's Earth Exchange.
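To make the resampling idea in the abstract concrete, here is a minimal, illustrative sketch (not the authors' implementation) of pointwise confidence intervals via the "bag of little bootstraps" (BLB). For simplicity, a crude local sample quantile stands in for the paper's penalized quantile smoothing spline; all function names and parameter values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = sin(x) plus heteroscedastic noise, mimicking a
# one-covariate response surface.
n = 20_000
x = rng.uniform(0.0, 2.0 * np.pi, n)
y = np.sin(x) + rng.normal(0.0, 0.3 + 0.2 * np.abs(np.cos(x)))

def window_quantile(xs, ys, x0, tau=0.5, h=0.2):
    """Crude local estimate of the tau-th conditional quantile at x0
    (a stand-in for the penalized quantile smoothing spline)."""
    mask = np.abs(xs - x0) < h
    return np.quantile(ys[mask], tau)

def blb_ci(xs, ys, x0, tau=0.5, s=5, r=30, alpha=0.05):
    """Bag of little bootstraps: split the data into s disjoint subsets of
    size b = n // s; within each subset, draw r resamples of the FULL size n
    (with replacement) and form a percentile CI; average the s CIs."""
    n = len(xs)
    idx = rng.permutation(n)
    b = n // s
    lowers, uppers = [], []
    for j in range(s):
        sub = idx[j * b:(j + 1) * b]          # one "little" subset
        stats = []
        for _ in range(r):
            # Resampling the subset up to size n emulates the multinomial
            # reweighting used in BLB, so each subset's bootstrap reflects
            # sampling variability at the full-data scale.
            boot = rng.choice(sub, size=n, replace=True)
            stats.append(window_quantile(xs[boot], ys[boot], x0, tau))
        lowers.append(np.quantile(stats, alpha / 2))
        uppers.append(np.quantile(stats, 1 - alpha / 2))
    return float(np.mean(lowers)), float(np.mean(uppers))

lo, hi = blb_ci(x, y, x0=np.pi / 2, tau=0.5)
print(f"95% pointwise CI for the median at x = pi/2: ({lo:.3f}, {hi:.3f})")
```

Because the s subset-level bootstraps are independent, they can run on separate machines and only the per-subset interval endpoints need to be aggregated, which is what makes the scheme attractive for the massive data sets discussed in the paper.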

Original language: English (US)
Article number: 106885
Journal: Computational Statistics and Data Analysis
Volume: 144
DOI: 10.1016/j.csda.2019.106885
State: Published - Apr 2020


All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Computational Mathematics
  • Computational Theory and Mathematics
  • Applied Mathematics

Cite this

@article{e502e3d2b4704b90bf43bd88d7aca1bd,
title = "Computing confidence intervals from massive data via penalized quantile smoothing splines",
abstract = "New methodology is presented for the computation of pointwise confidence intervals from massive response data sets in one or two covariates using robust and flexible quantile regression splines. Novel aspects of the method include a new cross-validation procedure for selecting the penalization coefficient and a reformulation of the quantile smoothing problem based on a weighted data representation. These innovations permit uncertainty quantification and fast parameter selection in very large data sets via a distributed “bag of little bootstraps”. Experiments with synthetic data demonstrate that the computed confidence intervals feature empirical coverage rates that are generally within 2{\%} of the nominal rates. The approach is broadly applicable to the analysis of large data sets in one or two dimensions. Comparative (or “A/B”) experiments conducted at Netflix aimed at optimizing the quality of streaming video originally motivated this work, but the proposed methods have general applicability. The methodology is illustrated using an open source application: the comparison of geo-spatial climate model scenarios from NASA's Earth Exchange.",
author = "Likun Zhang and {del Castillo}, Enrique and Berglund, {Andrew J.} and Tingley, {Martin P.} and Nirmal Govind",
year = "2020",
month = apr,
doi = "10.1016/j.csda.2019.106885",
language = "English (US)",
volume = "144",
journal = "Computational Statistics and Data Analysis",
issn = "0167-9473",
publisher = "Elsevier",

}

Computing confidence intervals from massive data via penalized quantile smoothing splines. / Zhang, Likun; del Castillo, Enrique; Berglund, Andrew J.; Tingley, Martin P.; Govind, Nirmal.

In: Computational Statistics and Data Analysis, Vol. 144, 106885, 04.2020.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Computing confidence intervals from massive data via penalized quantile smoothing splines

AU - Zhang, Likun

AU - del Castillo, Enrique

AU - Berglund, Andrew J.

AU - Tingley, Martin P.

AU - Govind, Nirmal

PY - 2020/4

Y1 - 2020/4

N2 - New methodology is presented for the computation of pointwise confidence intervals from massive response data sets in one or two covariates using robust and flexible quantile regression splines. Novel aspects of the method include a new cross-validation procedure for selecting the penalization coefficient and a reformulation of the quantile smoothing problem based on a weighted data representation. These innovations permit uncertainty quantification and fast parameter selection in very large data sets via a distributed “bag of little bootstraps”. Experiments with synthetic data demonstrate that the computed confidence intervals feature empirical coverage rates that are generally within 2% of the nominal rates. The approach is broadly applicable to the analysis of large data sets in one or two dimensions. Comparative (or “A/B”) experiments conducted at Netflix aimed at optimizing the quality of streaming video originally motivated this work, but the proposed methods have general applicability. The methodology is illustrated using an open source application: the comparison of geo-spatial climate model scenarios from NASA's Earth Exchange.

AB - New methodology is presented for the computation of pointwise confidence intervals from massive response data sets in one or two covariates using robust and flexible quantile regression splines. Novel aspects of the method include a new cross-validation procedure for selecting the penalization coefficient and a reformulation of the quantile smoothing problem based on a weighted data representation. These innovations permit uncertainty quantification and fast parameter selection in very large data sets via a distributed “bag of little bootstraps”. Experiments with synthetic data demonstrate that the computed confidence intervals feature empirical coverage rates that are generally within 2% of the nominal rates. The approach is broadly applicable to the analysis of large data sets in one or two dimensions. Comparative (or “A/B”) experiments conducted at Netflix aimed at optimizing the quality of streaming video originally motivated this work, but the proposed methods have general applicability. The methodology is illustrated using an open source application: the comparison of geo-spatial climate model scenarios from NASA's Earth Exchange.

UR - http://www.scopus.com/inward/record.url?scp=85075519534&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85075519534&partnerID=8YFLogxK

U2 - 10.1016/j.csda.2019.106885

DO - 10.1016/j.csda.2019.106885

M3 - Article

AN - SCOPUS:85075519534

VL - 144

JO - Computational Statistics and Data Analysis

JF - Computational Statistics and Data Analysis

SN - 0167-9473

M1 - 106885

ER -