Computing confidence intervals from massive data via penalized quantile smoothing splines

Likun Zhang, Enrique del Castillo, Andrew J. Berglund, Martin P. Tingley, Nirmal Govind

Research output: Contribution to journal › Article › peer-review


New methodology is presented for the computation of pointwise confidence intervals from massive response data sets in one or two covariates using robust and flexible quantile regression splines. Novel aspects of the method include a new cross-validation procedure for selecting the penalization coefficient and a reformulation of the quantile smoothing problem based on a weighted data representation. These innovations permit uncertainty quantification and fast parameter selection in very large data sets via a distributed “bag of little bootstraps”. Experiments with synthetic data demonstrate that the computed confidence intervals achieve empirical coverage rates that are generally within 2% of the nominal rates. The approach is broadly applicable to the analysis of large data sets in one or two dimensions. Comparative (or “A/B”) experiments conducted at Netflix, aimed at optimizing the quality of streaming video, originally motivated this work, but the proposed methods have general applicability. The methodology is illustrated using an open source application: the comparison of geo-spatial climate model scenarios from NASA's Earth Exchange.
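To make the “bag of little bootstraps” (BLB) idea concrete, the sketch below applies the generic BLB scheme to a simple estimator, a sample quantile, using only NumPy. This is an illustration of the resampling strategy the abstract refers to, not the paper's penalized quantile smoothing spline method; the function names (`blb_quantile_ci`, `weighted_quantile`) and all tuning values are hypothetical choices for the example.

```python
import numpy as np


def weighted_quantile(values, weights, q):
    """Return the q-th quantile of `values` under integer resampling `weights`."""
    order = np.argsort(values)
    v = np.asarray(values)[order]
    w = np.asarray(weights)[order]
    cum = np.cumsum(w)
    # First value whose cumulative weight reaches q of the total weight.
    return v[np.searchsorted(cum, q * cum[-1])]


def blb_quantile_ci(data, q=0.5, s=10, gamma=0.6, r=50, alpha=0.05, rng=None):
    """Bag-of-little-bootstraps CI for the q-th quantile (illustrative sketch).

    Each of `s` little subsets has size b = n**gamma.  Every subset is
    resampled `r` times with multinomial weights that sum to n, so each
    resample mimics a full-size bootstrap sample while the computation
    only ever touches b distinct points -- this is what makes BLB cheap
    to distribute over massive data sets.
    """
    rng = rng if isinstance(rng, np.random.Generator) else np.random.default_rng(rng)
    data = np.asarray(data)
    n = len(data)
    b = int(n ** gamma)
    lo_ends, hi_ends = [], []
    for _ in range(s):
        subset = rng.choice(data, size=b, replace=False)
        estimates = []
        for _ in range(r):
            # Compact representation of an n-point resample of the subset.
            w = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(weighted_quantile(subset, w, q))
        lo_ends.append(np.percentile(estimates, 100 * alpha / 2))
        hi_ends.append(np.percentile(estimates, 100 * (1 - alpha / 2)))
    # BLB averages the per-subset interval endpoints.
    return float(np.mean(lo_ends)), float(np.mean(hi_ends))
```

For example, on a large standard-normal sample, `blb_quantile_ci(data, q=0.5)` returns a narrow interval around the true median of 0. In the paper's setting the per-resample estimator would instead be the weighted penalized quantile smoothing spline fit, which is where the weighted data representation pays off.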

Original language: English (US)
Article number: 106885
Journal: Computational Statistics and Data Analysis
State: Published - Apr 2020

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Computational Mathematics
  • Computational Theory and Mathematics
  • Applied Mathematics

