Statistical methodology for massive datasets and model selection

G. Jogesh Babu, James P. McDermott

Research output: Contribution to journal › Conference article

Abstract

Astronomy is facing a revolution in the collection, storage, analysis, and interpretation of large datasets. The data volumes involved are several orders of magnitude larger than those astronomers and statisticians are used to dealing with, and the old methods simply do not work. The National Virtual Observatory (NVO) initiative has recently emerged in recognition of this need: its aim is to federate numerous large digital sky archives, both ground-based and space-based, and to develop tools to explore and understand these vast volumes of data. In this paper, we address some of the critically important statistical challenges raised by the NVO. In particular, we present a low-storage, single-pass, sequential method for the simultaneous estimation of multiple quantiles of a massive dataset. Density estimation based on this procedure and a multivariate extension are also discussed. The NVO also requires statistical tools for analyzing databases of moderate size. Model selection is an important issue for many astrophysical databases. We present a simple likelihood-based 'leave-one-out' method for selecting the best among several possible alternatives, and compare its performance to that of methods based on the Akaike Information Criterion and the Bayesian Information Criterion.
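
The abstract does not spell out the estimation algorithm, so the following is only a rough Python sketch of what a low-storage, single-pass estimator of several quantiles can look like: a Robbins-Monro-type stochastic-approximation update scaled by a crude running density estimate. The class name, the burn-in initialization, and the bandwidth choice are assumptions made for illustration and are not taken from the paper.

import numpy as np

class StreamingQuantiles:
    """Track several quantiles of a data stream using O(k) memory.

    Illustrative sketch only; not necessarily the algorithm of
    Babu & McDermott (2002).
    """

    def __init__(self, probs, burn_in=100):
        self.probs = np.asarray(probs, dtype=float)  # quantile levels in (0, 1)
        self.burn_in = burn_in                       # initialize from exact quantiles of a short prefix
        self.buffer = []                             # used only during burn-in
        self.n = 0
        self.q = None                                # current quantile estimates
        self.f = None                                # running density estimates at q

    def update(self, x):
        self.n += 1
        if self.q is None:
            # Burn-in: hold a small prefix and initialize from its empirical quantiles.
            self.buffer.append(x)
            if len(self.buffer) == self.burn_in:
                self.q = np.quantile(self.buffer, self.probs)
                spread = np.std(self.buffer) + 1e-12
                self.f = np.full_like(self.q, 1.0 / spread)  # rough initial density guess
                self.buffer = None
            return
        # Shrinking window used to refresh the local density estimates.
        h = max(np.ptp(self.q), 1e-12) / self.n ** 0.25
        near = (np.abs(x - self.q) <= h).astype(float)
        self.f += (near / (2.0 * h) - self.f) / self.n
        # Robbins-Monro step toward each target quantile level.
        self.q += (self.probs - (x <= self.q)) / (self.n * np.maximum(self.f, 1e-12))

    def quantiles(self):
        return self.q

# Usage: estimate the quartiles of 200,000 standard normal draws in a single pass.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sq = StreamingQuantiles([0.25, 0.5, 0.75])
    for x in rng.standard_normal(200_000):
        sq.update(x)
    print(sq.quantiles())  # roughly [-0.674, 0.0, 0.674]

Only the current estimates, the running density values, and a small burn-in buffer are kept in memory, so the storage cost does not grow with the number of observations.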
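For the model-selection part, a minimal sketch of a likelihood-based 'leave-one-out' score, reported next to AIC and BIC for a few candidate parametric families, is given below. The candidate families, the use of scipy maximum-likelihood fits, and the exact form of the cross-validated score are illustrative assumptions; the criterion studied in the paper may be defined differently.

import numpy as np
from scipy import stats

# name: (scipy family, keyword args fixing nuisance parameters, number of free parameters)
CANDIDATES = {
    "normal":    (stats.norm,    {},            2),
    "gamma":     (stats.gamma,   {"floc": 0.0}, 2),  # shape and scale; location pinned at 0
    "lognormal": (stats.lognorm, {"floc": 0.0}, 2),  # shape and scale; location pinned at 0
}

def loo_log_likelihood(dist, x, fit_kwargs):
    """Sum of log f(x_i; theta fitted without x_i): a leave-one-out likelihood score."""
    total = 0.0
    for i in range(len(x)):
        rest = np.delete(x, i)
        params = dist.fit(rest, **fit_kwargs)   # MLE on the remaining points
        total += dist.logpdf(x[i], *params)
    return total

def compare_models(x):
    n = len(x)
    rows = []
    for name, (dist, fit_kwargs, k) in CANDIDATES.items():
        params = dist.fit(x, **fit_kwargs)
        loglik = np.sum(dist.logpdf(x, *params))
        aic = 2 * k - 2 * loglik
        bic = k * np.log(n) - 2 * loglik
        loo = loo_log_likelihood(dist, x, fit_kwargs)
        rows.append((name, loo, aic, bic))
    return rows

# Usage: the largest leave-one-out log-likelihood and the smallest AIC or BIC
# point to the preferred model.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.gamma(shape=2.0, scale=1.5, size=200)  # data simulated from a gamma model
    for name, loo, aic, bic in compare_models(data):
        print(f"{name:10s}  LOO log-lik {loo:9.2f}   AIC {aic:8.2f}   BIC {bic:8.2f}")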

Original language: English (US)
Pages (from-to): 228-237
Number of pages: 10
Journal: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 4847
DOI: 10.1117/12.460339
State: Published - Dec 1 2002
Event: Astronomical Data Analysis II - Waikoloa, HI, United States
Duration: Aug 27 2002 - Aug 28 2002

Fingerprint

Observatories, Model selection, Methodology, Quantiles, Simultaneous estimation, Sequential methods, Akaike Information Criterion, Bayesian Information Criterion, Astronomy, Density estimation, Large datasets, Sky, Likelihood, Astrophysics, Alternatives

All Science Journal Classification (ASJC) codes

  • Electronic, Optical and Magnetic Materials
  • Condensed Matter Physics
  • Computer Science Applications
  • Applied Mathematics
  • Electrical and Electronic Engineering

Cite this

@article{f7766c857c98455ab88fb59833754926,
title = "Statistical methodology for massive datasets and model selection",
abstract = "Astronomy is facing a revolution in data collection, storage, analysis, and interpretation of large datasets. The data volumes here are several orders of magnitude larger than what astronomers and statisticians are used to dealing with, and the old methods simply do not work. The National Virtual Observatory (NVO) initiative has recently emerged in recognition of this need and to federate numerous large digital sky archives, both ground based and space based, and develop tools to explore and understand these vast volumes of data. In this paper, we address some of the critically important statistical challenges raised by the NVO. In particular a low-storage, single-pass, sequential method for simultaneous estimation of multiple quantiles for massive datasets will be presented. Density estimation based on this procedure and a multivariate extension will also be discussed. The NVO also requires statistical tools to analyze moderate size databases. Model selection is an important issue for many astrophysical databases. We present a simple likelihood based 'leave one out' method to select the best among the several possible alternatives. The performance of the method is compared to those based on Akaike Information Criterion and Bayesian Information Criterion.",
author = "Babu, {G. Jogesh} and McDermott, {James P.}",
year = "2002",
month = "12",
day = "1",
doi = "10.1117/12.460339",
language = "English (US)",
volume = "4847",
pages = "228--237",
journal = "Proceedings of SPIE - The International Society for Optical Engineering",
issn = "0277-786X",
publisher = "SPIE",

}

Statistical methodology for massive datasets and model selection. / Babu, G. Jogesh; McDermott, James P.

In: Proceedings of SPIE - The International Society for Optical Engineering, Vol. 4847, 01.12.2002, p. 228-237.

TY - JOUR

T1 - Statistical methodology for massive datasets and model selection

AU - Babu, G. Jogesh

AU - McDermott, James P.

PY - 2002/12/1

Y1 - 2002/12/1

UR - http://www.scopus.com/inward/record.url?scp=0038640217&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0038640217&partnerID=8YFLogxK

U2 - 10.1117/12.460339

DO - 10.1117/12.460339

M3 - Conference article

AN - SCOPUS:0038640217

VL - 4847

SP - 228

EP - 237

JO - Proceedings of SPIE - The International Society for Optical Engineering

JF - Proceedings of SPIE - The International Society for Optical Engineering

SN - 0277-786X

ER -