TY - JOUR
T1 - Incorporating Measurement Error in Astronomical Object Classification
AU - Shy, Sarah
AU - Tak, Hyungsuk
AU - Feigelson, Eric D.
AU - Timlin, John D.
AU - Babu, G. Jogesh
N1 - Funding Information:
S.S. and H.T. appreciate the Pennsylvania State University’s Institute for Computational and Data Sciences for its computational support via the Roar supercomputer. H.T. thanks the Kavli Foundation and AURA for travel support to the workshop, Petabytes to Science, held in Boston in 2019. J.D.T. appreciates support from NASA ADP grant 80NSSC18K0878, Chandra X-ray Center grant GO0-21080X, the V. M. Willaman Endowment, and Penn State ACIS Instrument Team Contract SV4-74018 (issued by the Chandra X-ray Center, which is operated by the Smithsonian Astrophysical Observatory for and on behalf of NASA under contract NAS8-03060). We thank Jackeline Moreno and Weixiang Yu for their helpful discussions about this work.
Publisher Copyright:
© 2022. The Author(s). Published by the American Astronomical Society.
PY - 2022/7/1
Y1 - 2022/7/1
N2 - Most general-purpose classification methods, such as support-vector machine (SVM) and random forest (RF), fail to account for an unusual characteristic of astronomical data: known measurement error uncertainties. In astronomical data, this information is often given in the data but discarded because popular machine learning classifiers cannot incorporate it. We propose a simulation-based approach that incorporates heteroscedastic measurement error into an existing classification method to better quantify uncertainty in classification. The proposed method first simulates perturbed realizations of the data from a Bayesian posterior predictive distribution of a Gaussian measurement error model. Then, a chosen classifier is fit to each simulation. The variation across the simulations naturally reflects the uncertainty propagated from the measurement errors in both labeled and unlabeled data sets. We demonstrate the use of this approach via two numerical studies. The first is a thorough simulation study applying the proposed procedure to SVM and RF, which are well-known hard and soft classifiers, respectively. The second study is a realistic classification problem of identifying high-z (2.9 ≤ z ≤ 5.1) quasar candidates from photometric data. The data are from merged catalogs of the Sloan Digital Sky Survey, the Spitzer IRAC Equatorial Survey, and the Spitzer-HETDEX Exploratory Large-Area Survey. The proposed approach reveals that out of 11,847 high-z quasar candidates identified by a random forest without incorporating measurement error, 3146 are potential misclassifications with measurement error. Additionally, out of 1.85 million objects not identified as high-z quasars without measurement error, 936 can be considered new candidates with measurement error.
UR - http://www.scopus.com/inward/record.url?scp=85132948336&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85132948336&partnerID=8YFLogxK
U2 - 10.3847/1538-3881/ac6e64
DO - 10.3847/1538-3881/ac6e64
M3 - Article
AN - SCOPUS:85132948336
SN - 0004-6256
VL - 164
JO - Astronomical Journal
JF - Astronomical Journal
IS - 1
M1 - 6
ER -