General and specific utility measures for synthetic data

Joshua Snoke, Gillian M. Raab, Beata Nowok, Chris Dibben, Aleksandra Slavkovic

Research output: Contribution to journal (Article)

3 Citations (Scopus)

Abstract

Data holders can produce synthetic versions of data sets when concerns about potential disclosure restrict the availability of the original records. The paper is concerned with methods to judge whether such synthetic data have a distribution that is comparable with that of the original data: what we term general utility. We consider how general utility compares with specific utility: the similarity of results of analyses from the synthetic data and the original data. We adapt a previous general measure of data utility, the propensity score mean-squared error pMSE, to the specific case of synthetic data and derive its distribution for the case when the correct synthesis model is used to create the synthetic data. Our asymptotic results are confirmed by a simulation study. We also consider two specific utility measures, confidence interval overlap and standardized difference in summary statistics, which we compare with the general utility results. We present two contrasting examples of data syntheses: one illustrating synthetic data that is evaluated as being useful by both general and specific measures and the second where neither is the case. For the second case we show how the general utility measures can identify the deficiencies of the synthetic data and suggest how this can inform possible improvements to the synthesis method.
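The abstract names the measures but does not state their formulas. As a rough illustration only, the sketch below uses the usual definition of the pMSE: stack the original and synthetic rows, estimate each row's probability of being synthetic, and average the squared deviation of those probabilities from the synthetic share c. It also uses one common definition of confidence interval overlap, the average fraction of each interval covered by their intersection. The function names `pmse` and `ci_overlap` are hypothetical, the propensity scores are assumed to come from any classifier fitted elsewhere, and the paper's exact formulations may differ.

```python
from typing import Sequence, Tuple


def pmse(propensities: Sequence[float], n_synthetic: int) -> float:
    """Propensity score mean-squared error for a stacked data set.

    propensities: estimated probability that each row of the stacked
        (original + synthetic) data is synthetic; assumed to be fitted
        elsewhere, e.g. by a logistic regression.
    n_synthetic: number of synthetic rows in the stack.
    """
    n_total = len(propensities)
    if n_total == 0:
        raise ValueError("need at least one propensity score")
    c = n_synthetic / n_total  # share of synthetic rows in the stack
    # Values near 0 mean the classifier cannot tell the two sources apart.
    return sum((p - c) ** 2 for p in propensities) / n_total


def ci_overlap(orig: Tuple[float, float], synth: Tuple[float, float]) -> float:
    """Average fraction of each interval covered by their intersection.

    Assumed definition: 1.0 for identical intervals, negative when the
    intervals are disjoint.
    """
    (lo_o, hi_o), (lo_s, hi_s) = orig, synth
    inter = min(hi_o, hi_s) - max(lo_o, lo_s)  # negative if disjoint
    return 0.5 * (inter / (hi_o - lo_o) + inter / (hi_s - lo_s))


if __name__ == "__main__":
    # Made-up propensity scores for 3 original and 3 synthetic rows.
    scores = [0.45, 0.50, 0.55, 0.48, 0.52, 0.50]
    print("pMSE:", pmse(scores, n_synthetic=3))
    print("CI overlap:", ci_overlap((0.0, 2.0), (1.0, 3.0)))
```

Under these assumptions, a pMSE close to zero points to high general utility, while a confidence interval overlap close to one points to high specific utility for the analysis in question.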

Original language: English (US)
Pages (from-to): 663-688
Number of pages: 26
Journal: Journal of the Royal Statistical Society. Series A: Statistics in Society
Volume: 181
Issue number: 3
DOI: 10.1111/rssa.12358
State: Published - Jun 2018

Fingerprint

Synthetic Data
Synthesis
Propensity Score
Disclosure
Mean Squared Error
Confidence interval
Overlap
Availability
Simulation Study
General Terms
Statistics
Term
confidence
statistics

All Science Journal Classification (ASJC) codes

  • Statistics and Probability
  • Social Sciences (miscellaneous)
  • Economics and Econometrics
  • Statistics, Probability and Uncertainty

Cite this

Snoke, Joshua ; Raab, Gillian M. ; Nowok, Beata ; Dibben, Chris ; Slavkovic, Aleksandra. / General and specific utility measures for synthetic data. In: Journal of the Royal Statistical Society. Series A: Statistics in Society. 2018 ; Vol. 181, No. 3. pp. 663-688.
@article{78048b092b5d4433b5a59034e6c7ad80,
title = "General and specific utility measures for synthetic data",
author = "Joshua Snoke and Raab, {Gillian M.} and Beata Nowok and Chris Dibben and Aleksandra Slavkovic",
year = "2018",
month = "6",
doi = "10.1111/rssa.12358",
language = "English (US)",
volume = "181",
pages = "663--688",
journal = "Journal of the Royal Statistical Society. Series A: Statistics in Society",
issn = "0964-1998",
publisher = "Wiley-Blackwell",
number = "3",

}

TY - JOUR

T1 - General and specific utility measures for synthetic data

AU - Snoke, Joshua

AU - Raab, Gillian M.

AU - Nowok, Beata

AU - Dibben, Chris

AU - Slavkovic, Aleksandra

PY - 2018/6

Y1 - 2018/6

UR - http://www.scopus.com/inward/record.url?scp=85043373928&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85043373928&partnerID=8YFLogxK

U2 - 10.1111/rssa.12358

DO - 10.1111/rssa.12358

M3 - Article

AN - SCOPUS:85043373928

VL - 181

SP - 663

EP - 688

JO - Journal of the Royal Statistical Society. Series A: Statistics in Society

JF - Journal of the Royal Statistical Society. Series A: Statistics in Society

SN - 0964-1998

IS - 3

ER -