DeepClean: Data cleaning via question asking

Xinyang Zhang, Yujie Ji, Chanh Nguyen, Ting Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As one critical task in the data analysis pipeline, data cleaning is notoriously labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the natural limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not outright impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) Pattern extraction: it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) Question generation: it translates each data tuple into a minimal set of validation questions; (iii) Completion and repair: by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.
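The three stages named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Pattern` class, the question template, and the `toy_ask` lookup table are all hypothetical stand-ins introduced here for illustration. In particular, stage (i) is hand-specified rather than discovered automatically, the question set is not minimized, and the QA interface is a dictionary rather than a free-text knowledge source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


@dataclass(frozen=True)
class Pattern:
    """Hypothetical result of stage (i): one attribute determines another."""
    key: str        # attribute used to phrase the question
    dependent: str  # attribute whose value the answer should match
    template: str   # question template containing a "{key}" placeholder


def generate_questions(row: Row, patterns: List[Pattern]) -> List[Tuple[Pattern, str]]:
    """Stage (ii), simplified: one validation question per usable pattern."""
    questions = []
    for p in patterns:
        if row.get(p.key):  # skip patterns whose key value is missing
            questions.append((p, p.template.format(key=row[p.key])))
    return questions


def complete_and_repair(
    row: Row,
    patterns: List[Pattern],
    ask: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Stage (iii), simplified: compare answers with the row, suggest fixes."""
    fixes: Dict[str, str] = {}
    for p, question in generate_questions(row, patterns):
        answer = ask(question)
        if answer is None:
            continue  # no answer available; leave the value untouched
        current = row.get(p.dependent)
        if current is None or current.strip().lower() != answer.strip().lower():
            fixes[p.dependent] = answer  # missing or mismatching value
    return fixes


# Assumption: a lookup table standing in for a real QA interface.
TOY_ANSWERS = {
    "What is the capital of France?": "Paris",
    "What is the capital of Italy?": "Rome",
}


def toy_ask(question: str) -> Optional[str]:
    return TOY_ANSWERS.get(question)


# Assumption: this pattern is written by hand, not extracted from the data.
patterns = [Pattern("country", "capital", "What is the capital of {key}?")]
rows = [
    {"country": "France", "capital": "Lyon"},  # erroneous value
    {"country": "Italy", "capital": None},     # missing value
    {"country": "Italy", "capital": "Rome"},   # clean tuple
]
suggestions = [complete_and_repair(r, patterns, toy_ask) for r in rows]
```

Here `suggestions` holds one dictionary of proposed fixes per tuple; an empty dictionary means no defect was detected. Passing the QA interface in as a plain callable keeps the sketch independent of any particular knowledge source.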

Original language: English (US)
Title of host publication: Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018
Editors: Tina Eliassi-Rad, Wei Wang, Ciro Cattuto, Foster Provost, Rayid Ghani, Francesco Bonchi
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 283-292
Number of pages: 10
ISBN (Electronic): 9781538650905
DOI: https://doi.org/10.1109/DSAA.2018.00039
State: Published - Jan 31 2019
Event: 5th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2018 - Turin, Italy
Duration: Oct 1 2018 to Oct 4 2018

Publication series

Name: Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018

Conference

Conference: 5th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2018
Country: Italy
City: Turin
Period: 10/1/18 to 10/4/18

Fingerprint

Cleaning
Repair
Defects
Knowledge Base
Data cleaning
Pipelines
Semantics
Personnel
Question Answering
Wikipedia
Minimal Set
Leverage
Empirical Study
Completion
Data analysis
Attribute
Knowledge

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Information Systems and Management
  • Statistics, Probability and Uncertainty
  • Computer Networks and Communications

Cite this

Zhang, X., Ji, Y., Nguyen, C., & Wang, T. (2019). DeepClean: Data cleaning via question asking. In T. Eliassi-Rad, W. Wang, C. Cattuto, F. Provost, R. Ghani, & F. Bonchi (Eds.), Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018 (pp. 283-292). [8631426] (Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/DSAA.2018.00039
Zhang, Xinyang; Ji, Yujie; Nguyen, Chanh; Wang, Ting. / DeepClean: Data cleaning via question asking. Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018. editor / Tina Eliassi-Rad; Wei Wang; Ciro Cattuto; Foster Provost; Rayid Ghani; Francesco Bonchi. Institute of Electrical and Electronics Engineers Inc., 2019. pp. 283-292 (Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018).
@inproceedings{2a0bcde7e83a4e3d81fe89ef8f60c021,
title = "DeepClean: Data cleaning via question asking",
abstract = "As one critical task in the data analysis pipeline, data cleaning is notoriously human labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the natural limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not at all impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) Pattern extraction - it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) Question generation - it translates each data tuple into a minimal set of validation questions; (iii) Completion and repair - by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains, and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.",
author = "Xinyang Zhang and Yujie Ji and Chanh Nguyen and Ting Wang",
year = "2019",
month = "1",
day = "31",
doi = "10.1109/DSAA.2018.00039",
language = "English (US)",
series = "Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "283--292",
editor = "Tina Eliassi-Rad and Wei Wang and Ciro Cattuto and Foster Provost and Rayid Ghani and Francesco Bonchi",
booktitle = "Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018",
address = "United States",
}

Zhang, X, Ji, Y, Nguyen, C & Wang, T 2019, DeepClean: Data cleaning via question asking. in T Eliassi-Rad, W Wang, C Cattuto, F Provost, R Ghani & F Bonchi (eds), Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018, 8631426, Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018, Institute of Electrical and Electronics Engineers Inc., pp. 283-292, 5th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2018, Turin, Italy, 10/1/18. https://doi.org/10.1109/DSAA.2018.00039

DeepClean: Data cleaning via question asking. / Zhang, Xinyang; Ji, Yujie; Nguyen, Chanh; Wang, Ting.

Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018. ed. / Tina Eliassi-Rad; Wei Wang; Ciro Cattuto; Foster Provost; Rayid Ghani; Francesco Bonchi. Institute of Electrical and Electronics Engineers Inc., 2019. p. 283-292 8631426 (Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018).

TY - GEN

T1 - DeepClean: Data cleaning via question asking

AU - Zhang, Xinyang

AU - Ji, Yujie

AU - Nguyen, Chanh

AU - Wang, Ting

PY - 2019/1/31

Y1 - 2019/1/31

AB - As one critical task in the data analysis pipeline, data cleaning is notoriously human labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the natural limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not at all impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) Pattern extraction - it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) Question generation - it translates each data tuple into a minimal set of validation questions; (iii) Completion and repair - by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains, and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.

UR - http://www.scopus.com/inward/record.url?scp=85062867879&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85062867879&partnerID=8YFLogxK

U2 - 10.1109/DSAA.2018.00039

DO - 10.1109/DSAA.2018.00039

M3 - Conference contribution

AN - SCOPUS:85062867879

T3 - Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018

SP - 283

EP - 292

BT - Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018

A2 - Eliassi-Rad, Tina

A2 - Wang, Wei

A2 - Cattuto, Ciro

A2 - Provost, Foster

A2 - Ghani, Rayid

A2 - Bonchi, Francesco

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Zhang X, Ji Y, Nguyen C, Wang T. DeepClean: Data cleaning via question asking. In Eliassi-Rad T, Wang W, Cattuto C, Provost F, Ghani R, Bonchi F, editors, Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018. Institute of Electrical and Electronics Engineers Inc. 2019. p. 283-292. 8631426. (Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018). https://doi.org/10.1109/DSAA.2018.00039