Identifying value mappings for data integration

An unsupervised approach

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

The Web is a distributed network of information sources where the individual sources are autonomously created and maintained. Consequently, syntactic and semantic heterogeneity of data among sources abounds. Most current data cleaning solutions assume that data values referencing the same object bear some textual similarity. However, this assumption is often violated in practice. "Two-door front wheel drive" can be represented as "2DR-FWD" or "R2FD", or even as "CAR TYPE 3" in different data sources. To address this problem, we propose a novel two-step automated technique that exploits statistical dependency structures among objects, which are invariant to the tokens representing the objects. The algorithm achieved high accuracy in our empirical study, suggesting that it can be a useful addition to existing information integration techniques.
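
The abstract does not spell out the two steps, so the following is only a minimal sketch of the invariance property it relies on, not the authors' algorithm. It assumes mutual information as the dependency statistic (the paper may use a different measure), and the column values, recoding tables, and the function name `mutual_information` are all hypothetical. It shows that the dependency between two attributes is unchanged when every value is replaced by an opaque code.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long categorical columns."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of the first column
    py = Counter(ys)            # marginal counts of the second column
    pxy = Counter(zip(xs, ys))  # joint counts of value pairs
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Hypothetical source A: body style and drive train with readable tokens.
body_a = ["2DR", "2DR", "4DR", "4DR", "2DR", "4DR"]
drive_a = ["FWD", "FWD", "AWD", "AWD", "FWD", "FWD"]

# Hypothetical source B: the same records under opaque, textually unrelated codes.
recode_body = {"2DR": "TYPE 3", "4DR": "TYPE 7"}
recode_drive = {"FWD": "X1", "AWD": "X9"}
body_b = [recode_body[v] for v in body_a]
drive_b = [recode_drive[v] for v in drive_a]

mi_a = mutual_information(body_a, drive_a)
mi_b = mutual_information(body_b, drive_b)

# The statistic depends only on co-occurrence counts, not on the spelling of tokens.
print(f"source A: {mi_a:.4f} bits, source B: {mi_b:.4f} bits")
```

Because the score is computed from counts alone, string similarity between "2DR" and "TYPE 3" plays no role, which is the situation the abstract describes.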

Original language: English (US)
Title of host publication: Web Information Systems Engineering, WISE 2005 - 6th International Conference on Web Information Systems Engineering, Proceedings
Pages: 544-551
Number of pages: 8
DOI: https://doi.org/10.1007/11581062_46
State: Published - Dec 1 2005
Event: 6th International Conference on Web Information Systems Engineering, WISE 2005 - New York, NY, United States
Duration: Nov 20 2005 to Nov 22 2005

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3806 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 6th International Conference on Web Information Systems Engineering, WISE 2005
Country: United States
City: New York, NY
Period: 11/20/05 to 11/22/05

Fingerprint

Data integration
Syntactics
Cleaning
Wheels
Semantics
Information Integration
Distributed Networks
Wheel
Empirical Study
High Accuracy
Invariant
Object

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Kang, J., Lee, D., & Mitra, P. (2005). Identifying value mappings for data integration: An unsupervised approach. In Web Information Systems Engineering, WISE 2005 - 6th International Conference on Web Information Systems Engineering, Proceedings (pp. 544-551). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 3806 LNCS). https://doi.org/10.1007/11581062_46
@inproceedings{6cf740a4ce864d5f90124d2fb4eb8bfd,
title = "Identifying value mappings for data integration: An unsupervised approach",
author = "Jaewoo Kang and Dongwon Lee and Prasenjit Mitra",
year = "2005",
month = "12",
day = "1",
doi = "10.1007/11581062_46",
language = "English (US)",
isbn = "3540300171",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "544--551",
booktitle = "Web Information Systems Engineering, WISE 2005 - 6th International Conference on Web Information Systems Engineering, Proceedings",

}

TY - GEN
T1 - Identifying value mappings for data integration
T2 - An unsupervised approach
AU - Kang, Jaewoo
AU - Lee, Dongwon
AU - Mitra, Prasenjit
PY - 2005/12/1
Y1 - 2005/12/1
UR - http://www.scopus.com/inward/record.url?scp=33744788355&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33744788355&partnerID=8YFLogxK
U2 - 10.1007/11581062_46
DO - 10.1007/11581062_46
M3 - Conference contribution
SN - 3540300171
SN - 9783540300175
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 544
EP - 551
BT - Web Information Systems Engineering, WISE 2005 - 6th International Conference on Web Information Systems Engineering, Proceedings
ER -
