Group linkage

Byung Won On, Nick Koudas, Dongwon Lee, Divesh Srivastava

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

41 Citations (Scopus)

Abstract

Poor-quality data is prevalent in databases for a variety of reasons, including transcription errors and a lack of standards for recording database fields. To query and integrate such data, considerable recent work has focused on the record linkage problem, i.e., determining whether two entities represented as relational records are approximately the same. Entities are often represented as groups of relational records rather than as individual records; e.g., a household in a census survey consists of a group of persons. We refer to the problem of determining whether two entities represented as groups are approximately the same as group linkage. Intuitively, two groups can be linked to each other if (i) there is high enough similarity between "matching" pairs of individual records that constitute the two groups, and (ii) there is a large fraction of such matching record pairs. In this paper, we formalize this intuition and propose a group linkage measure based on bipartite graph matching. Given a data set consisting of a large number of groups, efficiently finding groups with a high group linkage similarity to an input query group requires quickly eliminating the many groups that are unlikely to be desired matches. To enable this task, we present simpler group similarity measures that can be used either during fast pre-processing steps or as approximations to our proposed group linkage measure. These measures can be easily instantiated using SQL, permitting our techniques to be implemented inside the database system itself. We experimentally validate the utility of our measures and techniques using a variety of real and synthetic data sets.
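
The intuition in the abstract (strongly similar matching record pairs, and a large fraction of them) can be illustrated with a small sketch. The abstract does not spell out the measure, so everything below is an assumption made for illustration: the token-based `jaccard` record similarity, the threshold `rho`, the function name `group_linkage`, and the normalization by `|g1| + |g2| - |M|` (where `M` is the chosen matching) are not taken from this page. The matching is found by brute force over assignments, which is only practical for very small groups.

```python
from itertools import permutations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two record strings (illustrative choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def group_linkage(g1, g2, sim=jaccard, rho=0.5):
    """Score two groups of records via maximum-weight bipartite matching.

    Assumed form: sum of similarities over the matching M, divided by
    |g1| + |g2| - |M|. Record pairs with similarity below rho have no edge.
    """
    if not g1 or not g2:
        return 0.0
    # Iterate over injections from the smaller group into the larger one.
    small, large = (g1, g2) if len(g1) <= len(g2) else (g2, g1)

    def weight(a, b):
        s = sim(a, b)
        return s if s >= rho else 0.0  # below-threshold pairs are not edges

    w = [[weight(a, b) for b in large] for a in small]

    best_total, best_size = 0.0, 0
    # Brute force: exponential in group size, fine for a toy example only.
    for perm in permutations(range(len(large)), len(small)):
        pairs = [(i, j) for i, j in enumerate(perm) if w[i][j] > 0.0]
        total = sum(w[i][j] for i, j in pairs)
        # Prefer higher total weight; break ties in favor of more matched pairs.
        if (total, len(pairs)) > (best_total, best_size):
            best_total, best_size = total, len(pairs)

    return best_total / (len(g1) + len(g2) - best_size)


if __name__ == "__main__":
    household_a = ["john smith", "mary smith"]
    household_b = ["john smith", "mary smyth", "bob jones"]
    print(group_linkage(household_a, household_b, rho=0.3))
    print(group_linkage(household_a, household_b, rho=0.5))
```

Raising `rho` removes weak record pairs from the matching, so the score reflects both how similar the matched pairs are and how many records on each side found a partner. For realistic group sizes, a polynomial-time assignment algorithm would replace the brute-force loop.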

Original language: English (US)
Title of host publication: 23rd International Conference on Data Engineering, ICDE 2007
Pages: 496-505
Number of pages: 10
DOI: https://doi.org/10.1109/ICDE.2007.367895
State: Published - Sep 24 2007
Event: 23rd International Conference on Data Engineering, ICDE 2007 - Istanbul, Turkey
Duration: Apr 15 2007 to Apr 20 2007

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Information Systems

Cite this

On, B. W., Koudas, N., Lee, D., & Srivastava, D. (2007). Group linkage. In 23rd International Conference on Data Engineering, ICDE 2007 (pp. 496-505). [4221698] https://doi.org/10.1109/ICDE.2007.367895

On, Byung Won; Koudas, Nick; Lee, Dongwon; Srivastava, Divesh. / Group linkage. 23rd International Conference on Data Engineering, ICDE 2007. 2007. pp. 496-505.

@inproceedings{f77e740855e34bcba5365b47541e0a44,
title = "Group linkage",
author = "On, {Byung Won} and Nick Koudas and Dongwon Lee and Divesh Srivastava",
year = "2007",
month = "9",
day = "24",
doi = "10.1109/ICDE.2007.367895",
language = "English (US)",
isbn = "1424408032",
pages = "496--505",
booktitle = "23rd International Conference on Data Engineering, ICDE 2007",
}

On, BW, Koudas, N, Lee, D & Srivastava, D 2007, Group linkage. in 23rd International Conference on Data Engineering, ICDE 2007., 4221698, pp. 496-505, 23rd International Conference on Data Engineering, ICDE 2007, Istanbul, Turkey, 4/15/07. https://doi.org/10.1109/ICDE.2007.367895

TY - GEN

T1 - Group linkage

AU - On, Byung Won

AU - Koudas, Nick

AU - Lee, Dongwon

AU - Srivastava, Divesh

PY - 2007/9/24

Y1 - 2007/9/24

UR - http://www.scopus.com/inward/record.url?scp=34548725521&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34548725521&partnerID=8YFLogxK

U2 - 10.1109/ICDE.2007.367895

DO - 10.1109/ICDE.2007.367895

M3 - Conference contribution

SN - 1424408032

SN - 9781424408030

SP - 496

EP - 505

BT - 23rd International Conference on Data Engineering, ICDE 2007

ER -

On BW, Koudas N, Lee D, Srivastava D. Group linkage. In 23rd International Conference on Data Engineering, ICDE 2007. 2007. p. 496-505. 4221698 https://doi.org/10.1109/ICDE.2007.367895