Parallel linkage

Hung Sik Kim, Dongwon Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

32 Citations (Scopus)

Abstract

We study the parallelization of the (record) linkage problem, i.e., identifying matching records between two collections of records, A and B. One of the main idiosyncrasies of the linkage problem, compared to a database join, is that once two records a in A and b in B are matched and merged into c, c must be compared again with the remaining records in A and B, since it may produce new matches. This re-feeding stage makes any solution iterative and complicates the problem significantly. To address this, we first discuss three plausible input scenarios: both collections are clean, only one is clean, or both are dirty. We then show that the interplay between match and merge can exploit the characteristics of each scenario to achieve good parallelization. With 8 processors, our parallel algorithms achieve speedups of 6.55-7.49 times over sequential ones, and an 11.15-18.56% improvement in efficiency compared to P-Swoosh.
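The re-feeding loop described in the abstract can be illustrated with a small sequential sketch. This is not the authors' parallel algorithm and not P-Swoosh; it is a minimal illustration of the general case where both collections may be dirty. The record representation, the `match` rule (two records match if they share any attribute-value pair), and the `merge` rule (union of the pairs) are all assumptions chosen for the example.

```python
from typing import FrozenSet, List, Tuple

# Assumed representation: a record is a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]


def match(r: Record, s: Record) -> bool:
    """Hypothetical match rule: records match if they share any pair."""
    return not r.isdisjoint(s)


def merge(r: Record, s: Record) -> Record:
    """Hypothetical merge rule: the union of both records' pairs."""
    return r | s


def link(a: List[Record], b: List[Record]) -> List[Record]:
    """Sequential match-merge with re-feeding over the records of A and B."""
    pending = list(a) + list(b)   # records still waiting to be compared
    resolved: List[Record] = []   # records that do not match one another
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            resolved.remove(partner)
            # Re-feed: the merged record goes back into the queue because
            # it may match records that neither input matched on its own.
            pending.append(merge(current, partner))
    return resolved


a1 = frozenset({("name", "J. Smith"), ("email", "js@example.org")})
a2 = frozenset({("name", "John Smith"), ("phone", "555-0100")})
b1 = frozenset({("email", "js@example.org"), ("phone", "555-0100")})
other = frozenset({("name", "A. Jones")})

linked = link([a1, a2], [b1, other])
```

In this example `a1` and `a2` share no pair directly and end up in the same merged record only through `b1`, while `other` stays separate. The loop terminates because every merge reduces the total number of records by one.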

Original language: English (US)
Title of host publication: CIKM 2007 - Proceedings of the 16th ACM Conference on Information and Knowledge Management
Pages: 283-292
Number of pages: 10
DOI: https://doi.org/10.1145/1321440.1321482
State: Published - Dec 1 2007
Event: 16th ACM Conference on Information and Knowledge Management, CIKM 2007 - Lisboa, Portugal
Duration: Nov 6 2007 to Nov 9 2007

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Other

Other: 16th ACM Conference on Information and Knowledge Management, CIKM 2007
Country: Portugal
City: Lisboa
Period: 11/6/07 to 11/9/07

Fingerprint

Scenarios
Linkage
Record linkage
Data base
Join

All Science Journal Classification (ASJC) codes

  • Decision Sciences(all)
  • Business, Management and Accounting(all)

Cite this

Kim, H. S., & Lee, D. (2007). Parallel linkage. In CIKM 2007 - Proceedings of the 16th ACM Conference on Information and Knowledge Management (pp. 283-292). (International Conference on Information and Knowledge Management, Proceedings). https://doi.org/10.1145/1321440.1321482
Kim, Hung Sik ; Lee, Dongwon. / Parallel linkage. CIKM 2007 - Proceedings of the 16th ACM Conference on Information and Knowledge Management. 2007. pp. 283-292 (International Conference on Information and Knowledge Management, Proceedings).
@inproceedings{b6ec16dd97a34ac99880d5de64289e69,
title = "Parallel linkage",
author = "Kim, {Hung Sik} and Dongwon Lee",
year = "2007",
month = "12",
day = "1",
doi = "10.1145/1321440.1321482",
language = "English (US)",
isbn = "9781595938039",
series = "International Conference on Information and Knowledge Management, Proceedings",
pages = "283--292",
booktitle = "CIKM 2007 - Proceedings of the 16th ACM Conference on Information and Knowledge Management",

}

Kim, HS & Lee, D 2007, Parallel linkage. in CIKM 2007 - Proceedings of the 16th ACM Conference on Information and Knowledge Management. International Conference on Information and Knowledge Management, Proceedings, pp. 283-292, 16th ACM Conference on Information and Knowledge Management, CIKM 2007, Lisboa, Portugal, 11/6/07. https://doi.org/10.1145/1321440.1321482

TY - GEN

T1 - Parallel linkage

AU - Kim, Hung Sik

AU - Lee, Dongwon

PY - 2007/12/1

Y1 - 2007/12/1

UR - http://www.scopus.com/inward/record.url?scp=63449096255&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=63449096255&partnerID=8YFLogxK

U2 - 10.1145/1321440.1321482

DO - 10.1145/1321440.1321482

M3 - Conference contribution

AN - SCOPUS:63449096255

SN - 9781595938039

T3 - International Conference on Information and Knowledge Management, Proceedings

SP - 283

EP - 292

BT - CIKM 2007 - Proceedings of the 16th ACM Conference on Information and Knowledge Management

ER -

Kim HS, Lee D. Parallel linkage. In CIKM 2007 - Proceedings of the 16th ACM Conference on Information and Knowledge Management. 2007. p. 283-292. (International Conference on Information and Knowledge Management, Proceedings). https://doi.org/10.1145/1321440.1321482