HARRA: Fast iterative hashed record linkage for large-scale data collections

Hung Sik Kim, Dongwon Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

54 Citations (Scopus)

Abstract

We study the performance issue of the "iterative" record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose Iterative Locality-Sensitive Hashing (I-LSH), which dynamically merges LSH-based hash tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named HARRA (HAshed RecoRd linkAge). The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets. While maintaining equivalent or comparable accuracy levels, for instance, HARRA runs: (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh, respectively, in iteratively linking 4,000 x 4,000 short records (i.e., one of the small test cases), and (2) 5.6 and 3.4 times faster than basic LSH and Multi-Probe LSH algorithms, respectively, in iteratively linking 400,000 x 400,000 long records (i.e., the largest test case).
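The general idea of iterating "block, match, merge" until nothing changes can be illustrated with a small sketch. This is not the HARRA or I-LSH algorithm from the paper: it assumes plain MinHash signatures with banding as the LSH scheme, character bigrams as tokens, Jaccard similarity as the match test, and token-set union as the merge step. All function names and parameter values (`shingles`, `minhash`, `iterative_link`, `bands`, `rows`, `threshold`) are hypothetical choices made for illustration.

```python
import hashlib
from itertools import combinations


def shingles(text, n=2):
    """Character n-grams of a lower-cased string (assumed tokenization)."""
    s = text.lower()
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}


def minhash(tokens, num_hashes):
    """MinHash signature: per seed, the minimum hash value over all tokens."""
    def h(seed, tok):
        digest = hashlib.md5(f"{seed}:{tok}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return tuple(min(h(seed, t) for t in tokens) for seed in range(num_hashes))


def jaccard(a, b):
    return len(a & b) / len(a | b)


def iterative_link(records, bands=8, rows=2, threshold=0.5):
    """Repeat LSH blocking, matching, and merging until no merge occurs."""
    # Each cluster is (member indices, merged token set).
    clusters = [({i}, shingles(r)) for i, r in enumerate(records)]
    while True:
        # Blocking: clusters sharing any band of their signature share a bucket.
        buckets = {}
        for cid, (_, toks) in enumerate(clusters):
            sig = minhash(toks, bands * rows)
            for b in range(bands):
                key = (b, sig[b * rows:(b + 1) * rows])
                buckets.setdefault(key, []).append(cid)

        # Matching: compare only candidates that share a bucket.
        parent = list(range(len(clusters)))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        merged_any = False
        for ids in buckets.values():
            for a, b in combinations(ids, 2):
                ra, rb = find(a), find(b)
                if ra != rb and jaccard(clusters[a][1], clusters[b][1]) >= threshold:
                    parent[rb] = ra
                    merged_any = True

        if not merged_any:  # convergence: a full pass produced no merge
            return sorted(sorted(members) for members, _ in clusters)

        # Merging: union members and tokens, then re-hash on the next pass.
        groups = {}
        for cid, (members, toks) in enumerate(clusters):
            m, t = groups.setdefault(find(cid), (set(), set()))
            m |= members
            t |= toks
        clusters = list(groups.values())


print(iterative_link(["John Smith", "john smith", "Zed Quux"]))
# [[0, 1], [2]]
```

In this toy input the first two records have identical bigram sets after lower-casing, so they always share every bucket and merge, while the third shares no bigram with them and stays separate. Because merged clusters are re-hashed on each pass, a merge can expose new candidate pairs in later iterations, which is the property that makes the linkage iterative.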

Original language: English (US)
Title of host publication: Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings
Pages: 525-536
Number of pages: 12
DOI: https://doi.org/10.1145/1739041.1739104
State: Published - May 19 2010
Event: 13th International Conference on Extending Database Technology: Advances in Database Technology - EDBT 2010 - Lausanne, Switzerland
Duration: Mar 22 2010 to Mar 26 2010

Publication series

Name: Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software

Cite this

Kim, H. S., & Lee, D. (2010). HARRA: Fast iterative hashed record linkage for large-scale data collections. In Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings (pp. 525-536). (Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings). https://doi.org/10.1145/1739041.1739104
Kim, Hung Sik; Lee, Dongwon. / HARRA: Fast iterative hashed record linkage for large-scale data collections. Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings. 2010. pp. 525-536 (Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings).
@inproceedings{c6da82c7a35a4b4cba07e742150acb72,
title = "HARRA: Fast iterative hashed record linkage for large-scale data collections",
abstract = "We study the performance issue of the {"}iterative{"} record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose Iterative Locality-Sensitive Hashing (I-LSH), which dynamically merges LSH-based hash tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named HARRA (HAshed RecoRd linkAge). The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets. While maintaining equivalent or comparable accuracy levels, for instance, HARRA runs: (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh, respectively, in iteratively linking 4,000 x 4,000 short records (i.e., one of the small test cases), and (2) 5.6 and 3.4 times faster than basic LSH and Multi-Probe LSH algorithms, respectively, in iteratively linking 400,000 x 400,000 long records (i.e., the largest test case).",
author = "Kim, {Hung Sik} and Lee, Dongwon",
year = "2010",
month = "5",
day = "19",
doi = "10.1145/1739041.1739104",
language = "English (US)",
isbn = "9781605589459",
series = "Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings",
pages = "525--536",
booktitle = "Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings",

}

Kim, HS & Lee, D 2010, HARRA: Fast iterative hashed record linkage for large-scale data collections. in Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings. Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings, pp. 525-536, 13th International Conference on Extending Database Technology: Advances in Database Technology - EDBT 2010, Lausanne, Switzerland, 3/22/10. https://doi.org/10.1145/1739041.1739104

HARRA: Fast iterative hashed record linkage for large-scale data collections. / Kim, Hung Sik; Lee, Dongwon.

Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings. 2010. p. 525-536 (Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings).


TY - GEN

T1 - HARRA

T2 - Fast iterative hashed record linkage for large-scale data collections

AU - Kim, Hung Sik

AU - Lee, Dongwon

PY - 2010/5/19

Y1 - 2010/5/19

N2 - We study the performance issue of the "iterative" record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose Iterative Locality-Sensitive Hashing (I-LSH), which dynamically merges LSH-based hash tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named HARRA (HAshed RecoRd linkAge). The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets. While maintaining equivalent or comparable accuracy levels, for instance, HARRA runs: (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh, respectively, in iteratively linking 4,000 x 4,000 short records (i.e., one of the small test cases), and (2) 5.6 and 3.4 times faster than basic LSH and Multi-Probe LSH algorithms, respectively, in iteratively linking 400,000 x 400,000 long records (i.e., the largest test case).

AB - We study the performance issue of the "iterative" record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose Iterative Locality-Sensitive Hashing (I-LSH), which dynamically merges LSH-based hash tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named HARRA (HAshed RecoRd linkAge). The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets. While maintaining equivalent or comparable accuracy levels, for instance, HARRA runs: (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh, respectively, in iteratively linking 4,000 x 4,000 short records (i.e., one of the small test cases), and (2) 5.6 and 3.4 times faster than basic LSH and Multi-Probe LSH algorithms, respectively, in iteratively linking 400,000 x 400,000 long records (i.e., the largest test case).

UR - http://www.scopus.com/inward/record.url?scp=77952280581&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77952280581&partnerID=8YFLogxK

U2 - 10.1145/1739041.1739104

DO - 10.1145/1739041.1739104

M3 - Conference contribution

AN - SCOPUS:77952280581

SN - 9781605589459

T3 - Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings

SP - 525

EP - 536

BT - Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings

ER -

Kim HS, Lee D. HARRA: Fast iterative hashed record linkage for large-scale data collections. In Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings. 2010. p. 525-536. (Advances in Database Technology - EDBT 2010 - 13th International Conference on Extending Database Technology, Proceedings). https://doi.org/10.1145/1739041.1739104