Adaptive sorted neighborhood methods for efficient record linkage

Su Yan, Dongwon Lee, Min Yen Kan, Lee C. Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

68 Citations (Scopus)

Abstract

Traditionally, record linkage algorithms have played an important role in maintaining digital libraries - i.e., identifying matching citations or authors for consolidation in updating or integrating digital libraries. As such, a variety of record linkage algorithms have been developed and deployed successfully. Often, however, existing solutions have a set of parameters whose values are set by human experts off-line and are fixed during the execution. Since finding the ideal values of such parameters is not straightforward, or no such single ideal value even exists, the applicability of existing solutions to new scenarios or domains is greatly hampered. To remedy this problem, we argue that one can achieve significant improvement by adaptively and dynamically changing such parameters of record linkage algorithms. To validate our hypothesis, we take a classical record linkage algorithm, the sorted neighborhood method (SNM), and demonstrate how we can achieve improved accuracy and performance by adaptively changing its fixed sliding window size. Our claim is analytically and empirically validated using both real and synthetic data sets of digital libraries and other domains.
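
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of the classical fixed-window sorted neighborhood method: records are sorted by a key, and each record is compared only with the few records that precede it in sorted order. It is an illustration, not the authors' implementation. The function name, the `key`, `window`, and `threshold` parameters, the use of `difflib` string similarity, and the sample titles are all assumptions made for this example. The adaptive window adjustment proposed in the paper is not described in the abstract and is not shown.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Classical fixed-window SNM (illustrative sketch).

    Sort records by a key, then compare each record only with the
    window - 1 records that precede it in sorted order.
    """
    # Indices of the records in sorted-key order.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    matches = set()
    for pos, i in enumerate(order):
        # Candidates: the previous (window - 1) records in sorted order.
        for j in order[max(0, pos - window + 1):pos]:
            ratio = SequenceMatcher(None, records[i], records[j]).ratio()
            if ratio >= threshold:
                matches.add((min(i, j), max(i, j)))
    return matches


# Hypothetical sample data: two spellings of one title plus an unrelated one.
citations = [
    "Adaptive sorted neighborhood methods for efficient record linkage",
    "Record linkage in digital libraries",
    "Adaptive sorted neighbourhood methods for efficient record linkage",
]
pairs = sorted_neighborhood(citations, key=str.lower, window=2)
print(pairs)  # {(0, 2)}
```

The `window` argument here is the fixed sliding window size that the abstract says is normally set off-line by an expert. A window that is too small misses matches that sort far apart, while a window that is too large adds unnecessary comparisons.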

Original language: English (US)
Title of host publication: Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007
Subtitle of host publication: Building and Sustaining the Digital Environment
Pages: 185-194
Number of pages: 10
DOIs: https://doi.org/10.1145/1255175.1255213
State: Published - Nov 29 2007
Event: 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment - Vancouver, BC, Canada
Duration: Jun 18 2007 → Jun 23 2007

Publication series

Name: Proceedings of the ACM International Conference on Digital Libraries

Other

Other: 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment
Country: Canada
City: Vancouver, BC
Period: 6/18/07 → 6/23/07

Fingerprint

Digital libraries
Values
Consolidation
remedies
expert
scenario
performance

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences

Cite this

Yan, S., Lee, D., Kan, M. Y., & Giles, L. C. (2007). Adaptive sorted neighborhood methods for efficient record linkage. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment (pp. 185-194). (Proceedings of the ACM International Conference on Digital Libraries). https://doi.org/10.1145/1255175.1255213
Yan, Su ; Lee, Dongwon ; Kan, Min Yen ; Giles, Lee C. / Adaptive sorted neighborhood methods for efficient record linkage. Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. 2007. pp. 185-194 (Proceedings of the ACM International Conference on Digital Libraries).
@inproceedings{b009baeb31674b8586518cc7a4cde4b0,
title = "Adaptive sorted neighborhood methods for efficient record linkage",
abstract = "Traditionally, record linkage algorithms have played an important role in maintaining digital libraries - i.e., identifying matching citations or authors for consolidation in updating or integrating digital libraries. As such, a variety of record linkage algorithms have been developed and deployed successfully. Often, however, existing solutions have a set of parameters whose values are set by human experts off-line and are fixed during the execution. Since finding the ideal values of such parameters is not straightforward, or no such single ideal value even exists, the applicability of existing solutions to new scenarios or domains is greatly hampered. To remedy this problem, we argue that one can achieve significant improvement by adaptively and dynamically changing such parameters of record linkage algorithms. To validate our hypothesis, we take a classical record linkage algorithm, the sorted neighborhood method (SNM), and demonstrate how we can achieve improved accuracy and performance by adaptively changing its fixed sliding window size. Our claim is analytically and empirically validated using both real and synthetic data sets of digital libraries and other domains.",
author = "Su Yan and Dongwon Lee and Kan, {Min Yen} and Giles, {Lee C.}",
year = "2007",
month = "11",
day = "29",
doi = "10.1145/1255175.1255213",
language = "English (US)",
isbn = "1595936440",
series = "Proceedings of the ACM International Conference on Digital Libraries",
pages = "185--194",
booktitle = "Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007",

}

Yan, S, Lee, D, Kan, MY & Giles, LC 2007, Adaptive sorted neighborhood methods for efficient record linkage. in Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. Proceedings of the ACM International Conference on Digital Libraries, pp. 185-194, 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment, Vancouver, BC, Canada, 6/18/07. https://doi.org/10.1145/1255175.1255213

Adaptive sorted neighborhood methods for efficient record linkage. / Yan, Su; Lee, Dongwon; Kan, Min Yen; Giles, Lee C.

Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. 2007. p. 185-194 (Proceedings of the ACM International Conference on Digital Libraries).


TY - GEN

T1 - Adaptive sorted neighborhood methods for efficient record linkage

AU - Yan, Su

AU - Lee, Dongwon

AU - Kan, Min Yen

AU - Giles, Lee C.

PY - 2007/11/29

Y1 - 2007/11/29

N2 - Traditionally, record linkage algorithms have played an important role in maintaining digital libraries - i.e., identifying matching citations or authors for consolidation in updating or integrating digital libraries. As such, a variety of record linkage algorithms have been developed and deployed successfully. Often, however, existing solutions have a set of parameters whose values are set by human experts off-line and are fixed during the execution. Since finding the ideal values of such parameters is not straightforward, or no such single ideal value even exists, the applicability of existing solutions to new scenarios or domains is greatly hampered. To remedy this problem, we argue that one can achieve significant improvement by adaptively and dynamically changing such parameters of record linkage algorithms. To validate our hypothesis, we take a classical record linkage algorithm, the sorted neighborhood method (SNM), and demonstrate how we can achieve improved accuracy and performance by adaptively changing its fixed sliding window size. Our claim is analytically and empirically validated using both real and synthetic data sets of digital libraries and other domains.

AB - Traditionally, record linkage algorithms have played an important role in maintaining digital libraries - i.e., identifying matching citations or authors for consolidation in updating or integrating digital libraries. As such, a variety of record linkage algorithms have been developed and deployed successfully. Often, however, existing solutions have a set of parameters whose values are set by human experts off-line and are fixed during the execution. Since finding the ideal values of such parameters is not straightforward, or no such single ideal value even exists, the applicability of existing solutions to new scenarios or domains is greatly hampered. To remedy this problem, we argue that one can achieve significant improvement by adaptively and dynamically changing such parameters of record linkage algorithms. To validate our hypothesis, we take a classical record linkage algorithm, the sorted neighborhood method (SNM), and demonstrate how we can achieve improved accuracy and performance by adaptively changing its fixed sliding window size. Our claim is analytically and empirically validated using both real and synthetic data sets of digital libraries and other domains.

UR - http://www.scopus.com/inward/record.url?scp=36348961379&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=36348961379&partnerID=8YFLogxK

U2 - 10.1145/1255175.1255213

DO - 10.1145/1255175.1255213

M3 - Conference contribution

AN - SCOPUS:36348961379

SN - 1595936440

SN - 9781595936448

T3 - Proceedings of the ACM International Conference on Digital Libraries

SP - 185

EP - 194

BT - Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007

ER -

Yan S, Lee D, Kan MY, Giles LC. Adaptive sorted neighborhood methods for efficient record linkage. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. 2007. p. 185-194. (Proceedings of the ACM International Conference on Digital Libraries). https://doi.org/10.1145/1255175.1255213