Large scale author name disambiguation in digital libraries

Madian Khabsa, Pucktada Treeratpituk, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

9 Citations (Scopus)

Abstract

Person name disambiguation is essential for distinguishing between people who share the same name when unique identifiers are not present. This problem is common in many domains, including digital libraries, where the same name can refer to multiple distinct authors. Correctly attributing work and citations requires the digital library's database to be disambiguated. In this work, we describe a large-scale framework for disambiguating author names efficiently and effectively. The framework uses a density-based clustering algorithm with a random-forest-based distance function to cluster unique authors. Effective use of blocking functions allows the clustering algorithm to be run in parallel. In our experiments, we show that the framework disambiguates the authors of more than 4 million papers in 24 hours.
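The pipeline outlined in the abstract (block the records, then cluster each block independently with a density-based algorithm driven by a pairwise distance) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the blocking key (first initial plus last name), the choice of DBSCAN as the density-based algorithm, the eps and min_pts values, the record fields, and above all the distance function, which is a hand-written Jaccard distance over coauthor sets standing in for the learned random-forest distance. Threads are used only to show that blocks can be processed independently.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def block_key(name):
    """Hypothetical blocking function: first initial + last name, lowercased."""
    parts = name.lower().replace(".", " ").split()
    return f"{parts[0][0]} {parts[-1]}"


def distance(a, b):
    """Stand-in for the learned distance: Jaccard distance over coauthor sets."""
    union = a["coauthors"] | b["coauthors"]
    if not union:
        return 1.0
    return 1.0 - len(a["coauthors"] & b["coauthors"]) / len(union)


def dbscan(records, eps=0.7, min_pts=2):
    """Minimal DBSCAN over one block; returns one label per record (-1 = unclustered)."""
    n = len(records)
    # Precompute eps-neighborhoods (each point may include itself).
    neighbors = [
        [j for j in range(n) if distance(records[i], records[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # not a core point; may be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


def disambiguate(records):
    """Group records into blocks, cluster each block independently, merge results."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec["name"])].append(rec)
    keys = list(blocks)
    # Blocks share no state, so they can be clustered in parallel.
    with ThreadPoolExecutor() as pool:
        label_lists = list(pool.map(dbscan, (blocks[k] for k in keys)))
    out = {}
    for key, labels in zip(keys, label_lists):
        for rec, label in zip(blocks[key], labels):
            out[rec["paper"]] = (key, label)
    return out


# Made-up sample records, for illustration only.
RECORDS = [
    {"paper": "p1", "name": "J. Smith", "coauthors": {"a. lee", "b. chen"}},
    {"paper": "p2", "name": "John Smith", "coauthors": {"a. lee", "c. diaz"}},
    {"paper": "p3", "name": "J. Smith", "coauthors": {"x. wu", "y. roy"}},
    {"paper": "p4", "name": "Jane Smith", "coauthors": {"x. wu", "y. roy", "z. kim"}},
    {"paper": "p5", "name": "M. Jones", "coauthors": {"a. lee"}},
]
```

Calling disambiguate(RECORDS) on this toy data places p1 with p2 and p3 with p4 inside the "j smith" block, and leaves p5 unclustered in its own block. In a real system the distance would come from a trained classifier over richer features, and an unclustered record would typically be treated as a single-paper author.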

Original language: English (US)
Title of host publication: Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
Editors: Wo Chang, Jun Huan, Nick Cercone, Saumyadipta Pyne, Vasant Honavar, Jimmy Lin, Xiaohua Tony Hu, Charu Aggarwal, Bamshad Mobasher, Jian Pei, Raghunath Nambiar
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 41-42
Number of pages: 2
ISBN (Electronic): 9781479956654
DOIs: https://doi.org/10.1109/BigData.2014.7004487
State: Published - Jan 7 2015
Event: 2nd IEEE International Conference on Big Data, IEEE Big Data 2014 - Washington, United States
Duration: Oct 27 2014 to Oct 30 2014

Publication series

Name: Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014

Other

Other: 2nd IEEE International Conference on Big Data, IEEE Big Data 2014
Country: United States
City: Washington
Period: 10/27/14 to 10/30/14

Fingerprint

Digital libraries
Clustering algorithms
Experiments

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Information Systems

Cite this

Khabsa, M., Treeratpituk, P., & Giles, C. L. (2015). Large scale author name disambiguation in digital libraries. In W. Chang, J. Huan, N. Cercone, S. Pyne, V. Honavar, J. Lin, X. T. Hu, C. Aggarwal, B. Mobasher, J. Pei, ... R. Nambiar (Eds.), Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014 (pp. 41-42). [7004487] (Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2014.7004487
Khabsa, Madian ; Treeratpituk, Pucktada ; Giles, C. Lee. / Large scale author name disambiguation in digital libraries. Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014. editor / Wo Chang ; Jun Huan ; Nick Cercone ; Saumyadipta Pyne ; Vasant Honavar ; Jimmy Lin ; Xiaohua Tony Hu ; Charu Aggarwal ; Bamshad Mobasher ; Jian Pei ; Raghunath Nambiar. Institute of Electrical and Electronics Engineers Inc., 2015. pp. 41-42 (Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014).
@inproceedings{78a33f59123c4873abcb60635f9d884f,
title = "Large scale author name disambiguation in digital libraries",
abstract = "Person name disambiguation is essential for distinguishing between people who share the same name when unique identifiers are not present. This problem is common in many domains, including digital libraries, where the same name can refer to multiple distinct authors. Correctly attributing work and citations requires the digital library's database to be disambiguated. In this work, we describe a large-scale framework for disambiguating author names efficiently and effectively. The framework uses a density-based clustering algorithm with a random-forest-based distance function to cluster unique authors. Effective use of blocking functions allows the clustering algorithm to be run in parallel. In our experiments, we show that the framework disambiguates the authors of more than 4 million papers in 24 hours.",
author = "Madian Khabsa and Pucktada Treeratpituk and Giles, {C. Lee}",
year = "2015",
month = "1",
day = "7",
doi = "10.1109/BigData.2014.7004487",
language = "English (US)",
series = "Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "41--42",
editor = "Wo Chang and Jun Huan and Nick Cercone and Saumyadipta Pyne and Vasant Honavar and Jimmy Lin and Hu, {Xiaohua Tony} and Charu Aggarwal and Bamshad Mobasher and Jian Pei and Raghunath Nambiar",
booktitle = "Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014",
address = "United States",

}

Khabsa, M, Treeratpituk, P & Giles, CL 2015, Large scale author name disambiguation in digital libraries. in W Chang, J Huan, N Cercone, S Pyne, V Honavar, J Lin, XT Hu, C Aggarwal, B Mobasher, J Pei & R Nambiar (eds), Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014, 7004487, Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014, Institute of Electrical and Electronics Engineers Inc., pp. 41-42, 2nd IEEE International Conference on Big Data, IEEE Big Data 2014, Washington, United States, 10/27/14. https://doi.org/10.1109/BigData.2014.7004487

TY - GEN

T1 - Large scale author name disambiguation in digital libraries

AU - Khabsa, Madian

AU - Treeratpituk, Pucktada

AU - Giles, C. Lee

PY - 2015/1/7

Y1 - 2015/1/7

AB - Person name disambiguation is essential for distinguishing between people who share the same name when unique identifiers are not present. This problem is common in many domains, including digital libraries, where the same name can refer to multiple distinct authors. Correctly attributing work and citations requires the digital library's database to be disambiguated. In this work, we describe a large-scale framework for disambiguating author names efficiently and effectively. The framework uses a density-based clustering algorithm with a random-forest-based distance function to cluster unique authors. Effective use of blocking functions allows the clustering algorithm to be run in parallel. In our experiments, we show that the framework disambiguates the authors of more than 4 million papers in 24 hours.

UR - http://www.scopus.com/inward/record.url?scp=84921795861&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84921795861&partnerID=8YFLogxK

U2 - 10.1109/BigData.2014.7004487

DO - 10.1109/BigData.2014.7004487

M3 - Conference contribution

T3 - Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014

SP - 41

EP - 42

BT - Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014

A2 - Chang, Wo

A2 - Huan, Jun

A2 - Cercone, Nick

A2 - Pyne, Saumyadipta

A2 - Honavar, Vasant

A2 - Lin, Jimmy

A2 - Hu, Xiaohua Tony

A2 - Aggarwal, Charu

A2 - Mobasher, Bamshad

A2 - Pei, Jian

A2 - Nambiar, Raghunath

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Khabsa M, Treeratpituk P, Giles CL. Large scale author name disambiguation in digital libraries. In Chang W, Huan J, Cercone N, Pyne S, Honavar V, Lin J, Hu XT, Aggarwal C, Mobasher B, Pei J, Nambiar R, editors, Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014. Institute of Electrical and Electronics Engineers Inc. 2015. p. 41-42. 7004487. (Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014). https://doi.org/10.1109/BigData.2014.7004487