Efficient name disambiguation for large-scale databases

Jian Huang, Seyda Ertekin, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

101 Scopus citations

Abstract

Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test errors and faster prediction time than a standard SVM. We prove that by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers yielding 490 authors and achieved 90.6% pairwise-F1. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.
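The pipeline described in the abstract (blocking, then density-based clustering within each block) can be illustrated with a minimal sketch. This is not the authors' implementation. The blocking key (last name plus first initial), the `PAPERS` toy records, the parameter values, and the function names are all hypothetical. In particular, the paper learns the pairwise distance with LASVM; here a plain Jaccard distance over feature sets is swapped in as a stand-in so that the example stays self-contained.

```python
from collections import defaultdict

def block_key(name):
    """Hypothetical blocking key: lowercase last name plus first initial."""
    parts = name.lower().replace(".", "").split()
    return (parts[-1], parts[0][0])

def jaccard_distance(a, b):
    """Stand-in distance. The paper learns this with LASVM instead."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)

def dbscan(points, dist, eps, min_pts):
    """Plain DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if i == j or dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # not a core point; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels

def disambiguate(papers, eps, min_pts):
    """Group papers into blocks by name, then cluster each block separately."""
    blocks = defaultdict(list)
    for idx, paper in enumerate(papers):
        blocks[block_key(paper["author"])].append(idx)
    result = {}
    for key, idxs in blocks.items():
        feats = [papers[i]["features"] for i in idxs]
        labels = dbscan(feats, jaccard_distance, eps, min_pts)
        for i, label in zip(idxs, labels):
            result[i] = (key, label)
    return result

# Toy records (invented for illustration only).
PAPERS = [
    {"author": "J. Smith", "features": {"doe", "roe", "svm"}},
    {"author": "John Smith", "features": {"doe", "roe", "clustering"}},
    {"author": "J. Smith", "features": {"doe", "clustering", "search"}},
    {"author": "Jane Smith", "features": {"protein", "folding"}},
    {"author": "J. Smith", "features": {"protein", "folding", "genome"}},
    {"author": "K. Lee", "features": {"search", "index"}},
]

result = disambiguate(PAPERS, eps=0.6, min_pts=2)
```

In this toy run, papers 0 and 2 are farther apart than `eps`, yet they land in the same cluster because both are density-reachable through paper 1. That is the sense in which density reachability stands in for transitivity among core points. Paper 5 sits alone in its block and is labeled noise (`-1`); how such singletons should be treated is left open in this sketch.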

Original language: English (US)
Title of host publication: Knowledge Discovery in Databases
Subtitle of host publication: PKDD 2006 - 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Proceedings
Publisher: Springer Verlag
Pages: 536-544
Number of pages: 9
ISBN (Print): 3540453741, 9783540453741
State: Published - 2006
Event: 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2006 - Berlin, Germany
Duration: Sep 18 2006 to Sep 22 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4213 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2006
Country: Germany
City: Berlin
Period: 9/18/06 to 9/22/06

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)
