Efficient topic-based unsupervised name disambiguation

Yang Song, Jian Huang, Isaac G. Councill, Jia Li, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

87 Citations (Scopus)

Abstract

Name ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage approach to disambiguate names. In the first stage, two novel topic-based models are proposed by extending two hierarchical Bayesian text models, namely Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. After learning an initial model, the topic distributions are treated as feature sets and names are disambiguated by leveraging a hierarchical agglomerative clustering method. Experiments on web data and scientific documents from CiteSeer indicate that our approach consistently outperforms other unsupervised learning methods such as spectral clustering and DBSCAN clustering and could be extended to other research fields. We empirically addressed the issue of scalability by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.
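The second stage described in the abstract (topic distributions used as feature sets, then hierarchical agglomerative clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the topic distributions below are hypothetical toy values standing in for the output of the first-stage topic model, and the Hellinger distance, average linkage, and stopping threshold are assumptions chosen only for illustration.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def average_linkage(c1, c2, feats):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(hellinger(feats[i], feats[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(feats, threshold):
    """Merge the closest pair of clusters until no pair is closer than threshold."""
    clusters = [[i] for i in range(len(feats))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average-linkage distance.
        (a, b), dist = min(
            (((x, y), average_linkage(clusters[x], clusters[y], feats))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        # a < b always holds, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical topic distributions (assumed, not from the paper) for four
# documents that all mention the same ambiguous name string.
topic_features = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.10, 0.85],
    [0.10, 0.05, 0.85],
]
clusters = agglomerate(topic_features, threshold=0.3)
print(clusters)
```

Each resulting cluster is read as one distinct person: documents whose topic mixtures are close end up grouped together, while documents about unrelated topics stay apart. The threshold controls how many distinct persons are produced and would need tuning on real data.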

Original language: English (US)
Title of host publication: Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007
Subtitle of host publication: Building and Sustaining the Digital Environment
Pages: 342-351
Number of pages: 10
DOIs: https://doi.org/10.1145/1255175.1255243
State: Published - Nov 29 2007
Event: 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment - Vancouver, BC, Canada
Duration: Jun 18 2007 to Jun 23 2007

Publication series

Name: Proceedings of the ACM International Conference on Digital Libraries

Other

Other: 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment
Country: Canada
City: Vancouver, BC
Period: 6/18/07 to 6/23/07

Fingerprint

Unsupervised learning
human being
Names of Persons
Scalability
Websites
Semantics
learning method
field research
semantics
uncertainty
Experiments
experiment
learning
Uncertainty
Statistical Models

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences

Cite this

Song, Y., Huang, J., Councill, I. G., Li, J., & Giles, C. L. (2007). Efficient topic-based unsupervised name disambiguation. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment (pp. 342-351). (Proceedings of the ACM International Conference on Digital Libraries). https://doi.org/10.1145/1255175.1255243
Song, Yang; Huang, Jian; Councill, Isaac G.; Li, Jia; Giles, C. Lee. / Efficient topic-based unsupervised name disambiguation. Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. 2007. pp. 342-351 (Proceedings of the ACM International Conference on Digital Libraries).
@inproceedings{9c729a68f42d4bd6aad7881eb2b6d24b,
title = "Efficient topic-based unsupervised name disambiguation",
author = "Yang Song and Jian Huang and Councill, {Isaac G.} and Jia Li and Giles, {C. Lee}",
year = "2007",
month = "11",
day = "29",
doi = "10.1145/1255175.1255243",
language = "English (US)",
isbn = "1595936440",
series = "Proceedings of the ACM International Conference on Digital Libraries",
pages = "342--351",
booktitle = "Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007",

}

Song, Y, Huang, J, Councill, IG, Li, J & Giles, CL 2007, Efficient topic-based unsupervised name disambiguation. in Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. Proceedings of the ACM International Conference on Digital Libraries, pp. 342-351, 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment, Vancouver, BC, Canada, 6/18/07. https://doi.org/10.1145/1255175.1255243

Efficient topic-based unsupervised name disambiguation. / Song, Yang; Huang, Jian; Councill, Isaac G.; Li, Jia; Giles, C. Lee.

Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. 2007. p. 342-351 (Proceedings of the ACM International Conference on Digital Libraries).

TY - GEN

T1 - Efficient topic-based unsupervised name disambiguation

AU - Song, Yang

AU - Huang, Jian

AU - Councill, Isaac G.

AU - Li, Jia

AU - Giles, C. Lee

PY - 2007/11/29

Y1 - 2007/11/29

UR - http://www.scopus.com/inward/record.url?scp=36348962507&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=36348962507&partnerID=8YFLogxK

U2 - 10.1145/1255175.1255243

DO - 10.1145/1255175.1255243

M3 - Conference contribution

AN - SCOPUS:36348962507

SN - 1595936440

SN - 9781595936448

T3 - Proceedings of the ACM International Conference on Digital Libraries

SP - 342

EP - 351

BT - Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007

ER -

Song Y, Huang J, Councill IG, Li J, Giles CL. Efficient topic-based unsupervised name disambiguation. In Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries, JCDL 2007: Building and Sustaining the Digital Environment. 2007. p. 342-351. (Proceedings of the ACM International Conference on Digital Libraries). https://doi.org/10.1145/1255175.1255243