Scalable clustering methods for the name disambiguation problem

Byung Won On, Ingyu Lee, Dongwon Lee

Research output: Contribution to journal › Article

11 Citations (Scopus)

Abstract

When non-unique values are used as identifiers of entities, homonyms can cause confusion. In particular, when (part of) the "names" of entities are used as their identifiers, the problem is often referred to as the name disambiguation problem, where the goal is to sort out entities that are erroneously conflated due to name homonyms (e.g., if only the last name is used as the identifier, one cannot distinguish "Masao Obama" from "Norio Obama"). In this paper, we study the scalability issue of the name disambiguation problem, which arises when (1) a small number of entities with large contents or (2) a large number of entities become indistinguishable due to homonyms. First, we carefully examine two state-of-the-art solutions to the name disambiguation problem and point out their limitations with respect to scalability. Then, we propose two scalable graph partitioning algorithms, known as multi-level graph partitioning and multi-level graph partitioning and merging, to solve the large-scale name disambiguation problem. Our claim is empirically validated via experimentation: our proposal shows orders-of-magnitude improvement in performance while maintaining equivalent or reasonable accuracy compared to competing solutions.
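
To make the problem setting concrete, the following is a minimal sketch of name disambiguation framed as clustering on a similarity graph. It is not the multi-level graph partitioning method proposed in the paper; it swaps in a much simpler technique (thresholded Jaccard similarity plus connected components via union-find). The record ids, the coauthor-style context tokens, the function names, and the threshold value are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disambiguate(records, threshold=0.3):
    """Group records that share an ambiguous name into presumed entities.

    records: dict mapping a record id to a set of context tokens
    (hypothetical features, e.g. coauthor names).
    Returns a sorted list of sorted id lists, one list per cluster.
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link every pair of records whose contexts are similar enough.
    # This all-pairs loop is quadratic in the number of records.
    for a, b in combinations(records, 2):
        if jaccard(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for rid in records:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(c) for c in clusters.values())


# Four records filed under the same ambiguous last name (made-up data).
records = {
    "r1": {"Tanaka", "Sato"},
    "r2": {"Tanaka", "Sato", "Ito"},
    "r3": {"Kimura", "Mori"},
    "r4": {"Kimura", "Mori", "Abe"},
}
print(disambiguate(records))  # [['r1', 'r2'], ['r3', 'r4']]
```

The pairwise comparison loop grows quadratically with the number of records, which gives a rough sense of why scalability becomes a concern when many entities share one name.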

Original language: English (US)
Pages (from-to): 129-151
Number of pages: 23
Journal: Knowledge and Information Systems
Volume: 31
Issue number: 1
DOI: 10.1007/s10115-011-0397-1
State: Published - Apr 1 2012

Fingerprint

Scalability
Merging

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Hardware and Architecture
  • Artificial Intelligence

Cite this

@article{7c06f53b17b14f999c364786fed27ecb,
title = "Scalable clustering methods for the name disambiguation problem",
abstract = "When non-unique values are used as identifiers of entities, homonyms can cause confusion. In particular, when (part of) the {"}names{"} of entities are used as their identifiers, the problem is often referred to as the name disambiguation problem, where the goal is to sort out entities that are erroneously conflated due to name homonyms (e.g., if only the last name is used as the identifier, one cannot distinguish {"}Masao Obama{"} from {"}Norio Obama{"}). In this paper, we study the scalability issue of the name disambiguation problem, which arises when (1) a small number of entities with large contents or (2) a large number of entities become indistinguishable due to homonyms. First, we carefully examine two state-of-the-art solutions to the name disambiguation problem and point out their limitations with respect to scalability. Then, we propose two scalable graph partitioning algorithms, known as multi-level graph partitioning and multi-level graph partitioning and merging, to solve the large-scale name disambiguation problem. Our claim is empirically validated via experimentation: our proposal shows orders-of-magnitude improvement in performance while maintaining equivalent or reasonable accuracy compared to competing solutions.",
author = "On, {Byung Won} and Ingyu Lee and Dongwon Lee",
year = "2012",
month = "4",
day = "1",
doi = "10.1007/s10115-011-0397-1",
language = "English (US)",
volume = "31",
pages = "129--151",
journal = "Knowledge and Information Systems",
issn = "0219-1377",
publisher = "Springer London",
number = "1",

}

Scalable clustering methods for the name disambiguation problem. / On, Byung Won; Lee, Ingyu; Lee, Dongwon.

In: Knowledge and Information Systems, Vol. 31, No. 1, 01.04.2012, p. 129-151.

TY - JOUR

T1 - Scalable clustering methods for the name disambiguation problem

AU - On, Byung Won

AU - Lee, Ingyu

AU - Lee, Dongwon

PY - 2012/4/1

Y1 - 2012/4/1

AB - When non-unique values are used as identifiers of entities, homonyms can cause confusion. In particular, when (part of) the "names" of entities are used as their identifiers, the problem is often referred to as the name disambiguation problem, where the goal is to sort out entities that are erroneously conflated due to name homonyms (e.g., if only the last name is used as the identifier, one cannot distinguish "Masao Obama" from "Norio Obama"). In this paper, we study the scalability issue of the name disambiguation problem, which arises when (1) a small number of entities with large contents or (2) a large number of entities become indistinguishable due to homonyms. First, we carefully examine two state-of-the-art solutions to the name disambiguation problem and point out their limitations with respect to scalability. Then, we propose two scalable graph partitioning algorithms, known as multi-level graph partitioning and multi-level graph partitioning and merging, to solve the large-scale name disambiguation problem. Our claim is empirically validated via experimentation: our proposal shows orders-of-magnitude improvement in performance while maintaining equivalent or reasonable accuracy compared to competing solutions.

UR - http://www.scopus.com/inward/record.url?scp=84859108216&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84859108216&partnerID=8YFLogxK

U2 - 10.1007/s10115-011-0397-1

DO - 10.1007/s10115-011-0397-1

M3 - Article

AN - SCOPUS:84859108216

VL - 31

SP - 129

EP - 151

JO - Knowledge and Information Systems

JF - Knowledge and Information Systems

SN - 0219-1377

IS - 1

ER -