Disambiguating authors in academic publications using random forests

Pucktada Treeratpituk, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

94 Citations (Scopus)

Abstract

Users of digital libraries usually want to know the exact author or authors of an article. But different authors may share the same names, either as full names or as initials and last names (complete name change examples are not considered here). In such a case, the user would like the digital library to differentiate among these authors. Name disambiguation can help in many cases, for example when a user searches for all articles written by a particular author. Disambiguation also enables better bibliometric analysis by allowing more accurate counting and grouping of publications and citations. In this paper, we describe an algorithm for pairwise disambiguation of author names based on a machine learning classification algorithm, random forests. We define a set of similarity profile features to assist in author disambiguation. Our experiments on the Medline database show that the random forest model outperforms other previously proposed techniques such as those using support-vector machines (SVM). In addition, we demonstrate that the variable importance produced by the random forest model can be used in feature selection with little degradation in the disambiguation accuracy. In particular, the inverse document frequency of the author's last name and the middle name's similarity alone achieve an accuracy of almost 90%.
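The two features singled out at the end of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout (dicts with `last` and `middle` keys), the three-level middle-name score, and the function names are all assumptions made for illustration, and the random forest classifier that would consume these profiles is omitted.

```python
import math
from collections import Counter


def last_name_idf(last_names):
    """Return {last name: log(N / df)} over a list of author last names.

    Rare last names get a high IDF, so two records sharing a rare name are
    more likely to refer to the same person than two sharing a common one.
    """
    n = len(last_names)
    counts = Counter(name.lower() for name in last_names)
    return {name: math.log(n / df) for name, df in counts.items()}


def middle_name_similarity(a, b):
    """Hypothetical scoring: 1.0 compatible, 0.5 unknown, 0.0 conflicting."""
    a = a.strip().lower().rstrip(".")
    b = b.strip().lower().rstrip(".")
    if not a or not b:
        return 0.5  # one middle name is missing, so nothing can be concluded
    if a == b:
        return 1.0
    # An initial is treated as compatible with a full name it abbreviates.
    if (len(a) == 1 or len(b) == 1) and a[0] == b[0]:
        return 1.0
    return 0.0


def similarity_profile(rec_a, rec_b, idf):
    """Build a two-feature profile for a pair of same-last-name records."""
    last = rec_a["last"].lower()
    return (idf.get(last, 0.0),
            middle_name_similarity(rec_a["middle"], rec_b["middle"]))


if __name__ == "__main__":
    corpus = ["Smith", "Smith", "Smith", "Giles"]
    idf = last_name_idf(corpus)
    pair = ({"last": "Giles", "middle": "L."},
            {"last": "Giles", "middle": "Lee"})
    print(similarity_profile(pair[0], pair[1], idf))
```

In a pairwise setup such as the one the abstract describes, each profile would be labeled as "same author" or "different author" and passed to a classifier; here the profile is only printed.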

Original language: English (US)
Title of host publication: JCDL'09 - Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries
Pages: 39-48
Number of pages: 10
DOIs: https://doi.org/10.1145/1555400.1555408
State: Published - Nov 30 2009
Event: 2009 ACM/IEEE Joint Conference on Digital Libraries, JCDL'09 - Austin, TX, United States
Duration: Jun 15 2009 to Jun 19 2009

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
ISSN (Print): 1552-5996

Other

Other: 2009 ACM/IEEE Joint Conference on Digital Libraries, JCDL'09
Country: United States
City: Austin, TX
Period: 6/15/09 to 6/19/09

Fingerprint

Digital libraries
Support vector machines
Learning systems
Feature extraction
Degradation
Experiments

All Science Journal Classification (ASJC) codes

  • Engineering(all)

Cite this

Treeratpituk, P., & Giles, C. L. (2009). Disambiguating authors in academic publications using random forests. In JCDL'09 - Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries (pp. 39-48). (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries). https://doi.org/10.1145/1555400.1555408
Treeratpituk, Pucktada ; Giles, C. Lee. / Disambiguating authors in academic publications using random forests. JCDL'09 - Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries. 2009. pp. 39-48 (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries).
@inproceedings{11db33eaebf34c95a666f053f59e8cfc,
title = "Disambiguating authors in academic publications using random forests",
abstract = "Users of digital libraries usually want to know the exact author or authors of an article. But different authors may share the same names, either as full names or as initials and last names (complete name change examples are not considered here). In such a case, the user would like the digital library to differentiate among these authors. Name disambiguation can help in many cases; one being a user in a search of all articles written by a particular author. Disambiguation also enables better bibliometric analysis by allowing a more accurate counting and grouping of publications and citations. In this paper, we describe an algorithm for pairwise disambiguation of author names based on a machine learning classification algorithm, random forests. We define a set of similarity profile features to assist in author disambiguation. Our experiments on the Medline database show that the random forest model outperforms other previously proposed techniques such as those using support-vector machines (SVM). In addition, we demonstrate that the variable importance produced by the random forest model can be used in feature selection with little degradation in the disambiguation accuracy. In particular, the inverse document frequency of author last name and the middle name's similarity alone achieves an accuracy of almost 90{\%}.",
author = "Treeratpituk, Pucktada and Giles, C. Lee",
year = "2009",
month = "11",
day = "30",
doi = "10.1145/1555400.1555408",
language = "English (US)",
isbn = "9781605586977",
series = "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries",
pages = "39--48",
booktitle = "JCDL'09 - Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries",

}

Treeratpituk, P & Giles, CL 2009, Disambiguating authors in academic publications using random forests. in JCDL'09 - Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, pp. 39-48, 2009 ACM/IEEE Joint Conference on Digital Libraries, JCDL'09, Austin, TX, United States, 6/15/09. https://doi.org/10.1145/1555400.1555408

Disambiguating authors in academic publications using random forests. / Treeratpituk, Pucktada; Giles, C. Lee.

JCDL'09 - Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries. 2009. p. 39-48 (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries).

TY - GEN

T1 - Disambiguating authors in academic publications using random forests

AU - Treeratpituk, Pucktada

AU - Giles, C. Lee

PY - 2009/11/30

Y1 - 2009/11/30

N2 - Users of digital libraries usually want to know the exact author or authors of an article. But different authors may share the same names, either as full names or as initials and last names (complete name change examples are not considered here). In such a case, the user would like the digital library to differentiate among these authors. Name disambiguation can help in many cases; one being a user in a search of all articles written by a particular author. Disambiguation also enables better bibliometric analysis by allowing a more accurate counting and grouping of publications and citations. In this paper, we describe an algorithm for pairwise disambiguation of author names based on a machine learning classification algorithm, random forests. We define a set of similarity profile features to assist in author disambiguation. Our experiments on the Medline database show that the random forest model outperforms other previously proposed techniques such as those using support-vector machines (SVM). In addition, we demonstrate that the variable importance produced by the random forest model can be used in feature selection with little degradation in the disambiguation accuracy. In particular, the inverse document frequency of author last name and the middle name's similarity alone achieves an accuracy of almost 90%.

AB - Users of digital libraries usually want to know the exact author or authors of an article. But different authors may share the same names, either as full names or as initials and last names (complete name change examples are not considered here). In such a case, the user would like the digital library to differentiate among these authors. Name disambiguation can help in many cases; one being a user in a search of all articles written by a particular author. Disambiguation also enables better bibliometric analysis by allowing a more accurate counting and grouping of publications and citations. In this paper, we describe an algorithm for pairwise disambiguation of author names based on a machine learning classification algorithm, random forests. We define a set of similarity profile features to assist in author disambiguation. Our experiments on the Medline database show that the random forest model outperforms other previously proposed techniques such as those using support-vector machines (SVM). In addition, we demonstrate that the variable importance produced by the random forest model can be used in feature selection with little degradation in the disambiguation accuracy. In particular, the inverse document frequency of author last name and the middle name's similarity alone achieves an accuracy of almost 90%.

UR - http://www.scopus.com/inward/record.url?scp=70450273106&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70450273106&partnerID=8YFLogxK

U2 - 10.1145/1555400.1555408

DO - 10.1145/1555400.1555408

M3 - Conference contribution

AN - SCOPUS:70450273106

SN - 9781605586977

T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries

SP - 39

EP - 48

BT - JCDL'09 - Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries

ER -

Treeratpituk P, Giles CL. Disambiguating authors in academic publications using random forests. In JCDL'09 - Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries. 2009. p. 39-48. (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries). https://doi.org/10.1145/1555400.1555408