Name-ethnicity classification and ethnicity-sensitive name matching

Pucktada Treeratpituk, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Citations (Scopus)

Abstract

Personal names are important and common information in many data sources, ranging from social networks and news articles to patient records and scientific documents. They are often used as queries for retrieving records and as key information for linking documents from multiple sources. Matching personal names can be challenging because of variations in spelling and formatting. While many approximate name-matching techniques have been proposed, most are generic string-matching algorithms. Unlike other types of proper names, personal names are highly cultural: many ethnicities have their own naming systems and identifiable characteristics. In this paper we explore such relationships between ethnicities and personal names to improve name-matching performance. First, we propose a name-ethnicity classifier based on multinomial logistic regression. Our model identifies name-ethnicity from personal names in Wikipedia, which we use to define name-ethnicity, with 85% accuracy. Next, we propose a novel alignment-based name-matching algorithm built on the Smith-Waterman algorithm and logistic regression. Different name-matching models are then trained for different name-ethnicity groups. Our preliminary experiments on DBLP's disambiguated author dataset yield 99% precision and 89% recall. Surprisingly, textual features carry more weight than phonetic ones in name-ethnicity classification.
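The alignment step mentioned in the abstract can be illustrated with a minimal sketch of plain Smith-Waterman local alignment scoring. This is not the authors' matcher: the abstract does not give their scoring scheme, features, or trained logistic-regression models, so the match, mismatch, and gap values below are assumptions, and the function names `smith_waterman` and `normalized_similarity` are hypothetical.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local-alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the best score of a local alignment that ends
    # at a[i-1] and b[j-1]; row 0 and column 0 stay at zero.
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = table[i - 1][j - 1] + step   # align a[i-1] with b[j-1]
            up = table[i - 1][j] + gap              # gap in b
            left = table[i][j - 1] + gap            # gap in a
            # Clamping at zero is what makes the alignment local.
            table[i][j] = max(0, diagonal, up, left)
            best = max(best, table[i][j])
    return best


def normalized_similarity(a: str, b: str, match: int = 2) -> float:
    """Scale the score to [0, 1] by the best score the shorter name could reach."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    score = smith_waterman(a.lower(), b.lower(), match=match)
    return score / (match * shorter)


if __name__ == "__main__":
    for left_name, right_name in [("Giles", "giles"), ("smith", "smyth")]:
        print(left_name, right_name, normalized_similarity(left_name, right_name))
```

A score like this could serve as one input feature to a per-group logistic-regression matcher of the kind the abstract describes; how the paper actually derives its features from the alignment is not stated here.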

Original language: English (US)
Title of host publication: AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference
Pages: 1141-1147
Number of pages: 7
State: Published - Nov 7 2012
Event: 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference, AAAI-12 / IAAI-12 - Toronto, ON, Canada
Duration: Jul 22 2012 - Jul 26 2012

Publication series

Name: Proceedings of the National Conference on Artificial Intelligence
Volume: 2

Other

Other: 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference, AAAI-12 / IAAI-12
Country: Canada
City: Toronto, ON
Period: 7/22/12 - 7/26/12

Fingerprint

Logistics
String searching algorithms
Speech analysis
Classifiers

All Science Journal Classification (ASJC) codes

  • Software
  • Artificial Intelligence

Cite this

Treeratpituk, P., & Giles, C. L. (2012). Name-ethnicity classification and ethnicity-sensitive name matching. In AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference (pp. 1141-1147). (Proceedings of the National Conference on Artificial Intelligence; Vol. 2).
@inproceedings{bc0fbebf8c5047bd8c27b6fed88aefce,
title = "Name-ethnicity classification and ethnicity-sensitive name matching",
author = "Pucktada Treeratpituk and Giles, {C. Lee}",
year = "2012",
month = "11",
day = "7",
language = "English (US)",
isbn = "9781577355687",
series = "Proceedings of the National Conference on Artificial Intelligence",
pages = "1141--1147",
booktitle = "AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference",

}

TY - GEN

T1 - Name-ethnicity classification and ethnicity-sensitive name matching

AU - Treeratpituk, Pucktada

AU - Giles, C. Lee

PY - 2012/11/7

Y1 - 2012/11/7

UR - http://www.scopus.com/inward/record.url?scp=84868293536&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84868293536&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84868293536

SN - 9781577355687

T3 - Proceedings of the National Conference on Artificial Intelligence

SP - 1141

EP - 1147

BT - AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference

ER -
