Inventor name disambiguation for a patent database using a random forest and DBSCAN

Kunho Kim, Madian Khabsa, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

Inventor name disambiguation is the task of distinguishing each unique inventor from all other inventor records in a patent database. This task is essential for processing person name queries that seek information about a specific inventor, e.g. a list of all that inventor's patents. We adapt earlier work on author name disambiguation to inventor name disambiguation. A random forest classifier is trained to classify whether each pair of inventor records refers to the same person. The DBSCAN algorithm is used for inventor record clustering, and its distance function is derived from the random forest classifier. For scalability, blocking functions are used to reduce the complexity of record matching and to enable parallelization, since each block can be processed simultaneously. Tested on the USPTO patent database, the method disambiguated 12 million inventor records in 6.5 hours. Evaluation on the labeled datasets from the USPTO PatentsView competition shows that our algorithm outperforms all algorithms submitted to the competition.
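The pipeline described in the abstract (blocking, a pairwise same-person score, and DBSCAN over a distance derived from that score) can be illustrated with a minimal sketch. This is not the authors' implementation. The sample records, the blocking key (last name plus first initial), the `eps` and `min_pts` values, and the treatment of DBSCAN noise points as single-record inventors are all assumptions made for illustration, and `same_person_probability` is a hand-written stand-in for the trained random forest, whose features the abstract does not specify.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical inventor records: (record_id, first_name, last_name, city).
RECORDS = [
    (0, "John", "Smith", "Austin"),
    (1, "Jon", "Smith", "Austin"),
    (2, "John", "Smith", "Boston"),
    (3, "Jane", "Smith", "Denver"),
    (4, "Wei", "Zhang", "San Jose"),
    (5, "Wei", "Zhang", "San Jose"),
]

def block_key(rec):
    """Assumed blocking function: last name plus first initial."""
    _, first, last, _ = rec
    return (last.lower(), first[:1].lower())

def same_person_probability(a, b):
    """Stand-in for the trained random forest's P(same person).

    A real system would call a fitted classifier on pairwise features;
    here the score is a fixed mix of first-name and city similarity.
    """
    name_sim = SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio()
    city_sim = 1.0 if a[3] == b[3] else 0.0
    return 0.5 * name_sim + 0.5 * city_sim

def distance(a, b):
    """Distance derived from the pairwise score: 1 - P(same person)."""
    return 1.0 - same_person_probability(a, b)

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns {index: cluster label}, with -1 meaning noise."""
    labels = {}
    cluster = -1

    def neighbors(i):
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels.get(j) == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if j in labels:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(j_nbrs)
    return labels

def disambiguate(records, eps=0.3, min_pts=2):
    """Group records into blocks, cluster each block, return sets of record ids."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)

    clusters = []
    # Blocks share no records, so this loop could be distributed across workers.
    for block in blocks.values():
        labels = dbscan(block, eps, min_pts)
        grouped = defaultdict(set)
        for idx, label in labels.items():
            if label == -1:
                clusters.append({block[idx][0]})  # assumption: noise = singleton
            else:
                grouped[label].add(block[idx][0])
        clusters.extend(grouped.values())
    return clusters

if __name__ == "__main__":
    for inventor in disambiguate(RECORDS):
        print(sorted(inventor))
```

Because pairs are compared only within a block, the number of pairwise scores grows with the square of the block size rather than the square of the full record count, which is the complexity reduction the abstract attributes to blocking.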

Original language: English (US)
Title of host publication: JCDL 2016 - Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 269-270
Number of pages: 2
Volume: 2016-September
ISBN (Electronic): 9781450342292
DOIs: https://doi.org/10.1145/2910896.2925465
State: Published - Sep 1 2016
Event: 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016 - Newark, United States
Duration: Jun 19 2016 to Jun 23 2016

Other

Other: 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016
Country: United States
City: Newark
Period: 6/19/16 to 6/23/16

Fingerprint

Classifiers
Scalability
Processing

All Science Journal Classification (ASJC) codes

  • Engineering(all)

Cite this

Kim, K., Khabsa, M., & Giles, C. L. (2016). Inventor name disambiguation for a patent database using a random forest and DBSCAN. In JCDL 2016 - Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (Vol. 2016-September, pp. 269-270). [7559618] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/2910896.2925465