A supervised learning approach to entity matching between scholarly big datasets

Jian Wu, Athar Sefid, Allen C. Ge, Clyde Lee Giles

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Bibliographic metadata in scientific documents are essential for indexing and retrieving scholarly big data in production search engines and bibliometric studies. Crawl-based digital library search engines can harvest millions of documents efficiently, but the metadata produced by automatic extractors are often noisy, incomplete, or marred by parsing errors. Such metadata can be cleaned given a reference database. In this work, we develop a supervised machine learning approach that matches entities in a target database to those in a reference database, and the matches can then be used to clean metadata in the target database. The approach leverages a number of features extracted from the headers available in automatic extraction results. After tuning combinations of hyper-parameters and sampling strategies, the best Support Vector Machine, Logistic Regression, Random Forest, and Naïve Bayes models perform comparably, with an F1-measure of about 90%, outperforming an information-retrieval-only method by about 14%, as evaluated with cross-validation.
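As a rough illustration of the setup the abstract describes, the sketch below turns one (target record, reference record) pair into a small feature vector and computes the F1-measure used for evaluation. The abstract does not list the paper's actual features, so the three used here (title token overlap, author last-name overlap, year agreement), the record field names, and the function names are all assumptions made for illustration. A classifier such as those named in the abstract would be trained on vectors like these; that step is omitted.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def last_names(authors):
    """Lowercased last whitespace-separated token of each non-empty name."""
    return {a.split()[-1].lower() for a in authors if a.split()}


def pair_features(target, reference):
    """Hypothetical feature vector for one candidate pair of header records.

    Both records are assumed to be dicts with 'title', 'authors', and 'year'.
    """
    title_sim = jaccard(tokens(target["title"]), tokens(reference["title"]))
    author_sim = jaccard(last_names(target["authors"]),
                         last_names(reference["authors"]))
    year_match = 1.0 if target.get("year") == reference.get("year") else 0.0
    return [title_sim, author_sim, year_match]


def f1_score(y_true, y_pred):
    """F1-measure for binary labels, where 1 means 'same entity'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Under cross-validation, the labeled pairs would be split into folds, a model fit on the training folds, and `f1_score` averaged over the held-out folds.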

Original language: English (US)
Title of host publication: Proceedings of the Knowledge Capture Conference, K-CAP 2017
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450355537
DOIs: https://doi.org/10.1145/3148011.3154470
State: Published - Dec 4 2017
Event: 9th International Conference on Knowledge Capture, K-CAP 2017 - Austin, United States
Duration: Dec 4 2017 to Dec 6 2017

Publication series

Name: Proceedings of the Knowledge Capture Conference, K-CAP 2017

Other

Other: 9th International Conference on Knowledge Capture, K-CAP 2017
Country: United States
City: Austin
Period: 12/4/17 to 12/6/17

Fingerprint

Supervised learning
Metadata
Search engines
Digital libraries
Bibliographies
Information retrieval
Support vector machines
Learning systems
Logistics
Sampling

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Software
  • Computer Science Applications
  • Information Systems

Cite this

Wu, J., Sefid, A., Ge, A. C., & Giles, C. L. (2017). A supervised learning approach to entity matching between scholarly big datasets. In Proceedings of the Knowledge Capture Conference, K-CAP 2017 [41] (Proceedings of the Knowledge Capture Conference, K-CAP 2017). Association for Computing Machinery, Inc. https://doi.org/10.1145/3148011.3154470
Wu, Jian ; Sefid, Athar ; Ge, Allen C. ; Giles, Clyde Lee. / A supervised learning approach to entity matching between scholarly big datasets. Proceedings of the Knowledge Capture Conference, K-CAP 2017. Association for Computing Machinery, Inc, 2017. (Proceedings of the Knowledge Capture Conference, K-CAP 2017).
@inproceedings{9382de0e31a6466480c8ac67488c8b3d,
title = "A supervised learning approach to entity matching between scholarly big datasets",
abstract = "Bibliography metadata in scientific documents are essential in indexing and retrieval of scholarly big data for production search engines and bibliometrics research studies. Crawl-based digital library search engines can harvest millions of documents efficiently but metadata information extracted by automatic extractors are often noisy, incomplete, and/or with parsing errors. These metadata could be cleaned given a reference database. In this work, we develop a supervised machine learning based approach to match entities in a target database to a reference database, which can further be used to clean metadata in the target database. The approach leverages a number of features extracted from headers available from automatic extraction results. By adjusting combinations of hyper-parameters and various sampling strategies, the best results of Support Vector Machines, Logistic Regression, Random Forests, and Na{\"i}ve Bayes models give comparable results, with F1-measure of about 90{\%}, outperforming information retrieval only based method by about 14{\%}, evaluated with cross validation.",
author = "Jian Wu and Athar Sefid and Ge, {Allen C.} and Giles, {Clyde Lee}",
year = "2017",
month = "12",
day = "4",
doi = "10.1145/3148011.3154470",
language = "English (US)",
series = "Proceedings of the Knowledge Capture Conference, K-CAP 2017",
publisher = "Association for Computing Machinery, Inc",
booktitle = "Proceedings of the Knowledge Capture Conference, K-CAP 2017",

}

Wu, J, Sefid, A, Ge, AC & Giles, CL 2017, A supervised learning approach to entity matching between scholarly big datasets. in Proceedings of the Knowledge Capture Conference, K-CAP 2017., 41, Proceedings of the Knowledge Capture Conference, K-CAP 2017, Association for Computing Machinery, Inc, 9th International Conference on Knowledge Capture, K-CAP 2017, Austin, United States, 12/4/17. https://doi.org/10.1145/3148011.3154470

A supervised learning approach to entity matching between scholarly big datasets. / Wu, Jian; Sefid, Athar; Ge, Allen C.; Giles, Clyde Lee.

Proceedings of the Knowledge Capture Conference, K-CAP 2017. Association for Computing Machinery, Inc, 2017. 41 (Proceedings of the Knowledge Capture Conference, K-CAP 2017).

TY - GEN

T1 - A supervised learning approach to entity matching between scholarly big datasets

AU - Wu, Jian

AU - Sefid, Athar

AU - Ge, Allen C.

AU - Giles, Clyde Lee

PY - 2017/12/4

Y1 - 2017/12/4

AB - Bibliography metadata in scientific documents are essential in indexing and retrieval of scholarly big data for production search engines and bibliometrics research studies. Crawl-based digital library search engines can harvest millions of documents efficiently but metadata information extracted by automatic extractors are often noisy, incomplete, and/or with parsing errors. These metadata could be cleaned given a reference database. In this work, we develop a supervised machine learning based approach to match entities in a target database to a reference database, which can further be used to clean metadata in the target database. The approach leverages a number of features extracted from headers available from automatic extraction results. By adjusting combinations of hyper-parameters and various sampling strategies, the best results of Support Vector Machines, Logistic Regression, Random Forests, and Naïve Bayes models give comparable results, with F1-measure of about 90%, outperforming information retrieval only based method by about 14%, evaluated with cross validation.

UR - http://www.scopus.com/inward/record.url?scp=85040623747&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85040623747&partnerID=8YFLogxK

U2 - 10.1145/3148011.3154470

DO - 10.1145/3148011.3154470

M3 - Conference contribution

AN - SCOPUS:85040623747

T3 - Proceedings of the Knowledge Capture Conference, K-CAP 2017

BT - Proceedings of the Knowledge Capture Conference, K-CAP 2017

PB - Association for Computing Machinery, Inc

ER -

Wu J, Sefid A, Ge AC, Giles CL. A supervised learning approach to entity matching between scholarly big datasets. In Proceedings of the Knowledge Capture Conference, K-CAP 2017. Association for Computing Machinery, Inc. 2017. 41. (Proceedings of the Knowledge Capture Conference, K-CAP 2017). https://doi.org/10.1145/3148011.3154470