Extracting researcher metadata with labeled features

Sujatha Das Gollapalli, Yanjun Qi, Prasenjit Mitra, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage, and expertise search. Due to the inherent diversity in values for certain metadata fields (e.g., affiliation), supervised algorithms require a large number of labeled examples to identify values for these fields accurately. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with a few fully labeled examples. We study two types of labeled features: (1) dictionary features, which provide unigram hints related to specific metadata fields, and (2) proximity features, which capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach, along with labeled features, provides significant improvements in tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improved by 9%.
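
As a rough illustration of the two ideas in the abstract (labeled dictionary features and a relative F1 increase), here is a minimal Python sketch. It is not the authors' method: the field names, the unigrams, the probabilities, and the voting rule in `tag_line` are all hypothetical choices made for illustration, and the F1 values in the usage lines are invented numbers chosen only to show the arithmetic of a 45% relative increase.

```python
from typing import Dict, List

# Hypothetical metadata fields; the abstract does not list the actual tag set.
FIELDS = ["affiliation", "email", "other"]

# Hypothetical dictionary-type labeled features: each unigram is paired with
# an expert-provided distribution over fields (values here are made up).
LABELED_FEATURES: Dict[str, Dict[str, float]] = {
    "university": {"affiliation": 0.9, "email": 0.0, "other": 0.1},
    "institute": {"affiliation": 0.9, "email": 0.0, "other": 0.1},
    "email": {"affiliation": 0.0, "email": 0.8, "other": 0.2},
}


def tag_line(tokens: List[str]) -> str:
    """Pick the field with the highest summed expert probability.

    This simple voting rule is an assumption for illustration only; it stands
    in for a trained model that would use such distributions as supervision.
    """
    scores = {field: 0.0 for field in FIELDS}
    for token in tokens:
        distribution = LABELED_FEATURES.get(token.lower())
        if distribution is not None:
            for field, probability in distribution.items():
                scores[field] += probability
    best = max(FIELDS, key=lambda field: scores[field])
    # Fall back to "other" when no labeled feature fired on this line.
    return best if scores[best] > 0 else "other"


def relative_increase(baseline: float, improved: float) -> float:
    """Relative change of a metric, e.g. 0.45 means a 45% relative increase."""
    return (improved - baseline) / baseline


if __name__ == "__main__":
    print(tag_line(["Pennsylvania", "State", "University"]))
    print(tag_line(["Welcome", "to", "my", "page"]))
    # Invented F1 values, used only to show how a relative increase is computed.
    print(round(relative_increase(0.20, 0.29), 2))
```

The point of the sketch is the data structure: a labeled feature is a feature paired with a distribution over fields, rather than a fully labeled example. A relative increase is measured against the baseline value, so it should not be read as an absolute gain in F1 points.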

Original language: English (US)
Title of host publication: SIAM International Conference on Data Mining 2014, SDM 2014
Editors: Pang Ning-Tan, Arindam Banerjee, Srinivasan Parthasarathy, Zoran Obradovic, Chandrika Kamath, Mohammed Zaki
Publisher: Society for Industrial and Applied Mathematics Publications
Pages: 740-748
Number of pages: 9
ISBN (Electronic): 9781510811515
DOIs: https://doi.org/10.1137/1.9781611973440.85
State: Published - Jan 1 2014
Event: 14th SIAM International Conference on Data Mining, SDM 2014 - Philadelphia, United States
Duration: Apr 24 2014 to Apr 26 2014

Publication series

Name: SIAM International Conference on Data Mining 2014, SDM 2014
Volume: 2

Other

Other: 14th SIAM International Conference on Data Mining, SDM 2014
Country: United States
City: Philadelphia
Period: 4/24/14 to 4/26/14

Fingerprint

Metadata
Labeling
Digital libraries
Glossaries
Learning systems
Experiments

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Software

Cite this

Das Gollapalli, S., Qi, Y., Mitra, P., & Giles, C. L. (2014). Extracting researcher metadata with labeled features. In P. Ning-Tan, A. Banerjee, S. Parthasarathy, Z. Obradovic, C. Kamath, & M. Zaki (Eds.), SIAM International Conference on Data Mining 2014, SDM 2014 (pp. 740-748). (SIAM International Conference on Data Mining 2014, SDM 2014; Vol. 2). Society for Industrial and Applied Mathematics Publications. https://doi.org/10.1137/1.9781611973440.85
Das Gollapalli, Sujatha ; Qi, Yanjun ; Mitra, Prasenjit ; Giles, C. Lee. / Extracting researcher metadata with labeled features. SIAM International Conference on Data Mining 2014, SDM 2014. editor / Pang Ning-Tan ; Arindam Banerjee ; Srinivasan Parthasarathy ; Zoran Obradovic ; Chandrika Kamath ; Mohammed Zaki. Society for Industrial and Applied Mathematics Publications, 2014. pp. 740-748 (SIAM International Conference on Data Mining 2014, SDM 2014).
@inproceedings{fae4e7c31de14d5a8425c458d7bf8ef8,
title = "Extracting researcher metadata with labeled features",
abstract = "Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage, and expertise search. Due to the inherent diversity in values for certain metadata fields (e.g., affiliation), supervised algorithms require a large number of labeled examples to identify values for these fields accurately. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with a few fully labeled examples. We study two types of labeled features: (1) dictionary features, which provide unigram hints related to specific metadata fields, and (2) proximity features, which capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach, along with labeled features, provides significant improvements in tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45{\%} relative increase in the F1 value for the affiliation field, while the overall F1 improved by 9{\%}.",
author = "{Das Gollapalli}, Sujatha and Yanjun Qi and Prasenjit Mitra and Giles, {C. Lee}",
year = "2014",
month = "1",
day = "1",
doi = "10.1137/1.9781611973440.85",
language = "English (US)",
series = "SIAM International Conference on Data Mining 2014, SDM 2014",
publisher = "Society for Industrial and Applied Mathematics Publications",
pages = "740--748",
editor = "Pang Ning-Tan and Arindam Banerjee and Srinivasan Parthasarathy and Zoran Obradovic and Chandrika Kamath and Mohammed Zaki",
booktitle = "SIAM International Conference on Data Mining 2014, SDM 2014",
address = "United States",

}

Das Gollapalli, S, Qi, Y, Mitra, P & Giles, CL 2014, Extracting researcher metadata with labeled features. in P Ning-Tan, A Banerjee, S Parthasarathy, Z Obradovic, C Kamath & M Zaki (eds), SIAM International Conference on Data Mining 2014, SDM 2014. SIAM International Conference on Data Mining 2014, SDM 2014, vol. 2, Society for Industrial and Applied Mathematics Publications, pp. 740-748, 14th SIAM International Conference on Data Mining, SDM 2014, Philadelphia, United States, 4/24/14. https://doi.org/10.1137/1.9781611973440.85

Extracting researcher metadata with labeled features. / Das Gollapalli, Sujatha; Qi, Yanjun; Mitra, Prasenjit; Giles, C. Lee.

SIAM International Conference on Data Mining 2014, SDM 2014. ed. / Pang Ning-Tan; Arindam Banerjee; Srinivasan Parthasarathy; Zoran Obradovic; Chandrika Kamath; Mohammed Zaki. Society for Industrial and Applied Mathematics Publications, 2014. p. 740-748 (SIAM International Conference on Data Mining 2014, SDM 2014; Vol. 2).

TY - GEN

T1 - Extracting researcher metadata with labeled features

AU - Das Gollapalli, Sujatha

AU - Qi, Yanjun

AU - Mitra, Prasenjit

AU - Giles, C. Lee

PY - 2014/1/1

Y1 - 2014/1/1

N2 - Professional homepages of researchers contain metadata that provides crucial evidence in several digital library tasks such as academic network extraction, record linkage, and expertise search. Due to the inherent diversity in values for certain metadata fields (e.g., affiliation), supervised algorithms require a large number of labeled examples to identify values for these fields accurately. We address this issue with feature labeling, a recent semi-supervised machine learning technique. We apply feature labeling to researcher metadata extraction from homepages by combining a small set of expert-provided feature distributions with a few fully labeled examples. We study two types of labeled features: (1) dictionary features, which provide unigram hints related to specific metadata fields, and (2) proximity features, which capture the layout information between metadata fields on a homepage in a second stage. We experimentally show that this two-stage approach, along with labeled features, provides significant improvements in tagging performance. In one experiment with only ten labeled homepages and 22 expert-specified labeled features, we obtained a 45% relative increase in the F1 value for the affiliation field, while the overall F1 improved by 9%.

UR - http://www.scopus.com/inward/record.url?scp=84959872994&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959872994&partnerID=8YFLogxK

U2 - 10.1137/1.9781611973440.85

DO - 10.1137/1.9781611973440.85

M3 - Conference contribution

AN - SCOPUS:84959872994

T3 - SIAM International Conference on Data Mining 2014, SDM 2014

SP - 740

EP - 748

BT - SIAM International Conference on Data Mining 2014, SDM 2014

A2 - Ning-Tan, Pang

A2 - Banerjee, Arindam

A2 - Parthasarathy, Srinivasan

A2 - Obradovic, Zoran

A2 - Kamath, Chandrika

A2 - Zaki, Mohammed

PB - Society for Industrial and Applied Mathematics Publications

ER -

Das Gollapalli S, Qi Y, Mitra P, Giles CL. Extracting researcher metadata with labeled features. In Ning-Tan P, Banerjee A, Parthasarathy S, Obradovic Z, Kamath C, Zaki M, editors, SIAM International Conference on Data Mining 2014, SDM 2014. Society for Industrial and Applied Mathematics Publications. 2014. p. 740-748. (SIAM International Conference on Data Mining 2014, SDM 2014). https://doi.org/10.1137/1.9781611973440.85