Automatic document metadata extraction using support vector machines

Hui Han, C. L. Giles, E. Manavoglu, Hongyuan Zha, Zhenyue Zhang, E. A. Fox

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

195 Citations (Scopus)

Abstract

Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a support vector machine classification-based method for metadata extraction from the header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure is then used to improve the line classification by using the predicted class labels of its neighboring lines in the previous round. Further metadata extraction is done by seeking the best chunk boundaries of each line. We found that discovery and use of the structural patterns of the data and domain-based word clustering can improve the metadata extraction performance. An appropriate feature normalization also greatly improves the classification performance. Our metadata extraction method was originally designed to improve the metadata extraction quality of the digital libraries Citeseer [S. Lawrence et al., (1999)] and EbizSearch [Y. Petinot et al., (2003)]. We believe it can be generalized to other digital libraries.
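
The iterative step described in the abstract (re-classifying each header line using its neighbors' labels from the previous round until the labels stop changing) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the rule-based `toy_classifier` is a hypothetical stand-in for the SVM classifiers, the label names and the `UNKNOWN`/`BOUNDARY` markers are invented for illustration, and each line receives a single label, whereas the paper allows one or more of 15 classes per line.

```python
from typing import Callable, List, Sequence

# Hypothetical context markers (assumptions, not taken from the paper).
UNKNOWN = "unknown"    # neighbor label not yet predicted (first pass)
BOUNDARY = "boundary"  # no neighbor exists (first or last line of the header)

# A classifier maps (line text, previous-line label, next-line label) to a label.
Classifier = Callable[[str, str, str], str]


def toy_classifier(line: str, prev_label: str, next_label: str) -> str:
    """Rule-based placeholder for the paper's SVM; the rules are invented.

    next_label is accepted for interface completeness but ignored here.
    """
    text = line.strip()
    if "@" in text:
        return "email"
    if "university" in text.lower():
        return "affiliation"
    if prev_label == BOUNDARY:   # first line of the header
        return "title"
    if prev_label == "title":    # line directly after a title
        return "author"
    return "other"


def iterative_line_labels(
    lines: Sequence[str], classify: Classifier, max_rounds: int = 10
) -> List[str]:
    """Relabel lines with neighbor labels from the previous round until stable."""
    # Round 0: classify every line independently, with no context available.
    labels = [classify(line, UNKNOWN, UNKNOWN) for line in lines]
    for _ in range(max_rounds):
        new_labels = []
        for i, line in enumerate(lines):
            prev_label = labels[i - 1] if i > 0 else BOUNDARY
            next_label = labels[i + 1] if i + 1 < len(lines) else BOUNDARY
            new_labels.append(classify(line, prev_label, next_label))
        if new_labels == labels:  # converged: no label changed this round
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    # Made-up header lines, for illustration only.
    header = [
        "A Sample Paper Title",
        "Jane Doe",
        "Example University",
        "jane@example.org",
    ]
    print(iterative_line_labels(header, toy_classifier))
```

In this toy run the author line can only be recognized once the title line above it has been labeled, which is why more than one round is needed; a real system would substitute trained classifiers for the hand-written rules.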

Original language: English (US)
Title of host publication: Proceedings - 2003 Joint Conference on Digital Libraries, JCDL 2003
Editors: Lois Delcambre, Geneva Henry, Catherine C. Marshall
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 37-48
Number of pages: 12
ISBN (Electronic): 0769519393
DOIs: https://doi.org/10.1109/JCDL.2003.1204842
State: Published - Jan 1 2003
Event: 2003 Joint Conference on Digital Libraries, JCDL 2003 - Houston, United States
Duration: May 27 2003 to May 31 2003

Publication series

Name: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
Volume: 2003-January
ISSN (Print): 1552-5996

Other

Other: 2003 Joint Conference on Digital Libraries, JCDL 2003
Country: United States
City: Houston
Period: 5/27/03 to 5/31/03

Fingerprint

Metadata
Support vector machines
Digital libraries
Learning systems
Scalability
Labels

All Science Journal Classification (ASJC) codes

  • Engineering(all)

Cite this

Han, H., Giles, C. L., Manavoglu, E., Zha, H., Zhang, Z., & Fox, E. A. (2003). Automatic document metadata extraction using support vector machines. In L. Delcambre, G. Henry, & C. C. Marshall (Eds.), Proceedings - 2003 Joint Conference on Digital Libraries, JCDL 2003 (pp. 37-48). [1204842] (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries; Vol. 2003-January). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/JCDL.2003.1204842
Han, Hui ; Giles, C. L. ; Manavoglu, E. ; Zha, Hongyuan ; Zhang, Zhenyue ; Fox, E. A. / Automatic document metadata extraction using support vector machines. Proceedings - 2003 Joint Conference on Digital Libraries, JCDL 2003. editor / Lois Delcambre ; Geneva Henry ; Catherine C. Marshall. Institute of Electrical and Electronics Engineers Inc., 2003. pp. 37-48 (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries).
@inproceedings{406c65129d0e416eae5e4396b8a628c7,
title = "Automatic document metadata extraction using support vector machines",
abstract = "Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a support vector machine classification-based method for metadata extraction from header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure is then used to improve the line classification by using the predicted class labels of its neighbor lines in the previous round. Further metadata extraction is done by seeking the best chunk boundaries of each line. We found that discovery and use of the structural patterns of the data and domain based word clustering can improve the metadata extraction performance. An appropriate feature normalization also greatly improves the classification performance. Our metadata extraction method was originally designed to improve the metadata extraction quality of the digital libraries Citeseer [S. Lawrence et al., (1999)] and EbizSearch [Y. Petinot et al., (2003)]. We believe it can be generalized to other digital libraries.",
author = "Hui Han and Giles, {C. L.} and E. Manavoglu and Hongyuan Zha and Zhenyue Zhang and Fox, {E. A.}",
year = "2003",
month = "1",
day = "1",
doi = "10.1109/JCDL.2003.1204842",
language = "English (US)",
series = "Proceedings of the ACM/IEEE Joint Conference on Digital Libraries",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "37--48",
editor = "Lois Delcambre and Geneva Henry and Marshall, {Catherine C.}",
booktitle = "Proceedings - 2003 Joint Conference on Digital Libraries, JCDL 2003",
address = "United States",

}

Han, H, Giles, CL, Manavoglu, E, Zha, H, Zhang, Z & Fox, EA 2003, Automatic document metadata extraction using support vector machines. in L Delcambre, G Henry & CC Marshall (eds), Proceedings - 2003 Joint Conference on Digital Libraries, JCDL 2003, 1204842, Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, vol. 2003-January, Institute of Electrical and Electronics Engineers Inc., pp. 37-48, 2003 Joint Conference on Digital Libraries, JCDL 2003, Houston, United States, 5/27/03. https://doi.org/10.1109/JCDL.2003.1204842

Automatic document metadata extraction using support vector machines. / Han, Hui; Giles, C. L.; Manavoglu, E.; Zha, Hongyuan; Zhang, Zhenyue; Fox, E. A.

Proceedings - 2003 Joint Conference on Digital Libraries, JCDL 2003. ed. / Lois Delcambre; Geneva Henry; Catherine C. Marshall. Institute of Electrical and Electronics Engineers Inc., 2003. p. 37-48 1204842 (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries; Vol. 2003-January).

TY - GEN
T1 - Automatic document metadata extraction using support vector machines
AU - Han, Hui
AU - Giles, C. L.
AU - Manavoglu, E.
AU - Zha, Hongyuan
AU - Zhang, Zhenyue
AU - Fox, E. A.
PY - 2003/1/1
Y1 - 2003/1/1
AB - Automatic metadata generation provides scalability and usability for digital libraries and their collections. Machine learning methods offer robust and adaptable automatic metadata extraction. We describe a support vector machine classification-based method for metadata extraction from header part of research papers and show that it outperforms other machine learning methods on the same task. The method first classifies each line of the header into one or more of 15 classes. An iterative convergence procedure is then used to improve the line classification by using the predicted class labels of its neighbor lines in the previous round. Further metadata extraction is done by seeking the best chunk boundaries of each line. We found that discovery and use of the structural patterns of the data and domain based word clustering can improve the metadata extraction performance. An appropriate feature normalization also greatly improves the classification performance. Our metadata extraction method was originally designed to improve the metadata extraction quality of the digital libraries Citeseer [S. Lawrence et al., (1999)] and EbizSearch [Y. Petinot et al., (2003)]. We believe it can be generalized to other digital libraries.
UR - http://www.scopus.com/inward/record.url?scp=84941274546&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84941274546&partnerID=8YFLogxK
U2 - 10.1109/JCDL.2003.1204842
DO - 10.1109/JCDL.2003.1204842
M3 - Conference contribution
AN - SCOPUS:84941274546
T3 - Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
SP - 37
EP - 48
BT - Proceedings - 2003 Joint Conference on Digital Libraries, JCDL 2003
A2 - Delcambre, Lois
A2 - Henry, Geneva
A2 - Marshall, Catherine C.
PB - Institute of Electrical and Electronics Engineers Inc.
ER -

Han H, Giles CL, Manavoglu E, Zha H, Zhang Z, Fox EA. Automatic document metadata extraction using support vector machines. In Delcambre L, Henry G, Marshall CC, editors, Proceedings - 2003 Joint Conference on Digital Libraries, JCDL 2003. Institute of Electrical and Electronics Engineers Inc. 2003. p. 37-48. 1204842. (Proceedings of the ACM/IEEE Joint Conference on Digital Libraries). https://doi.org/10.1109/JCDL.2003.1204842