Rule-based word clustering for document metadata extraction

Hui Han, Kostas Tsioutsiouliklis, Eren Manavoglu, C. Lee Giles, Hongyuan Zha, Xiangmin Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

15 Citations (Scopus)

Abstract

Text classification remains an important problem for unlabeled text; CiteSeer, a computer science document search engine, uses automatic text classification methods for document indexing. Text classification uses a document's original text words as the primary feature representation. However, such a representation usually suffers from high dimensionality and feature sparseness. Word clustering is an effective approach to reducing feature dimensionality and sparseness and to improving text classification performance. This paper introduces a domain rule-based word clustering method for cluster feature representation. The clusters are formed from various domain databases and from the words' orthographic properties. Besides significant dimensionality reduction, such cluster feature representations show a 6.6% absolute improvement on average in the classification performance of document header lines and an 8.4% absolute improvement in the overall accuracy of bibliographic field extraction, compared with a feature representation based only on the original text words. Our word clustering even outperforms distributional word clustering in the context of document metadata extraction.
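
The idea of replacing raw words with rule-derived cluster labels can be illustrated with a minimal sketch. It is not the authors' implementation: the word lists, the label names (":affi:", ":cap:", and so on), and the functions cluster_of and to_cluster_features are all hypothetical, chosen only to show how domain lookups and orthographic tests can collapse a large vocabulary into a few features.

```python
import re

# Hypothetical stand-ins for the "domain databases" named in the abstract;
# the real word lists and cluster labels used in the paper are not given here.
DOMAIN_LEXICONS = {
    ":affi:": {"university", "institute", "department", "laboratory"},
    ":month:": {"january", "february", "march", "april"},
    ":degree:": {"phd", "msc", "bsc"},
}

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
INITIAL_RE = re.compile(r"[A-Z]\.")


def cluster_of(token: str) -> str:
    """Map one token to a cluster label (illustrative rules only)."""
    # Orthographic rule on the raw token: a single capital letter plus a period.
    if INITIAL_RE.fullmatch(token):
        return ":initial:"
    # Drop surrounding punctuation before the remaining checks.
    core = token.strip(".,;:()")
    if not core:
        return ":punct:"
    # Domain rule: look the lowercased word up in each word list.
    lowered = core.lower()
    for label, words in DOMAIN_LEXICONS.items():
        if lowered in words:
            return label
    # Remaining orthographic rules, from most to least specific.
    if EMAIL_RE.fullmatch(core):
        return ":email:"
    if core.isdigit():
        return ":digit:"
    if core[0].isupper():
        return ":cap:"
    return ":other:"


def to_cluster_features(line: str) -> list:
    """Replace every whitespace-separated token in a line by its cluster label."""
    return [cluster_of(tok) for tok in line.split()]


header = "Hui Han, Department of Computer Science, March 2005"
print(to_cluster_features(header))
```

Under these assumed rules, every capitalized word that is not in a word list shares one label, so a classifier sees a handful of cluster features per line rather than one feature per distinct word.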

Original language: English (US)
Title of host publication: Applied Computing 2005 - Proceedings of the 20th Annual ACM Symposium on Applied Computing
Pages: 1049-1053
Number of pages: 5
Volume: 2
State: Published - 2005
Event: 20th Annual ACM Symposium on Applied Computing - Santa Fe, NM, United States
Duration: Mar 13 2005 to Mar 17 2005

Fingerprint

Metadata
Indexing (of information)
Search engines
Computer science

All Science Journal Classification (ASJC) codes

  • Software

Cite this

Han, H., Tsioutsiouliklis, K., Manavoglu, E., Giles, C. L., Zha, H., & Zhang, X. (2005). Rule-based word clustering for document metadata extraction. In Applied Computing 2005 - Proceedings of the 20th Annual ACM Symposium on Applied Computing (Vol. 2, pp. 1049-1053).
Han, Hui; Tsioutsiouliklis, Kostas; Manavoglu, Eren; Giles, C. Lee; Zha, Hongyuan; Zhang, Xiangmin. / Rule-based word clustering for document metadata extraction. Applied Computing 2005 - Proceedings of the 20th Annual ACM Symposium on Applied Computing. Vol. 2, 2005. pp. 1049-1053.
@inproceedings{b4d1dc12e2a44451a1c924afefde7e6a,
title = "Rule-based word clustering for document metadata extraction",
abstract = "Text classification remains an important problem for unlabeled text; CiteSeer, a computer science document search engine, uses automatic text classification methods for document indexing. Text classification uses a document's original text words as the primary feature representation. However, such a representation usually suffers from high dimensionality and feature sparseness. Word clustering is an effective approach to reducing feature dimensionality and sparseness and to improving text classification performance. This paper introduces a domain rule-based word clustering method for cluster feature representation. The clusters are formed from various domain databases and from the words' orthographic properties. Besides significant dimensionality reduction, such cluster feature representations show a 6.6{\%} absolute improvement on average in the classification performance of document header lines and an 8.4{\%} absolute improvement in the overall accuracy of bibliographic field extraction, compared with a feature representation based only on the original text words. Our word clustering even outperforms distributional word clustering in the context of document metadata extraction.",
author = "Hui Han and Kostas Tsioutsiouliklis and Eren Manavoglu and Giles, C. Lee and Hongyuan Zha and Xiangmin Zhang",
year = "2005",
language = "English (US)",
volume = "2",
pages = "1049--1053",
booktitle = "Applied Computing 2005 - Proceedings of the 20th Annual ACM Symposium on Applied Computing",

}

Han, H, Tsioutsiouliklis, K, Manavoglu, E, Giles, CL, Zha, H & Zhang, X 2005, Rule-based word clustering for document metadata extraction. in Applied Computing 2005 - Proceedings of the 20th Annual ACM Symposium on Applied Computing. vol. 2, pp. 1049-1053, 20th Annual ACM Symposium on Applied Computing, Santa Fe, NM, United States, 3/13/05.

TY - GEN

T1 - Rule-based word clustering for document metadata extraction

AU - Han, Hui

AU - Tsioutsiouliklis, Kostas

AU - Manavoglu, Eren

AU - Giles, C. Lee

AU - Zha, Hongyuan

AU - Zhang, Xiangmin

PY - 2005

Y1 - 2005

N2 - Text classification remains an important problem for unlabeled text; CiteSeer, a computer science document search engine, uses automatic text classification methods for document indexing. Text classification uses a document's original text words as the primary feature representation. However, such a representation usually suffers from high dimensionality and feature sparseness. Word clustering is an effective approach to reducing feature dimensionality and sparseness and to improving text classification performance. This paper introduces a domain rule-based word clustering method for cluster feature representation. The clusters are formed from various domain databases and from the words' orthographic properties. Besides significant dimensionality reduction, such cluster feature representations show a 6.6% absolute improvement on average in the classification performance of document header lines and an 8.4% absolute improvement in the overall accuracy of bibliographic field extraction, compared with a feature representation based only on the original text words. Our word clustering even outperforms distributional word clustering in the context of document metadata extraction.

UR - http://www.scopus.com/inward/record.url?scp=33644558351&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33644558351&partnerID=8YFLogxK

M3 - Conference contribution

VL - 2

SP - 1049

EP - 1053

BT - Applied Computing 2005 - Proceedings of the 20th Annual ACM Symposium on Applied Computing

ER -

Han H, Tsioutsiouliklis K, Manavoglu E, Giles CL, Zha H, Zhang X. Rule-based word clustering for document metadata extraction. In Applied Computing 2005 - Proceedings of the 20th Annual ACM Symposium on Applied Computing. Vol. 2. 2005. p. 1049-1053.