Enhancing cross document coreference of web documents with context similarity and very large scale text categorization

Jian Huang, Pucktada Treeratpituk, Sarah M. Taylor, C. Lee Giles

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Cross Document Coreference (CDC) is the task of constructing the coreference chain for mentions of a person across a set of documents. This work offers a holistic view of using document-level categories, sub-document-level context, and extracted entities and relations for the CDC task. We train a categorization component with an efficient flat algorithm using thousands of ODP categories and over a million web documents. We propose to use ranked categories as coreference information, which is particularly suitable for web documents that differ widely in style and content. An ensemble composite coreference function, amenable to inactive features, combines these three levels of evidence for disambiguation. A thorough feature importance study is conducted to analyze how these three components contribute to the coreference results. The overall solution is evaluated on the WePS benchmark data and demonstrates superior performance.

Original language: English (US)
Title of host publication: Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference
Pages: 483-491
Number of pages: 9
Volume: 2
State: Published - 2010
Event: 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China
Duration: Aug 23 2010 to Aug 27 2010

Other

Other: 23rd International Conference on Computational Linguistics, Coling 2010
Country: China
City: Beijing
Period: 8/23/10 to 8/27/10

Fingerprint

Composite materials
World Wide Web
Coreference
human being
performance
evidence

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Linguistics and Language

Cite this

Huang, J., Treeratpituk, P., Taylor, S. M., & Giles, C. L. (2010). Enhancing cross document coreference of web documents with context similarity and very large scale text categorization. In Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference (Vol. 2, pp. 483-491).
@inproceedings{a00e425ef5ef4d0789e9df48d3c2a364,
title = "Enhancing cross document coreference of web documents with context similarity and very large scale text categorization",
abstract = "Cross Document Coreference (CDC) is the task of constructing the coreference chain for mentions of a person across a set of documents. This work offers a holistic view of using document-level categories, sub-document-level context, and extracted entities and relations for the CDC task. We train a categorization component with an efficient flat algorithm using thousands of ODP categories and over a million web documents. We propose to use ranked categories as coreference information, which is particularly suitable for web documents that differ widely in style and content. An ensemble composite coreference function, amenable to inactive features, combines these three levels of evidence for disambiguation. A thorough feature importance study is conducted to analyze how these three components contribute to the coreference results. The overall solution is evaluated on the WePS benchmark data and demonstrates superior performance.",
author = "Jian Huang and Pucktada Treeratpituk and Taylor, {Sarah M.} and Giles, {C. Lee}",
year = "2010",
language = "English (US)",
volume = "2",
pages = "483--491",
booktitle = "Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference",

}


TY - GEN
T1 - Enhancing cross document coreference of web documents with context similarity and very large scale text categorization
AU - Huang, Jian
AU - Treeratpituk, Pucktada
AU - Taylor, Sarah M.
AU - Giles, C. Lee
PY - 2010
Y1 - 2010
AB - Cross Document Coreference (CDC) is the task of constructing the coreference chain for mentions of a person across a set of documents. This work offers a holistic view of using document-level categories, sub-document-level context, and extracted entities and relations for the CDC task. We train a categorization component with an efficient flat algorithm using thousands of ODP categories and over a million web documents. We propose to use ranked categories as coreference information, which is particularly suitable for web documents that differ widely in style and content. An ensemble composite coreference function, amenable to inactive features, combines these three levels of evidence for disambiguation. A thorough feature importance study is conducted to analyze how these three components contribute to the coreference results. The overall solution is evaluated on the WePS benchmark data and demonstrates superior performance.
UR - http://www.scopus.com/inward/record.url?scp=80053404098&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053404098&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:80053404098
VL - 2
SP - 483
EP - 491
BT - Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference
ER -
