Towards noise-resilient document modeling

Tao Yang, Dongwon Lee

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

We introduce a generative probabilistic document model, based on latent Dirichlet allocation (LDA), that handles textual errors in a document collection. The model is motivated by the observation that most large-scale text data are machine-generated and thus inevitably contain many types of noise. The new model, termed TE-LDA, extends traditional LDA by adding a switch variable to the term generation process to handle noisy text. Extensive experiments on both real and synthetic data sets validate the efficacy of the proposed model.
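The abstract does not spell out how the switch variable works, so the following is only a minimal sketch of one plausible reading, not the authors' TE-LDA specification. It assumes the switch is a Bernoulli draw with a fixed rate (`error_rate`, hypothetical) that routes each token either to the usual LDA path (topic, then word) or to a separate error-term distribution (`error_word`, also hypothetical). All function and parameter names are illustrative.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(n_tokens, topic_word, error_word, alpha, error_rate, rng):
    """Generate (word_id, is_error) pairs for one document.

    topic_word: K distributions over the vocabulary, one per topic.
    error_word: one distribution over the vocabulary for noisy terms (assumed).
    error_rate: probability that the switch picks the error path (assumed).
    """
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture, as in LDA
    tokens = []
    for _ in range(n_tokens):
        is_error = rng.random() < error_rate  # the switch variable
        if is_error:
            # Noisy token: drawn from the error-term distribution, no topic involved.
            word = sample_categorical(error_word, rng)
        else:
            # Clean token: standard LDA draw of a topic, then a word from that topic.
            topic = sample_categorical(theta, rng)
            word = sample_categorical(topic_word[topic], rng)
        tokens.append((word, is_error))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    topics = [[0.5, 0.5, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5, 0.0, 0.0]]
    errors = [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]
    print(generate_document(10, topics, errors, [0.5, 0.5], 0.3, rng))
```

In this sketch the error terms occupy their own vocabulary slots, which makes the effect of the switch easy to inspect. An inference procedure would treat the switch as latent and estimate it together with the topic assignments.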

Original language: English (US)
Title of host publication: CIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management
Pages: 2345-2348
Number of pages: 4
DOI: https://doi.org/10.1145/2063576.2063962
State: Published - Dec 13 2011
Event: 20th ACM Conference on Information and Knowledge Management, CIKM'11 - Glasgow, United Kingdom
Duration: Oct 24 2011 to Oct 28 2011

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Other

Other: 20th ACM Conference on Information and Knowledge Management, CIKM'11
Country: United Kingdom
City: Glasgow
Period: 10/24/11 to 10/28/11

Fingerprint

Modeling
Dirichlet
Efficacy
Experiment

All Science Journal Classification (ASJC) codes

  • Decision Sciences(all)
  • Business, Management and Accounting(all)

Cite this

Yang, T., & Lee, D. (2011). Towards noise-resilient document modeling. In CIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management (pp. 2345-2348). (International Conference on Information and Knowledge Management, Proceedings). https://doi.org/10.1145/2063576.2063962
Yang, Tao ; Lee, Dongwon. / Towards noise-resilient document modeling. CIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management. 2011. pp. 2345-2348 (International Conference on Information and Knowledge Management, Proceedings).
@inproceedings{ab9432c45c3342eba2840888839dc8e9,
title = "Towards noise-resilient document modeling",
abstract = "We introduce a generative probabilistic document model based on latent Dirichlet allocation (LDA), to deal with textual errors in the document collection. Our model is inspired by the fact that most large-scale text data are machine-generated and thus inevitably contain many types of noise. The new model, termed as TE-LDA, is developed from the traditional LDA by adding a switch variable into the term generation process in order to tackle the issue of noisy text data. Through extensive experiments, the efficacy of our proposed model is validated using both real and synthetic data sets.",
author = "Tao Yang and Dongwon Lee",
year = "2011",
month = "12",
day = "13",
doi = "10.1145/2063576.2063962",
language = "English (US)",
isbn = "9781450307178",
series = "International Conference on Information and Knowledge Management, Proceedings",
pages = "2345--2348",
booktitle = "CIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management",
}

Yang, T & Lee, D 2011, Towards noise-resilient document modeling. in CIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management. International Conference on Information and Knowledge Management, Proceedings, pp. 2345-2348, 20th ACM Conference on Information and Knowledge Management, CIKM'11, Glasgow, United Kingdom, 10/24/11. https://doi.org/10.1145/2063576.2063962

Towards noise-resilient document modeling. / Yang, Tao; Lee, Dongwon.

CIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management. 2011. p. 2345-2348 (International Conference on Information and Knowledge Management, Proceedings).

TY - GEN

T1 - Towards noise-resilient document modeling

AU - Yang, Tao

AU - Lee, Dongwon

PY - 2011/12/13

Y1 - 2011/12/13

AB - We introduce a generative probabilistic document model based on latent Dirichlet allocation (LDA), to deal with textual errors in the document collection. Our model is inspired by the fact that most large-scale text data are machine-generated and thus inevitably contain many types of noise. The new model, termed as TE-LDA, is developed from the traditional LDA by adding a switch variable into the term generation process in order to tackle the issue of noisy text data. Through extensive experiments, the efficacy of our proposed model is validated using both real and synthetic data sets.

UR - http://www.scopus.com/inward/record.url?scp=83055165905&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=83055165905&partnerID=8YFLogxK

U2 - 10.1145/2063576.2063962

DO - 10.1145/2063576.2063962

M3 - Conference contribution

AN - SCOPUS:83055165905

SN - 9781450307178

T3 - International Conference on Information and Knowledge Management, Proceedings

SP - 2345

EP - 2348

BT - CIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management

ER -

Yang T, Lee D. Towards noise-resilient document modeling. In CIKM'11 - Proceedings of the 2011 ACM International Conference on Information and Knowledge Management. 2011. p. 2345-2348. (International Conference on Information and Knowledge Management, Proceedings). https://doi.org/10.1145/2063576.2063962