On handling textual errors in latent document modeling

Tao Yang, Dongwon Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As large-scale text data become available on the Web, textual errors in a corpus are often inevitable (e.g., when digitizing historic documents). Because statistical models such as the popular Latent Dirichlet Allocation (LDA) model rely on word frequency counts, such textual errors can significantly impact their accuracy. To address this issue, in this paper we propose two novel extensions to LDA (i.e., TE-LDA and TDE-LDA): (1) the TE-LDA model incorporates textual errors into the term generation process; and (2) the TDE-LDA model extends TE-LDA further by taking topic dependency into account, leveraging semantic connections among consecutive words even when some of them are typos. Using both real and synthetic data sets with varying degrees of "errors", our TDE-LDA model outperforms: (1) the traditional LDA model by 16%-39% (real) and 20%-63% (synthetic); and (2) the state-of-the-art N-Grams model by 11%-27% (real) and 16%-54% (synthetic).
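
The abstract only sketches the models at a high level. As a rough, hypothetical illustration of the idea behind TE-LDA, the snippet below forward-samples standard LDA with an added per-word Bernoulli "error switch" that decides whether a token is generated by its topic or drawn from an error distribution. The hyperparameters, the uniform error distribution, and the Beta prior on the per-document error probability are assumptions made for illustration, not the parameterization used in the paper.

```python
import numpy as np

# Hypothetical sketch of a TE-LDA-style generative process: standard LDA,
# plus a per-word "error switch" deciding whether a token comes from its
# topic or from an error (typo) distribution. All priors below are assumed
# for illustration only.

rng = np.random.default_rng(0)

K, V, D, N = 5, 1000, 3, 50          # topics, vocabulary size, docs, words per doc
alpha, beta = 0.1, 0.01              # symmetric Dirichlet priors (assumed)
error_rate_prior = (1.0, 20.0)       # Beta prior on per-doc error probability (assumed)

phi = rng.dirichlet([beta] * V, size=K)   # topic-word distributions
error_dist = np.full(V, 1.0 / V)          # typos drawn uniformly over the vocabulary (assumed)

corpus = []
for d in range(D):
    theta = rng.dirichlet([alpha] * K)    # document-topic distribution
    pi = rng.beta(*error_rate_prior)      # document-level error probability
    doc = []
    for _ in range(N):
        z = rng.choice(K, p=theta)        # sample a topic for this word
        is_error = rng.random() < pi      # error switch
        if is_error:
            w = rng.choice(V, p=error_dist)   # token is a textual error
        else:
            w = rng.choice(V, p=phi[z])       # token generated by its topic
        doc.append((w, z, is_error))
    corpus.append(doc)

print("errors in doc 0:", sum(e for _, _, e in corpus[0]))
```

TDE-LDA, as described in the abstract, additionally conditions on topic dependency between consecutive words; that extension is not reflected in this sketch.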

Original language: English (US)
Title of host publication: CIKM 2013 - Proceedings of the 22nd ACM International Conference on Information and Knowledge Management
Pages: 2089-2098
Number of pages: 10
ISBN (Print): 9781450322638
DOI: https://doi.org/10.1145/2505515.2505555
State: Published - Dec 11, 2013
Event: 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013 - San Francisco, CA, United States
Duration: Oct 27, 2013 - Nov 1, 2013

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Other

Other: 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013
Country: United States
City: San Francisco, CA
Period: 10/27/13 - 11/1/13

Fingerprint

  • Modeling
  • Dirichlet
  • Statistical model
  • World Wide Web
  • Leverage

All Science Journal Classification (ASJC) codes

  • Decision Sciences (all)
  • Business, Management and Accounting (all)

Cite this

Yang, T., & Lee, D. (2013). On handling textual errors in latent document modeling. In CIKM 2013 - Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (pp. 2089-2098). (International Conference on Information and Knowledge Management, Proceedings). https://doi.org/10.1145/2505515.2505555