TY - GEN
T1 - On handling textual errors in latent document modeling
AU - Yang, Tao
AU - Lee, Dongwon
PY - 2013
Y1 - 2013
N2 - As large-scale text data become available on the Web, textual errors in a corpus are often inevitable (e.g., digitizing historic documents). Due to the calculation of frequencies of words, however, such textual errors can significantly impact the accuracy of statistical models such as the popular Latent Dirichlet Allocation (LDA) model. To address such an issue, in this paper, we propose two novel extensions to LDA (i.e., TE-LDA and TDE-LDA): (1) The TE-LDA model incorporates textual errors into term generation process; and (2) The TDE-LDA model extends TE-LDA further by taking into account topic dependency to leverage on semantic connections among consecutive words even if parts are typos. Using both real and synthetic data sets with varying degrees of "errors", our TDE-LDA model outperforms: (1) the traditional LDA model by 16%-39% (real) and 20%-63% (synthetic); and (2) the state-of-the-art N-Grams model by 11%-27% (real) and 16%-54% (synthetic).
AB - As large-scale text data become available on the Web, textual errors in a corpus are often inevitable (e.g., digitizing historic documents). Due to the calculation of frequencies of words, however, such textual errors can significantly impact the accuracy of statistical models such as the popular Latent Dirichlet Allocation (LDA) model. To address such an issue, in this paper, we propose two novel extensions to LDA (i.e., TE-LDA and TDE-LDA): (1) The TE-LDA model incorporates textual errors into term generation process; and (2) The TDE-LDA model extends TE-LDA further by taking into account topic dependency to leverage on semantic connections among consecutive words even if parts are typos. Using both real and synthetic data sets with varying degrees of "errors", our TDE-LDA model outperforms: (1) the traditional LDA model by 16%-39% (real) and 20%-63% (synthetic); and (2) the state-of-the-art N-Grams model by 11%-27% (real) and 16%-54% (synthetic).
UR - http://www.scopus.com/inward/record.url?scp=84889578031&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84889578031&partnerID=8YFLogxK
U2 - 10.1145/2505515.2505555
DO - 10.1145/2505515.2505555
M3 - Conference contribution
AN - SCOPUS:84889578031
SN - 9781450322638
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 2089
EP - 2098
BT - CIKM 2013 - Proceedings of the 22nd ACM International Conference on Information and Knowledge Management
T2 - 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013
Y2 - 27 October 2013 through 1 November 2013
ER -