Topic discovery for short texts using word embeddings

Guangxu Xun, Vishrawas Gopalakrishnan, Fenglong Ma, Jing Gao, Aidong Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

16 Citations (Scopus)

Abstract

Discovering topics in short texts, such as news titles and tweets, has become an important task for many content analysis applications. However, due to the lack of rich context information in short texts, the performance of conventional topic models on short texts is usually unsatisfactory. In this paper, we propose a novel topic model for short text corpora using word embeddings. Continuous space word embeddings, which have proven effective at capturing regularities in language, are incorporated into our model to provide additional semantics. Thus we model each short document as a Gaussian topic over word embeddings in the vector space. In addition, considering that background words in a short text are usually not semantically related, we introduce a discrete background mode over word types to complement the continuous Gaussian topics. We evaluate our model on news titles from data sources such as abcnews, showing that it extracts more coherent topics from short texts than the baseline methods and learns a better topic representation for each short document.
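
The combination described in the abstract, a continuous Gaussian topic over word embeddings alongside a discrete background distribution over word types, can be illustrated with a toy calculation. The sketch below is not the authors' model or inference procedure: the two-dimensional embeddings, the diagonal covariance, the background probabilities, the mixing weight `p_background`, and the function names are all hypothetical values chosen for illustration. It only shows how one might compare a Gaussian density over a word's embedding with a background probability for the same word.

```python
import math


def gaussian_log_density(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def topic_posterior(word, embeddings, topic_mean, topic_var, background, p_background):
    """Posterior weight that `word` is explained by the Gaussian topic
    rather than by the discrete background distribution (toy two-way mixture)."""
    log_topic = math.log(1.0 - p_background) + gaussian_log_density(
        embeddings[word], topic_mean, topic_var
    )
    log_bg = math.log(p_background) + math.log(background[word])
    # Normalise in log space to avoid underflow for very unlikely words.
    top = max(log_topic, log_bg)
    w_topic = math.exp(log_topic - top)
    w_bg = math.exp(log_bg - top)
    return w_topic / (w_topic + w_bg)


# Hypothetical 2-d embeddings: two politics-like words and one generic word.
embeddings = {
    "election": [0.9, 0.1],
    "vote": [0.8, 0.2],
    "says": [-0.5, -0.7],
}
topic_mean = [0.85, 0.15]   # assumed centre of the topic in embedding space
topic_var = [0.05, 0.05]    # assumed per-dimension variance
background = {"election": 0.1, "vote": 0.1, "says": 0.8}  # assumed word-type probabilities
p_background = 0.3          # assumed prior chance that a word is background

for word in embeddings:
    weight = topic_posterior(
        word, embeddings, topic_mean, topic_var, background, p_background
    )
    print(f"{word}: topic weight {weight:.3f}")
```

In this toy setting, words whose embeddings lie near the topic mean receive a high topic weight, while a generic word far from the mean is absorbed by the background distribution. Comparing a continuous density with a discrete probability is a modelling simplification here; the paper itself should be consulted for the actual formulation.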

Original language: English (US)
Title of host publication: Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016
Editors: Francesco Bonchi, Xindong Wu, Ricardo Baeza-Yates, Josep Domingo-Ferrer, Zhi-Hua Zhou
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1299-1304
Number of pages: 6
ISBN (Electronic): 9781509054725
DOI: https://doi.org/10.1109/ICDM.2016.33
State: Published - Jan 31 2017
Event: 16th IEEE International Conference on Data Mining, ICDM 2016 - Barcelona, Catalonia, Spain
Duration: Dec 12 2016 to Dec 15 2016

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print): 1550-4786

Other

Other: 16th IEEE International Conference on Data Mining, ICDM 2016
Country: Spain
City: Barcelona, Catalonia
Period: 12/12/16 to 12/15/16

Fingerprint

Vector spaces
Semantics

All Science Journal Classification (ASJC) codes

  • Engineering(all)

Cite this

Xun, G., Gopalakrishnan, V., Ma, F., Gao, J., & Zhang, A. (2017). Topic discovery for short texts using word embeddings. In F. Bonchi, X. Wu, R. Baeza-Yates, J. Domingo-Ferrer, & Z-H. Zhou (Eds.), Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016 (pp. 1299-1304). [7837989] (Proceedings - IEEE International Conference on Data Mining, ICDM). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICDM.2016.33
Xun, Guangxu ; Gopalakrishnan, Vishrawas ; Ma, Fenglong ; Gao, Jing ; Zhang, Aidong. / Topic discovery for short texts using word embeddings. Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016. editor / Francesco Bonchi ; Xindong Wu ; Ricardo Baeza-Yates ; Josep Domingo-Ferrer ; Zhi-Hua Zhou. Institute of Electrical and Electronics Engineers Inc., 2017. pp. 1299-1304 (Proceedings - IEEE International Conference on Data Mining, ICDM).
@inproceedings{7a61a6e007a340b6b6b903844de7a2b5,
title = "Topic discovery for short texts using word embeddings",
abstract = "Discovering topics in short texts, such as news titles and tweets, has become an important task for many content analysis applications. However, due to the lack of rich context information in short texts, the performance of conventional topic models on short texts is usually unsatisfactory. In this paper, we propose a novel topic model for short text corpora using word embeddings. Continuous space word embeddings, which have proven effective at capturing regularities in language, are incorporated into our model to provide additional semantics. Thus we model each short document as a Gaussian topic over word embeddings in the vector space. In addition, considering that background words in a short text are usually not semantically related, we introduce a discrete background mode over word types to complement the continuous Gaussian topics. We evaluate our model on news titles from data sources such as abcnews, showing that it extracts more coherent topics from short texts than the baseline methods and learns a better topic representation for each short document.",
author = "Guangxu Xun and Vishrawas Gopalakrishnan and Fenglong Ma and Jing Gao and Aidong Zhang",
year = "2017",
month = "1",
day = "31",
doi = "10.1109/ICDM.2016.33",
language = "English (US)",
series = "Proceedings - IEEE International Conference on Data Mining, ICDM",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "1299--1304",
editor = "Francesco Bonchi and Xindong Wu and Ricardo Baeza-Yates and Josep Domingo-Ferrer and Zhi-Hua Zhou",
booktitle = "Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016",
address = "United States",

}

Xun, G, Gopalakrishnan, V, Ma, F, Gao, J & Zhang, A 2017, Topic discovery for short texts using word embeddings. in F Bonchi, X Wu, R Baeza-Yates, J Domingo-Ferrer & Z-H Zhou (eds), Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016, 7837989, Proceedings - IEEE International Conference on Data Mining, ICDM, Institute of Electrical and Electronics Engineers Inc., pp. 1299-1304, 16th IEEE International Conference on Data Mining, ICDM 2016, Barcelona, Catalonia, Spain, 12/12/16. https://doi.org/10.1109/ICDM.2016.33

Topic discovery for short texts using word embeddings. / Xun, Guangxu; Gopalakrishnan, Vishrawas; Ma, Fenglong; Gao, Jing; Zhang, Aidong.

Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016. ed. / Francesco Bonchi; Xindong Wu; Ricardo Baeza-Yates; Josep Domingo-Ferrer; Zhi-Hua Zhou. Institute of Electrical and Electronics Engineers Inc., 2017. p. 1299-1304 7837989 (Proceedings - IEEE International Conference on Data Mining, ICDM).

TY - GEN

T1 - Topic discovery for short texts using word embeddings

AU - Xun, Guangxu

AU - Gopalakrishnan, Vishrawas

AU - Ma, Fenglong

AU - Gao, Jing

AU - Zhang, Aidong

PY - 2017/1/31

Y1 - 2017/1/31

AB - Discovering topics in short texts, such as news titles and tweets, has become an important task for many content analysis applications. However, due to the lack of rich context information in short texts, the performance of conventional topic models on short texts is usually unsatisfactory. In this paper, we propose a novel topic model for short text corpora using word embeddings. Continuous space word embeddings, which have proven effective at capturing regularities in language, are incorporated into our model to provide additional semantics. Thus we model each short document as a Gaussian topic over word embeddings in the vector space. In addition, considering that background words in a short text are usually not semantically related, we introduce a discrete background mode over word types to complement the continuous Gaussian topics. We evaluate our model on news titles from data sources such as abcnews, showing that it extracts more coherent topics from short texts than the baseline methods and learns a better topic representation for each short document.

UR - http://www.scopus.com/inward/record.url?scp=85014568942&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85014568942&partnerID=8YFLogxK

U2 - 10.1109/ICDM.2016.33

DO - 10.1109/ICDM.2016.33

M3 - Conference contribution

AN - SCOPUS:85014568942

T3 - Proceedings - IEEE International Conference on Data Mining, ICDM

SP - 1299

EP - 1304

BT - Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016

A2 - Bonchi, Francesco

A2 - Wu, Xindong

A2 - Baeza-Yates, Ricardo

A2 - Domingo-Ferrer, Josep

A2 - Zhou, Zhi-Hua

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Xun G, Gopalakrishnan V, Ma F, Gao J, Zhang A. Topic discovery for short texts using word embeddings. In Bonchi F, Wu X, Baeza-Yates R, Domingo-Ferrer J, Zhou Z-H, editors, Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016. Institute of Electrical and Electronics Engineers Inc. 2017. p. 1299-1304. 7837989. (Proceedings - IEEE International Conference on Data Mining, ICDM). https://doi.org/10.1109/ICDM.2016.33