Determining gains acquired from word embedding quantitatively using discrete distribution clustering

Jianbo Ye, Yanran Li, Zhaohui Wu, James Z. Wang, Wenjie Li, Jia Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Word embeddings have become widely used in document analysis. While a large number of models for mapping words to vector spaces have been developed, it remains undetermined how much net gain can be achieved over traditional approaches based on bag-of-words. In this paper, we propose a new document clustering approach by combining any word embedding with a state-of-the-art algorithm for clustering empirical distributions. By using the Wasserstein distance between distributions, the word-to-word semantic relationship is taken into account in a principled way. The new clustering method is easy to use and consistently outperforms other methods on a variety of data sets. More importantly, the method provides an effective framework for determining when and how much word embeddings contribute to document analysis. Experimental results with multiple embedding models are reported.
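To make the idea concrete, the sketch below shows the building block the abstract describes: a document is represented as a discrete distribution over word embedding vectors, and two documents are compared with the Wasserstein distance between those distributions. This is a minimal illustration under stated assumptions, not the authors' implementation: the tiny two-dimensional "embeddings" and the example documents are made up for the demo, and the POT (Python Optimal Transport) package stands in for whatever Wasserstein solver the paper uses.

```python
# Minimal sketch: documents as discrete distributions over word embeddings,
# compared with the Wasserstein (earth mover's) distance.
# Assumptions: toy 2-D "embeddings" and toy documents; a real pipeline would
# plug in trained word vectors (word2vec, GloVe, etc.) instead.
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

# Hypothetical embedding lookup (word -> vector); stands in for a trained model.
emb = {
    "president": np.array([0.9, 0.1]),
    "leader":    np.array([0.8, 0.2]),
    "election":  np.array([0.7, 0.3]),
    "soccer":    np.array([0.1, 0.9]),
    "match":     np.array([0.2, 0.8]),
}

def doc_to_distribution(tokens):
    """Turn a token list into (support points, weights): each distinct word is a
    support point located at its embedding, weighted by its relative frequency."""
    words, counts = np.unique(tokens, return_counts=True)
    support = np.stack([emb[w] for w in words])
    weights = counts / counts.sum()
    return support, weights

def wasserstein(doc_a, doc_b):
    """Wasserstein distance between the word distributions of two documents."""
    xa, wa = doc_to_distribution(doc_a)
    xb, wb = doc_to_distribution(doc_b)
    cost = ot.dist(xa, xb, metric="euclidean")  # word-to-word ground distances
    return ot.emd2(wa, wb, cost)                # optimal transport cost

d_near = wasserstein(["president", "election"], ["leader", "election", "election"])
d_far  = wasserstein(["president", "election"], ["soccer", "match"])
print(f"politics vs politics: {d_near:.3f}")  # small: the embeddings overlap
print(f"politics vs sports:   {d_far:.3f}")   # large: the embeddings are far apart
```

A full pipeline along the lines of the paper would then cluster these distributions directly, e.g. by alternating between assigning each document to its nearest cluster centroid under this distance and re-estimating the centroids as Wasserstein barycenters; that clustering step is what the proposed method contributes and is not reproduced here.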

Original language: English (US)
Title of host publication: ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
Publisher: Association for Computational Linguistics (ACL)
Pages: 1847-1856
Number of pages: 10
ISBN (Electronic): 9781945626753
DOI: https://doi.org/10.18653/v1/P17-1169
State: Published - Jan 1 2017
Event: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017 - Vancouver, Canada
Duration: Jul 30 2017 - Aug 4 2017

Publication series

Name: ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
Volume: 1

Other

Other: 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017
Country: Canada
City: Vancouver
Period: 7/30/17 - 8/4/17

Fingerprint

  • Document analysis
  • Vector spaces
  • Semantics

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Artificial Intelligence
  • Software
  • Linguistics and Language

Cite this

Ye, J., Li, Y., Wu, Z., Wang, J. Z., Li, W., & Li, J. (2017). Determining gains acquired from word embedding quantitatively using discrete distribution clustering. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (pp. 1847-1856). (ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers); Vol. 1). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P17-1169