Effects of creativity and cluster tightness on short text clustering performance

Catherine Finegan-Dollak, Reed Coke, Rui Zhang, Xiangyi Ye, Dragomir Radev

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

5 Scopus citations

Abstract

Properties of corpora, such as the diversity of vocabulary and how tightly related texts cluster together, impact the best way to cluster short texts. We examine several such properties in a variety of corpora and track their effects on various combinations of similarity metrics and clustering algorithms. We show that semantic similarity metrics outperform traditional n-gram and dependency similarity metrics for k-means clustering of a linguistically creative dataset, but do not help with less creative texts. Yet the choice of similarity metric interacts with the choice of clustering method. We find that graph-based clustering methods perform well on tightly clustered data but poorly on loosely clustered data. Semantic similarity metrics generate loosely clustered output even when applied to a tightly clustered dataset. Thus, the best performing clustering systems could not use semantic metrics.

Original language: English (US)
Title of host publication: 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers
Publisher: Association for Computational Linguistics (ACL)
Pages: 654-665
Number of pages: 12
ISBN (Electronic): 9781510827585
State: Published - 2016
Event: 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Berlin, Germany
Duration: Aug 7 2016 to Aug 12 2016

Publication series

Name: 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers
Volume: 2

Other

Other: 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016
Country: Germany
City: Berlin
Period: 8/7/16 to 8/12/16

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Linguistics and Language
