Learning parameters of the K-means algorithm from subjective human annotation

Haimonti Dutta, Rebecca J. Passonneau, Austin Lee, Axinia Radeva, Boyi Xie, David Waltz, Barbara Taranto

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Scopus citations

Abstract

The New York Public Library is participating in the Chronicling America initiative to develop an online searchable database of historically significant newspaper articles. Microfilm copies of the papers are scanned and high-resolution OCR software is run on them. The text from the OCR provides a wealth of data and opinion for researchers and historians. However, the categorization of articles provided by the OCR engine is rudimentary, and a large number of the articles are labeled "editorial" without further categorization. To provide a more refined grouping of articles, unsupervised machine learning algorithms (such as K-Means) are being investigated. The K-Means algorithm requires tuning of parameters such as the number of clusters and the seeding mechanism to ensure that the search is not prone to being caught in a local minimum. We designed a pilot study to observe whether humans are adept at finding sub-categories. The subjective labels provided by humans are used as a guide to compare the performance of the automated clustering techniques. In addition, seeds provided by annotators are carefully incorporated into a semi-supervised K-Means algorithm (Seeded K-Means); empirical results indicate that this helps to improve performance and provides an intuitive sub-categorization of the articles labeled "editorial" by the OCR engine.
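The seeding idea in the abstract can be illustrated with a minimal sketch. It assumes that each annotator-labeled article is already represented as a numeric feature vector, and that seeds are used only to initialise the centroids (one centroid per annotated sub-category, computed as the mean of its seed points), after which ordinary K-Means iterations run and may reassign any point. The function name `seeded_kmeans`, the helper `_mean`, and the category labels are hypothetical; the abstract does not describe the authors' implementation, features, or distance measure, so Euclidean distance is an assumption here.

```python
import math


def _mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def seeded_kmeans(points, seeds, max_iter=100):
    """Illustrative Seeded K-Means (hypothetical helper, not the authors' code).

    points: list of equal-length numeric tuples (one per article).
    seeds:  dict mapping a category label to a non-empty list of indices
            into `points` that an annotator assigned to that category.
    Returns (labels, centroids), where labels[i] is the category of points[i].
    """
    categories = sorted(seeds)
    # Seeding step: the number of clusters and the initial centroids both
    # come from the human annotation rather than from random initialisation.
    centroids = [_mean([points[i] for i in seeds[c]]) for c in categories]

    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid under Euclidean distance (assumed).
        new_assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: recompute each centroid; keep the old one if a cluster is empty.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = _mean(members)

    return [categories[a] for a in assignment], centroids


# Toy data: two well-separated groups, one seed per hypothetical sub-category.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_seeds = {"politics": [0], "sports": [3]}
toy_labels, toy_centroids = seeded_kmeans(toy_points, toy_seeds)
```

Because the seeds fix both the number of clusters and the starting centroids, the result is deterministic for a given annotation, which is one way to reduce the sensitivity to initialisation that the abstract mentions.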

Original language: English (US)
Title of host publication: Proceedings of the 24th International Florida Artificial Intelligence Research Society, FLAIRS - 24
Pages: 465-470
Number of pages: 6
State: Published - Sep 9 2011
Event: 24th International Florida Artificial Intelligence Research Society, FLAIRS - 24 - Palm Beach, FL, United States
Duration: May 18 2011 to May 20 2011

Publication series

Name: Proceedings of the 24th International Florida Artificial Intelligence Research Society, FLAIRS - 24

Other

Other: 24th International Florida Artificial Intelligence Research Society, FLAIRS - 24
Country: United States
City: Palm Beach, FL
Period: 5/18/11 to 5/20/11

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
