Exploiting the value of class labels on high-dimensional feature spaces: topic models for semi-supervised document classification

Hossein Soleimani, David Jonathan Miller

Research output: Contribution to journal (Article)


We propose a class-based mixture of topic models for classifying documents using both labeled and unlabeled examples (i.e., in a semi-supervised fashion). Most topic models incorporate documents’ class labels by generating them after generating the words. In these models, the training class labels have little effect on the estimated topics, as they are effectively treated as just another word among a huge set of word features. In this paper, we propose to increase the influence of class labels on topic models by generating the words in each document conditioned on the class label. We show that our specific generative process improves classification performance with only a small loss in test set log-likelihood. Within our framework, we provide a principled mechanism to control the contributions of the class labels and the word space to the likelihood function. Experiments show that our approach achieves higher classification accuracy than some standard semi-supervised and supervised topic models.
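
The idea of generating words conditioned on the class label can be illustrated with a toy sketch. This is not the authors' model: it assumes class-level topic weights rather than per-document topic proportions, and the vocabulary, topics, priors, and function names (`generate_document`, `word_prob`, `classify`) are all hypothetical values chosen for illustration.

```python
import math
import random

# Hypothetical toy parameters (not taken from the paper).
VOCAB = ["goal", "match", "vote", "election"]
TOPICS = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0: favors sports-like words
    [0.05, 0.05, 0.45, 0.45],  # topic 1: favors politics-like words
]
CLASS_PRIOR = {"sports": 0.5, "politics": 0.5}
# Assumption: each class has its own mixing weights over the shared topics.
CLASS_TOPIC_WEIGHTS = {
    "sports": [0.9, 0.1],
    "politics": [0.1, 0.9],
}


def generate_document(n_words, rng):
    """Sample the label first, then every word conditioned on that label."""
    labels = list(CLASS_PRIOR)
    label = rng.choices(labels, weights=[CLASS_PRIOR[c] for c in labels])[0]
    words = []
    for _ in range(n_words):
        # The topic choice depends on the class label.
        topic = rng.choices(range(len(TOPICS)),
                            weights=CLASS_TOPIC_WEIGHTS[label])[0]
        words.append(rng.choices(VOCAB, weights=TOPICS[topic])[0])
    return label, words


def word_prob(word, label):
    """p(word | label), marginalizing over the class-specific topic weights."""
    idx = VOCAB.index(word)
    weights = CLASS_TOPIC_WEIGHTS[label]
    return sum(w * TOPICS[k][idx] for k, w in enumerate(weights))


def classify(words):
    """Pick the label maximizing log p(label) + sum of log p(word | label)."""
    scores = {
        c: math.log(CLASS_PRIOR[c])
        + sum(math.log(word_prob(w, c)) for w in words)
        for c in CLASS_PRIOR
    }
    return max(scores, key=scores.get)
```

Because every word is drawn through the label, the label shapes the whole word likelihood instead of contributing a single extra term, which is the intuition the abstract describes. Calling `classify(["goal", "match", "goal"])` scores each class by Bayes' rule under these toy parameters.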

Original language: English (US)
Pages (from-to): 299-309
Number of pages: 11
Journal: Pattern Analysis and Applications
Issue number: 2
Publication status: Published - May 1 2019


All Science Journal Classification (ASJC) codes

  • Computer Vision and Pattern Recognition
  • Artificial Intelligence
