We propose a class-based mixture of topic models for classifying documents using both labeled and unlabeled examples (i.e., in a semi-supervised fashion). Most topic models incorporate documents' class labels by generating them after generating the words. In these models, the training class labels have little effect on the estimated topics, since each label is effectively treated as just one more word among a huge set of word features. In this paper, we propose to increase the influence of class labels on topic models by generating the words in each document conditioned on the class label. We show that our generative process improves classification performance at the cost of only a small loss in test-set log-likelihood. Within our framework, we provide a principled mechanism to control the contributions of the class labels and the word space to the likelihood function. Experiments show that our approach achieves better classification accuracy than several standard semi-supervised and supervised topic models.
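The core idea, that words are generated conditioned on the class label and that a tunable weight balances the label and word-space contributions to the likelihood, can be illustrated with a simplified sketch. The code below is not the paper's model; it is a class-conditional unigram mixture trained with semi-supervised EM, where a hypothetical tempering weight `lam` down-weights the word likelihood relative to the class prior, loosely mirroring the control mechanism the abstract describes.

```python
import numpy as np

def train_semisup(X_lab, y_lab, X_unlab, n_classes, lam=1.0, n_iter=20, alpha=1e-2):
    """Semi-supervised EM for a class-conditional unigram mixture.

    X_lab, X_unlab: (docs, vocab) word-count matrices.
    y_lab: class labels for the labeled documents.
    lam: hypothetical tempering weight on the word log-likelihood,
         controlling how much the word space influences class posteriors.
    """
    V = X_lab.shape[1]
    prior = np.full(n_classes, 1.0 / n_classes)
    word_probs = np.full((n_classes, V), 1.0 / V)
    for _ in range(n_iter):
        # M-step sufficient statistics start from the labeled docs (hard labels).
        class_counts = np.bincount(y_lab, minlength=n_classes).astype(float)
        word_counts = np.zeros((n_classes, V))
        for c in range(n_classes):
            word_counts[c] = X_lab[y_lab == c].sum(axis=0)
        if len(X_unlab):
            # E-step: class posteriors for unlabeled docs; word likelihood
            # is tempered by lam before combining with the class prior.
            log_post = np.log(prior)[None, :] + lam * (X_unlab @ np.log(word_probs).T)
            log_post -= log_post.max(axis=1, keepdims=True)
            post = np.exp(log_post)
            post /= post.sum(axis=1, keepdims=True)
            class_counts += post.sum(axis=0)
            word_counts += post.T @ X_unlab
        # M-step: re-estimate parameters with Laplace smoothing.
        prior = class_counts / class_counts.sum()
        word_probs = (word_counts + alpha) / (word_counts + alpha).sum(axis=1, keepdims=True)
    return prior, word_probs

def predict(X, prior, word_probs, lam=1.0):
    # Classify by the tempered class-conditional log-likelihood plus log prior.
    scores = np.log(prior)[None, :] + lam * (X @ np.log(word_probs).T)
    return scores.argmax(axis=1)
```

Setting `lam` below 1 shrinks the word-space contribution, so the labeled examples and class prior dominate; `lam = 1` recovers the ordinary joint likelihood of this toy model.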