Semi-supervised multi-label topic models for document classification and sentence labeling

Hossein Soleimani, David J. Miller

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    11 Scopus citations

    Abstract

    Extracting parts of a text document relevant to a class label is a critical information retrieval task. We propose a semi-supervised multi-label topic model for jointly achieving document and sentence-level class inferences. Under our model, each sentence is associated with only a subset of the document's labels (including possibly none of them), and the label set of the document is the union of the labels of all of its sentences. For training, we use both labeled documents and, typically, a larger set of unlabeled documents. Our model, in a semi-supervised fashion, discovers the topics present, learns associations between topics and class labels, predicts labels for new (or unlabeled) documents, and determines label associations for each sentence in every document. For learning, our model does not require any ground-truth labels on sentences. We develop a Hamiltonian Monte Carlo based algorithm for efficiently sampling from the joint label distribution over all sentences, a very high-dimensional discrete space. Our experiments show that our approach outperforms several benchmark methods with respect to both document and sentence-level classification, as well as test set log-likelihood. All code for replicating our experiments is available from https://github.com/hsoleimani/MLTM.
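
The label structure described in the abstract (each sentence carries a subset of the document's labels, possibly none, and the document's label set is the union over its sentences) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the linked repository; the function name `document_labels` and the example labels are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def document_labels(sentence_labels: List[Set[str]]) -> Set[str]:
    """Return the document label set as the union of per-sentence label sets.

    Hypothetical helper: it only mirrors the structural constraint stated in
    the abstract, not the inference procedure of the model.
    """
    doc: Set[str] = set()
    for labels in sentence_labels:
        doc |= labels  # a sentence may contribute zero, one, or several labels
    return doc


# Hypothetical example: three sentences, the second one has no label.
sentences = [{"sports"}, set(), {"sports", "politics"}]
doc = document_labels(sentences)

# Every sentence label set is a subset of the document label set.
assert all(labels <= doc for labels in sentences)
```

In the setting the abstract describes, only document-level labels are observed during training, so the per-sentence sets above are what the model has to infer.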

    Original language: English (US)
    Title of host publication: CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management
    Publisher: Association for Computing Machinery
    Pages: 105-114
    Number of pages: 10
    ISBN (Electronic): 9781450340731
    DOI: https://doi.org/10.1145/2983323.2983752
    State: Published - Oct 24 2016
    Event: 25th ACM International Conference on Information and Knowledge Management, CIKM 2016 - Indianapolis, United States
    Duration: Oct 24 2016 to Oct 28 2016

    Publication series

    Name: International Conference on Information and Knowledge Management, Proceedings
    Volume: 24-28-October-2016

    Other

    Other: 25th ACM International Conference on Information and Knowledge Management, CIKM 2016
    Country: United States
    City: Indianapolis
    Period: 10/24/16 to 10/28/16

    All Science Journal Classification (ASJC) codes

    • Business, Management and Accounting (all)
    • Decision Sciences (all)

  • Cite this

    Soleimani, H., & Miller, D. J. (2016). Semi-supervised multi-label topic models for document classification and sentence labeling. In CIKM 2016 - Proceedings of the 2016 ACM Conference on Information and Knowledge Management (pp. 105-114). (International Conference on Information and Knowledge Management, Proceedings; Vol. 24-28-October-2016). Association for Computing Machinery. https://doi.org/10.1145/2983323.2983752