A mixture model framework for class discovery and outlier detection in mixed labeled/unlabeled data sets

David Jonathan Miller, John Browning

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Several authors have addressed learning a classifier given a mixed labeled/unlabeled training set. These works assume each unlabeled sample originates from one of the (known) classes. Here, we consider the scenario in which unlabeled points may belong either to known/predefined or to heretofore undiscovered classes. There are several practical situations where such data may arise. We earlier proposed a novel statistical mixture model to fit this mixed data. Here we review this method and also introduce an alternative model. Our fundamental strategy is to view as observed data not only the feature vector and the class label, but also the fact of label presence/absence for each point. Two types of mixture components are posited to explain label presence/absence. "Predefined" components generate both labeled and unlabeled points and assume labels are missing at random. These components represent the known classes. "Non-predefined" components only generate unlabeled points; thus, in localized regions, they capture data subsets that are exclusively unlabeled. Such subsets may represent an outlier distribution, or new classes. The components' predefined/non-predefined natures are data-driven, learned along with the other parameters via an algorithm based on expectation-maximization (EM). There are three natural applications: 1) robust classifier design, given a mixed training set with outliers; 2) classification with rejections; 3) identification of the unlabeled points (and their representative components) that originate from unknown classes, i.e., new class discovery. The effectiveness of our models in discovering purely unlabeled data components (potential new classes) is evaluated both on synthetic and real data sets. Although each of our models has its own advantages, our original model is found to achieve the best class discovery results.
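
    The mechanism described in the abstract lends itself to a short illustration. The sketch below (plain Python/NumPy, written for this page rather than taken from the paper) fits a Gaussian mixture to mixed labeled/unlabeled data with an EM loop in which labeled points are additionally explained through per-component class posteriors; after fitting, components that take essentially no responsibility for labeled points are flagged as "non-predefined" (candidate outlier or new-class components). In the paper the predefined/non-predefined indicators are learned jointly within EM and the label-missingness mechanism is modeled explicitly; both aspects are simplified here, and all names and thresholds (fit_mixture, gaussian_pdf, the 1e-3 cutoff, etc.) are illustrative assumptions, not the authors' implementation.

# Minimal sketch (simplified, not the paper's exact algorithm) of EM for a
# Gaussian mixture over mixed labeled/unlabeled data. Components that end up
# explaining only unlabeled points are flagged as "non-predefined".
import numpy as np

def gaussian_pdf(X, mean, var):
    # Isotropic Gaussian density evaluated at each row of X.
    d = X.shape[1]
    diff = X - mean
    norm = (2.0 * np.pi * var) ** (-d / 2.0)
    return norm * np.exp(-0.5 * np.sum(diff ** 2, axis=1) / var)

def fit_mixture(X, y, n_classes, K=6, n_iter=100, seed=0):
    # X: (N, d) features; y: length-N int array of class labels, -1 where unlabeled.
    rng = np.random.default_rng(seed)
    N, d = X.shape
    labeled = y >= 0

    mu = X[rng.choice(N, size=K, replace=False)].copy()   # component means
    var = np.full(K, X.var())                              # isotropic variances
    alpha = np.full(K, 1.0 / K)                            # mixing weights
    beta = np.full((K, n_classes), 1.0 / n_classes)        # P(class | component)

    for _ in range(n_iter):
        # E-step: responsibility r[i, k] is proportional to alpha_k * N(x_i | mu_k, var_k);
        # labeled points are further weighted by the component's posterior on their label.
        dens = np.stack([gaussian_pdf(X, mu[k], var[k]) for k in range(K)], axis=1)
        r = alpha * dens
        r[labeled] *= beta[:, y[labeled]].T
        r /= r.sum(axis=1, keepdims=True) + 1e-300

        # M-step: re-estimate mixing weights, means, variances, and class posteriors.
        Nk = r.sum(axis=0) + 1e-300
        alpha = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            var[k] = np.sum(r[:, k] * np.sum((X - mu[k]) ** 2, axis=1)) / (d * Nk[k]) + 1e-6
        for c in range(n_classes):
            beta[:, c] = r[labeled][y[labeled] == c].sum(axis=0)
        beta /= beta.sum(axis=1, keepdims=True) + 1e-300

    # Data-driven flag: a component whose responsibility over labeled points is
    # negligible relative to its total mass explains exclusively unlabeled data.
    labeled_mass = r[labeled].sum(axis=0)
    predefined = labeled_mass > 1e-3 * Nk
    return mu, var, alpha, beta, predefined

    A call such as fit_mixture(X, y, n_classes=3), with y set to -1 for unlabeled points, would return the boolean predefined flags whose False entries mark components covering exclusively unlabeled regions, i.e., the potential outlier distributions or new classes discussed in the abstract.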

    Original language: English (US)
    Title of host publication: 2003 IEEE 13th Workshop on Neural Networks for Signal Processing, NNSP 2003
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 489-498
    Number of pages: 10
    ISBN (Electronic): 0780381777
    DOI: 10.1109/NNSP.2003.1318048
    State: Published - Jan 1 2003
    Event: 13th IEEE Workshop on Neural Networks for Signal Processing, NNSP 2003 - Toulouse, France
    Duration: Sep 17 2003 - Sep 19 2003

    Publication series

    Name: Neural Networks for Signal Processing - Proceedings of the IEEE Workshop
    Volume: 2003-January

    Other

    Other: 13th IEEE Workshop on Neural Networks for Signal Processing, NNSP 2003
    Country: France
    City: Toulouse
    Period: 9/17/03 - 9/19/03

    Fingerprint

    Labels
    Classifiers
    Data acquisition

    All Science Journal Classification (ASJC) codes

    • Electrical and Electronic Engineering
    • Artificial Intelligence
    • Software
    • Computer Networks and Communications
    • Signal Processing

    Cite this

    Miller, D. J., & Browning, J. (2003). A mixture model framework for class discovery and outlier detection in mixed labeled/unlabeled data sets. In 2003 IEEE 13th Workshop on Neural Networks for Signal Processing, NNSP 2003 (pp. 489-498). [1318048] (Neural Networks for Signal Processing - Proceedings of the IEEE Workshop; Vol. 2003-January). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/NNSP.2003.1318048