A mixture model and EM algorithm for robust classification, outlier rejection, and class discovery

David Jonathan Miller, John Browning

    Research output: Contribution to journal › Article

    1 Citation (Scopus)

    Abstract

    Several authors have addressed learning a classifier given a mixed labeled/unlabeled training set. These works assume each unlabeled sample originates from one of the (known) classes. Here, we consider the scenario in which unlabeled points may belong either to known/predefined or to heretofore undiscovered classes. There are several practical situations where such data may arise. We propose a novel statistical mixture model which views as observed data not only the feature vector and the class label, but also the fact of label presence/absence for each point. Two types of mixture components are posited to explain label presence/absence. "Predefined" components generate both labeled and unlabeled points and assume labels are missing at random. "Non-predefined" components only generate unlabeled points; thus, in localized regions, they capture data subsets that are exclusively unlabeled. Such subsets may represent an outlier distribution, or new classes. The components' predefined/non-predefined natures are data-driven, learned along with the other parameters via an algorithm based on expectation-maximization (EM). There are three natural applications: 1) robust classifier design, given a mixed training set with outliers; 2) classification with rejections; 3) identification of the unlabeled points (and their representative components) that originate from unknown classes, i.e., new class discovery. We evaluate our method and alternative approaches on both synthetic and real-world data sets.
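    The abstract's generative story can be sketched as a small EM procedure. The sketch below is illustrative only, not the paper's exact model: it assumes spherical Gaussian components and replaces the paper's discrete predefined/non-predefined switch with a learned per-component labeling rate v_j = P(label observed | component j), so components whose v_j is driven toward zero play the "non-predefined" role (explaining exclusively unlabeled data such as outliers or new classes). All function and variable names are hypothetical.

    ```python
    import numpy as np

    def em_label_presence_mixture(X, y, M, n_iter=100, seed=0):
        """EM sketch: Gaussian mixture in which each component j also models
        v_j = P(label observed | component j). X: (n, d) features;
        y: length-n ints, with -1 where the label is missing."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        K = int(y.max()) + 1                       # number of known classes
        lab = y >= 0                               # label-presence indicator

        # k-means++-style seeding so component means start spread out
        mu = [X[rng.integers(n)].astype(float)]
        for _ in range(M - 1):
            d2 = ((X[:, None] - np.array(mu)[None]) ** 2).sum(-1).min(1)
            mu.append(X[rng.choice(n, p=d2 / d2.sum())].astype(float))
        mu = np.array(mu)
        var = np.full(M, X.var())                  # spherical variances
        a = np.full(M, 1.0 / M)                    # mixture weights
        v = np.full(M, 0.5)                        # per-component labeling rate
        beta = np.full((M, K), 1.0 / K)            # P(class | component)

        for _ in range(n_iter):
            # E-step: responsibilities from features AND label-presence evidence
            sq = ((X[:, None] - mu[None]) ** 2).sum(-1)            # (n, M)
            logp = np.log(a) - 0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var)
            logp[lab] += np.log(v + 1e-12) + np.log(beta[:, y[lab]].T + 1e-12)
            logp[~lab] += np.log(1 - v + 1e-12)
            g = np.exp(logp - logp.max(1, keepdims=True))
            g /= g.sum(1, keepdims=True)

            # M-step: weighted updates of every parameter, including v
            Nj = g.sum(0) + 1e-12
            a = Nj / n
            mu = (g.T @ X) / Nj[:, None]
            sq = ((X[:, None] - mu[None]) ** 2).sum(-1)
            var = (g * sq).sum(0) / (d * Nj) + 1e-6
            v = g[lab].sum(0) / Nj                 # labeled fraction per component
            for c in range(K):
                beta[:, c] = g[lab][y[lab] == c].sum(0)
            beta = beta / (beta.sum(1, keepdims=True) + 1e-12)
        return {"a": a, "mu": mu, "var": var, "v": v, "beta": beta}

    # Toy demo: two partially labeled classes plus an exclusively unlabeled cluster.
    rng = np.random.default_rng(1)
    XA = rng.normal([0, 0], 0.4, (60, 2)); yA = np.zeros(60, int); yA[40:] = -1
    XB = rng.normal([5, 5], 0.4, (60, 2)); yB = np.ones(60, int); yB[40:] = -1
    XC = rng.normal([10, 0], 0.4, (60, 2)); yC = np.full(60, -1)   # never labeled
    X, y = np.vstack([XA, XB, XC]), np.concatenate([yA, yB, yC])
    out = em_label_presence_mixture(X, y, M=3)
    j_new = int(np.argmin(((out["mu"] - [10, 0]) ** 2).sum(1)))    # comp near C
    ```

    In the demo, the component that lands on the never-labeled cluster should learn a labeling rate near zero, flagging it as a candidate outlier distribution or new class, while the components covering the two known classes retain substantial labeled mass.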

    Original language: English (US)
    Pages (from-to): 809-812
    Number of pages: 4
    Journal: Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
    Volume: 2
    State: Published - 2003


    All Science Journal Classification (ASJC) codes

    • Software
    • Signal Processing
    • Electrical and Electronic Engineering

    Cite this

    @article{965ded112a5c4bdfb5c90452edafd827,
    title = "A mixture model and EM algorithm for robust classification, outlier rejection, and class discovery",
    author = "Miller, {David Jonathan} and John Browning",
    year = "2003",
    language = "English (US)",
    volume = "2",
    pages = "809--812",
    journal = "Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing",
    issn = "0736-7791",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",

    }

    TY - JOUR

    T1 - A mixture model and EM algorithm for robust classification, outlier rejection, and class discovery

    AU - Miller, David Jonathan

    AU - Browning, John

    PY - 2003

    Y1 - 2003


    UR - http://www.scopus.com/inward/record.url?scp=0141788478&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=0141788478&partnerID=8YFLogxK

    M3 - Article

    VL - 2

    SP - 809

    EP - 812

    JO - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing

    JF - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing

    SN - 0736-7791

    ER -