In numerous industries, such as manufacturing, health care, and energy production, current sensor technology can generate enormous quantities of measurements of an object at low cost. Each measurement consists of several instances of interrelated variables, and the goal is to use the data to build a computer model that predicts the class of an object (such as the health condition of a patient or the quality of a manufactured part). Along with the sensor data, class labels for some objects are needed to train the model. While the sensor variables (e.g., medical images or chemical analyses) can frequently be obtained rapidly and inexpensively, the class label associated with each object might require time-consuming and expensive human effort. Therefore, care should be taken to select for labeling the objects that are most informative for building the predictive model. Often one selects objects iteratively, so that the class labels from the previously selected batch guide the choice of the next batch of objects to label. This is the role of a so-called active learning strategy. The purpose of this research is to find new active learning methods that accelerate model building and provide better predictions in systems where large data sets of attribute measurements are available. This will result in more efficient and productive systems that will benefit the U.S. economy and society.
Existing active learning methods are often based on strong assumptions about the joint input/output distribution or use a distance-based approach. These methods are susceptible to noise in the input space, assume numerical inputs only, and often work poorly in high dimensions. In applications, data sets are often large and noisy, and they contain missing values and mixed variable types. In this research, a non-parametric approach to the active learning problem is planned to address these challenges. The algorithm is based on a batch diversification strategy applied to an ensemble of decision trees. A novel active learning strategy that accounts for the geometric structure of the manifold where the unlabeled data resides will also be investigated. The geometric properties of the data space may result in more informative active learning solutions. This is a collaborative effort among Arizona State University, Pennsylvania State University, and Intel Corporation, which bring complementary expertise in machine learning and optimal design. The participation of Intel will help ensure the successful dissemination and broad applicability of the results.
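The abstract does not specify the algorithm, so the following is only an illustrative sketch of what batch diversification over a tree ensemble could look like, not the project's method. It assumes binary labels and numeric inputs, uses bagged depth-1 trees as a stand-in ensemble, and scores each candidate by vote disagreement multiplied by its dissimilarity to objects already in the batch, where similarity is the fraction of trees in which two objects share a leaf. All function names and the scoring rule are hypothetical choices made for this example.

```python
import random

def majority(labels):
    """Majority class of a list of 0/1 labels (ties go to 1, empty list to 0)."""
    return int(2 * sum(labels) >= len(labels)) if labels else 0

def fit_stump(X, y, rng):
    """Fit a depth-1 tree on a bootstrap sample, splitting on one random feature."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
    f = rng.randrange(len(X[0]))                          # random feature
    best = None
    for t in sorted({X[i][f] for i in idx}):
        left = [y[i] for i in idx if X[i][f] <= t]
        right = [y[i] for i in idx if X[i][f] > t]
        pl, pr = majority(left), majority(right)
        err = sum(v != pl for v in left) + sum(v != pr for v in right)
        if best is None or err < best[0]:
            best = (err, f, t, pl, pr)
    return best[1:]  # (feature, threshold, left label, right label)

def fit_ensemble(X, y, n_trees=25, seed=0):
    """Train n_trees bagged stumps on the labeled objects."""
    rng = random.Random(seed)
    return [fit_stump(X, y, rng) for _ in range(n_trees)]

def uncertainty(trees, x):
    """1.0 when the trees split their votes evenly, 0.0 when they are unanimous."""
    p = sum(pl if x[f] <= t else pr for f, t, pl, pr in trees) / len(trees)
    return 1.0 - abs(2.0 * p - 1.0)

def proximity(trees, a, b):
    """Fraction of trees in which a and b fall into the same leaf."""
    same = sum((a[f] <= t) == (b[f] <= t) for f, t, _, _ in trees)
    return same / len(trees)

def select_batch(trees, pool, k):
    """Greedily pick up to k pool indices that are uncertain and mutually dissimilar."""
    chosen = []
    candidates = list(range(len(pool)))
    while candidates and len(chosen) < k:
        def score(i):
            # Similarity to the closest object already in the batch.
            nearest = max((proximity(trees, pool[i], pool[j]) for j in chosen),
                          default=0.0)
            return uncertainty(trees, pool[i]) * (1.0 - nearest)
        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen

# Toy data (made up for illustration): a few labeled objects, a larger unlabeled pool.
rng = random.Random(1)
labeled = [[rng.random(), rng.random()] for _ in range(12)]
labels = [int(x[0] + x[1] > 1.0) for x in labeled]
pool = [[rng.random(), rng.random()] for _ in range(60)]

trees = fit_ensemble(labeled, labels)
batch = select_batch(trees, pool, k=5)
print("Pool indices to label next:", batch)
```

In an iterative setting, the objects in `batch` would be labeled, moved into the labeled set, and the ensemble refit before the next batch is chosen. Because the similarity measure relies on tree leaves rather than distances in the input space, the same idea could in principle carry over to trees that handle categorical inputs or missing values, which this simplified sketch does not.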
Effective start/end date: 9/1/15 → 8/31/19
- National Science Foundation: $175,000.00