Summary

Normally, when training a binary classifier, the training data consist of two labeled sets: positive examples and negative examples. In practice, however, the available data are often an incomplete set of positive examples together with a set of unlabeled examples (some positive, some negative), because there are not always enough resources to label everything. This brought our attention to positive-unlabeled learning (PU learning), a type of semi-supervised learning in which a binary classifier is learned from only positive and unlabeled data.

In PU learning, two sets are available for training: a positive set and an unlabeled set, which is assumed to contain both positive and negative samples without those samples being labeled as such. In "Learning classifiers from only positive and unlabeled data" (http://www.eecs.tufts.edu/~noto/pub/kdd08/elkan.kdd08.pdf), Charles Elkan and Keith Noto present a way to learn a standard binary classifier from this nontraditional training data, a positive-unlabeled dataset.

"Under the assumption that the labeled examples are selected randomly from the positive examples, we show that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive." This project is built around that statement from the paper. We want to show that a positive-unlabeled adapted estimator can match, or even exceed, the performance of a regular estimator.
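Concretely, the constant-factor result means that a classifier g(x) trained to separate labeled from unlabeled examples estimates p(s=1 | x) = c · p(y=1 | x), where c = p(s=1 | y=1) can be estimated as the average of g on held-out labeled positives. Below is a minimal sketch of that adaptation using scikit-learn and synthetic Gaussian data; the data generation and all variable names are illustrative assumptions, not the project's actual code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: two Gaussian blobs, y = 1 positive, y = 0 negative.
n = 2000
X = np.vstack([rng.normal(2.0, 1.0, size=(n // 2, 2)),
               rng.normal(-2.0, 1.0, size=(n // 2, 2))])
y = np.concatenate([np.ones(n // 2), np.zeros(n // 2)])

# PU setting: only a random subset of positives is labeled (s = 1);
# everything else, positive or negative, is unlabeled (s = 0).
s = np.zeros(n)
pos_idx = np.flatnonzero(y == 1)
s[rng.choice(pos_idx, size=len(pos_idx) // 2, replace=False)] = 1

# Step 1: train a "nontraditional" classifier g(x) ~ p(s=1 | x)
# on labeled-vs-unlabeled, holding out data to estimate the constant.
X_tr, X_hold, s_tr, s_hold = train_test_split(X, s, test_size=0.2,
                                              random_state=0)
g = LogisticRegression().fit(X_tr, s_tr)

# Step 2: estimate c = p(s=1 | y=1) as the mean of g over held-out
# labeled positives (Elkan and Noto's first estimator for c).
c = g.predict_proba(X_hold[s_hold == 1])[:, 1].mean()

# Step 3: adapted probabilities p(y=1 | x) = g(x) / c, clipped to [0, 1].
p_adapted = np.clip(g.predict_proba(X)[:, 1] / c, 0.0, 1.0)
```

Here half of the positives are labeled, so the estimated c should land near 0.5; dividing g by c rescales the nontraditional probabilities toward the true conditional probabilities of being positive.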

Comments