Procedure

We are using an open source program developed by Alexandre Drouin (https://github.com/aldro61/pu-learning) to verify that a PU-adapted classifier achieves better accuracy when the "negative" examples are contaminated with a number of positive examples.

The file puAdapter.py adapts any probabilistic binary classifier to positive-unlabeled learning using the method proposed by Elkan and Noto in their research paper. Let x be an example and let y ∈ {0, 1} be its binary label. Let s = 1 if the example x is labeled, and let s = 0 if x is unlabeled. Only positive examples are labeled, so y = 1 is certain when s = 1, but when s = 0, either y = 1 or y = 0 may be true. The puAdapter fits an estimator of p(s=1 | x), estimates the constant c = p(s=1 | y=1) during fitting, and then predicts p(y=1 | x) = p(s=1 | x) / c using the fitted estimator and the estimated value of c.
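The adaptation described above can be sketched as a small wrapper class. This is a minimal illustration, not the actual puAdapter.py from the repository; the class name `SimplePUAdapter` is ours, and, as a simplification, it estimates c on the labeled training positives rather than on a held-out set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class SimplePUAdapter:
    """Minimal sketch of the Elkan-Noto adaptation.

    Wraps any probabilistic binary classifier: fit it on s (labeled vs.
    unlabeled), estimate c = p(s=1 | y=1) as the mean predicted probability
    over the labeled positives, then rescale predictions to recover
    p(y=1 | x) = p(s=1 | x) / c.
    """

    def __init__(self, estimator):
        self.estimator = estimator
        self.c = 1.0

    def fit(self, X, s):
        # Step 1: train a traditional classifier to estimate p(s=1 | x).
        self.estimator.fit(X, s)
        # Step 2: estimate c = p(s=1 | y=1) by averaging p(s=1 | x) over
        # the labeled examples, which are all known to be positive.
        # (Elkan & Noto do this on a validation set; we use the training
        # positives here to keep the sketch short.)
        labeled = X[s == 1]
        self.c = self.estimator.predict_proba(labeled)[:, 1].mean()
        return self

    def predict_proba(self, X):
        # p(y=1 | x) = p(s=1 | x) / c, clipped into [0, 1].
        return np.clip(self.estimator.predict_proba(X)[:, 1] / self.c, 0.0, 1.0)
```

Any scikit-learn-style classifier exposing `predict_proba` could be passed in as the base estimator.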

Alexandre Drouin also created a test using a breast cancer dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29), which consists of 10 different attributes (clump thickness, cell size, cell shape, etc.) and 2 classes (benign or malignant). He took a few malignant examples, assigned them the benign label, and treated the benign examples as unlabeled. He gradually increased the contamination ratio (more and more positive examples hidden in the unlabeled set) to see how it affects the classifiers under test. He then compared the performance of the estimator with and without the puAdapter, assessing it using the F1 score, precision, and recall.
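The contamination protocol can be sketched as follows. This is our own hedged reconstruction, not Drouin's script: the function `contamination_experiment` is hypothetical, it uses synthetic two-cluster data in place of the UCI file to stay self-contained, and it scores only a plain (non-PU) classifier; the PU-adapted run is evaluated analogously in the real test.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import f1_score, precision_score, recall_score


def contamination_experiment(ratios=(0.0, 0.2, 0.4, 0.6), seed=42):
    """For each contamination ratio, hide that fraction of positive
    examples among the unlabeled set, train a plain classifier on the
    contaminated labels, and score it against the true labels."""
    rng = np.random.RandomState(seed)
    # Synthetic two-cluster data standing in for the UCI dataset.
    X = np.vstack([rng.normal(2, 1, (300, 5)), rng.normal(-2, 1, (300, 5))])
    y = np.array([1] * 300 + [0] * 300)
    pos_idx = np.where(y == 1)[0]
    results = {}
    for ratio in ratios:
        s = y.copy()
        hidden = rng.choice(pos_idx, size=int(round(ratio * len(pos_idx))),
                            replace=False)
        s[hidden] = 0  # these positives now look benign / unlabeled
        clf = SVC(probability=True, random_state=seed).fit(X, s)
        y_hat = (clf.predict_proba(X)[:, 1] > 0.5).astype(int)
        results[ratio] = (f1_score(y, y_hat),
                          precision_score(y, y_hat),
                          recall_score(y, y_hat))
    return results
```

Running this shows the pattern the test is designed to expose: as the contamination ratio grows, the unadapted classifier's recall on the true positives degrades.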

We will use the code and test provided by Alexandre Drouin to analyze the performance of different types of estimators (SVMs, random forests, decision trees, etc.) with and without the puAdapter.
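A sketch of how the candidate estimators could be swapped in, assuming scikit-learn implementations; the requirement is only that each base estimator exposes `predict_proba` (for `SVC` this means setting `probability=True`), since the adapter rescales those probabilities.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))
s = (X[:, 0] > 0).astype(int)  # toy stand-in for the labeled/unlabeled flag

# Candidate base estimators to compare with and without the adapter.
estimators = {
    "SVM": SVC(probability=True, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}
probs = {}
for name, est in estimators.items():
    est.fit(X, s)
    probs[name] = est.predict_proba(X)[:, 1]  # p(s=1 | x) for each estimator
```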

Comments