We’ll first demonstrate regression with noisy labels via the CleanLearning class, which can wrap any scikit-learn compatible regression model you have. CleanLearning uses your model to estimate label issues (i.e. noisy y-values) and to train a more robust version of the same model when the original data contains noisy labels.
Here we define a CleanLearning object with a histogram-based gradient boosting model (sklearn version of XGBoost) and use the find_label_issues method to find potential errors in our dataset’s numeric label column. Any other sklearn-compatible regression model could be used, such as LinearRegression or RandomForestRegressor (or you can easily wrap arbitrary custom models to be compatible with the sklearn API).
CleanLearning internally fits multiple copies of our regression model via cross-validation and bootstrapping in order to compute predictions and uncertainty estimates for the dataset. These are used to identify label issues (i.e. likely corrupted y-values).
This method returns a DataFrame containing a label quality score (between 0 and 1) for each example in your dataset. Lower scores indicate examples more likely to be mislabeled with an erroneous y-value. The DataFrame also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating its y-value appears potentially corrupted).
Fixing the label issues manually may be time-consuming, but cleanlab can filter these noisy examples and train a model on the remaining clean data for you automatically.
Now that we have a baseline, let’s check if using CleanLearning improves our performance on the test set.
CleanLearning provides a wrapper that can be applied to any scikit-learn compatible model. The resulting model object can be used in the same manner, but it will now train more robustly if the data has noisy labels.
The CleanLearning workflow above requires a sklearn-compatible model. If your model or data format is not compatible with the requirements for using CleanLearning, you can instead run cross-validation on your regression model to get out-of-sample predictions, and then use the Datalab audit to estimate label quality scores for each example in your dataset.