Once a classifier is trained, the predict method outputs class label predictions obtained by thresholding either the decision_function or the predict_proba output. For a binary classifier, the default threshold is a posterior probability estimate of 0.5 or a decision score of 0.0.
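As a minimal sketch of this default behaviour, the snippet below checks that thresholding by hand reproduces the labels returned by predict. The synthetic dataset and the choice of LogisticRegression are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, used only for illustration.
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)

# Threshold the probability of the positive class (second column) at 0.5.
proba_positive = model.predict_proba(X)[:, 1]
labels_from_proba = model.classes_[(proba_positive > 0.5).astype(int)]

# Threshold the decision score at 0.0.
scores = model.decision_function(X)
labels_from_scores = model.classes_[(scores > 0.0).astype(int)]

# Both manual thresholdings are compared with the output of predict.
same_as_proba = np.array_equal(model.predict(X), labels_from_proba)
same_as_scores = np.array_equal(model.predict(X), labels_from_scores)
print(same_as_proba, same_as_scores)
```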
However, this default strategy is most likely not optimal for the task at hand. Here, we use the “Statlog” German credit dataset [1] to illustrate a use case. In this dataset, the task is to predict whether a person has “good” or “bad” credit. In addition, a cost matrix is provided that specifies the cost of misclassification. Specifically, misclassifying a “bad” credit as “good” is five times more costly on average than misclassifying a “good” credit as “bad”.
In this first section, we illustrate the use of the TunedThresholdClassifierCV in a cost-sensitive learning setting where the gains and costs associated with each entry of the confusion matrix are constant. We use the problem setting presented in [2] with the “Statlog” German credit dataset [1].
Another observation is that the dataset is imbalanced. We need to be careful when evaluating our predictive model and use a family of metrics adapted to this setting.
In this section, we define a set of metrics that we use later. To see the effect of tuning the cut-off point, we evaluate the predictive model using the Receiver Operating Characteristic (ROC) curve and the Precision-Recall curve. The values reported on these plots are therefore the true positive rate (TPR), also known as the recall or the sensitivity, and the false positive rate (FPR), equal to one minus the specificity, for the ROC curve, and the precision and recall for the Precision-Recall curve.
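To make these definitions concrete, here is a small worked example. The confusion-matrix counts are hypothetical and stand for a single cut-off point.

```python
# Hypothetical confusion-matrix counts at one cut-off point.
tp, fn = 40, 10    # 50 actual positives
fp, tn = 30, 120   # 150 actual negatives

tpr = tp / (tp + fn)          # true positive rate (recall, sensitivity)
fpr = fp / (fp + tn)          # false positive rate
specificity = tn / (tn + fp)  # true negative rate
precision = tp / (tp + fp)

# One point of the ROC curve is (fpr, tpr); one point of the
# Precision-Recall curve is (recall, precision) with recall == tpr.
print(f"TPR={tpr:.2f} FPR={fpr:.2f} precision={precision:.2f}")
```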
As previously stated, the “positive label” is not defined as the value “1”, and calling some of the metrics with this non-standard value raises an error. We need to indicate the “positive label” to the metrics.
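A short sketch of passing the positive label explicitly, assuming string labels “good” and “bad” with “bad” as the positive class. The toy label lists are made up for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels (hypothetical), with "bad" treated as the positive class.
y_true = ["bad", "good", "bad", "good", "bad"]
y_pred = ["bad", "good", "good", "good", "bad"]

# pos_label tells the metric which class counts as "positive".
recall_bad = recall_score(y_true, y_pred, pos_label="bad")
precision_bad = precision_score(y_true, y_pred, pos_label="bad")
print(recall_bad, precision_bad)
```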
In addition, the original research [1] defines a custom business metric. We call a “business metric” any metric function that aims at quantifying how the predictions (correct or wrong) might impact the business value of deploying a given machine learning model in a specific application context. For our credit prediction task, the authors provide a custom cost matrix which encodes that classifying a “bad” credit as “good” is 5 times more costly on average than the opposite: it is less costly for the financing institution to refuse credit to a potential customer who would not have defaulted (and therefore miss a good customer who would have both repaid the credit and paid interest) than to grant credit to a customer who will default.
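One possible way to encode such a cost matrix is sketched below. The function name credit_gain_score, the cost units of 1 and 5 (only their ratio comes from the text), and the zero gain assigned to correct predictions are assumptions for illustration.

```python
import numpy as np


def credit_gain_score(y_true, y_pred, neg_label="good", pos_label="bad"):
    """Total gain (negative cost) of a set of predictions.

    Assumption: correct predictions bring a gain of 0; only the 1:5 cost
    ratio between the two kinds of error is taken from the text.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # "good" credit predicted as "bad": cost of 1 unit each.
    false_positives = np.sum((y_true == neg_label) & (y_pred == pos_label))
    # "bad" credit predicted as "good": cost of 5 units each.
    false_negatives = np.sum((y_true == pos_label) & (y_pred == neg_label))
    return int(-1 * false_positives - 5 * false_negatives)


y_true = ["good", "good", "bad", "bad", "good"]
y_pred = ["bad", "good", "good", "bad", "good"]
gain = credit_gain_score(y_true, y_pred)
print(gain)  # one false positive and one false negative
```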
We recall that these curves give insights on the statistical performance of the predictive model for different cut-off points. For the Precision-Recall curve, the reported metrics are the precision and recall and for the ROC curve, the reported metrics are the TPR (same as recall) and FPR.
Here, the different cut-off points correspond to different levels of posterior probability estimates ranging between 0 and 1. By default, model.predict uses a cut-off point at a probability estimate of 0.5. The metrics for such a cut-off point are reported with the blue dot on the curves: it corresponds to the statistical performance of the model when using model.predict.
At this stage we don’t know if any other cut-off can lead to a greater gain. To find the optimal one, we need to compute the cost-gain using the business metric for all possible cut-off points and choose the best. This strategy can be quite tedious to implement by hand, but the TunedThresholdClassifierCV class is here to help us. It automatically computes the cost-gain for all possible cut-off points and optimizes for the scoring.
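To give an idea of what the manual strategy involves, here is a minimal sketch of a threshold sweep. The labels, the probabilities, the grid of 21 thresholds and the 1:5 cost units are all hypothetical, and no cross-validation is performed.

```python
import numpy as np


def gain(y_true, y_pred):
    # Positive class 1 stands for "bad" credit (assumed encoding):
    # a false negative costs 5 units, a false positive costs 1 unit.
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return -1 * fp - 5 * fn


# Hypothetical labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.03, 0.12, 0.22, 0.37, 0.41, 0.62, 0.71, 0.93])

# Evaluate the gain for every candidate cut-off point and keep the best.
thresholds = np.linspace(0.0, 1.0, 21)
gains = np.array([gain(y_true, (proba >= t).astype(int)) for t in thresholds])
best_threshold = thresholds[np.argmax(gains)]
default_gain = gain(y_true, (proba >= 0.5).astype(int))

print(best_threshold, gains.max(), default_gain)
```

On this toy data the best cut-off lies below 0.5, because missing a positive sample is penalized more heavily than raising a false alarm.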
We use TunedThresholdClassifierCV to tune the cut-off point. We need to provide the business metric to optimize as well as the positive label. Internally, the optimum cut-off point is chosen such that it maximizes the business metric via cross-validation. By default a 5-fold stratified cross-validation is used.
We plot the ROC and Precision-Recall curves for the vanilla model and the tuned model. We also plot the cut-off points that would be used by each model. Because we reuse the same code later, we define a function that generates the plots.
The first remark is that both classifiers have exactly the same ROC and Precision-Recall curves. This is expected because, by default, the classifier is fitted on the same training data. In a later section, we discuss in more detail the available options regarding model refitting and cross-validation.
The second remark is that the cut-off points of the vanilla and tuned models are different. To understand why the tuned model has chosen this cut-off point, we can look at the right-hand side plot, which shows the objective score, which is exactly our business metric. We see that the optimum threshold corresponds to the maximum of the objective score. This maximum is reached for a decision threshold much lower than 0.5: the tuned model enjoys a much higher recall at the cost of a significantly lower precision, because it is much more eager to predict the “bad” class label for a larger fraction of individuals.
In the above experiment, we used the default setting of the TunedThresholdClassifierCV. In particular, the cut-off point is tuned using a 5-fold stratified cross-validation. Also, the underlying predictive model is refitted on the entire training data once the cut-off point is chosen.
We observe that the optimum cut-off point is different from the one found in the previous experiment. If we look at the right-hand side plot, we observe that the business gain has a large plateau of near-optimal gain, close to 0, over a wide span of decision thresholds. This behavior is symptomatic of overfitting. Because we disabled cross-validation, we tuned the cut-off point on the same set as the model was trained on, and this is the reason for the observed overfitting.
This option should therefore be used with caution. One needs to make sure that the data provided at fitting time to the TunedThresholdClassifierCV is not the same as the data used to train the underlying classifier. Such a setup can make sense when the goal is only to tune the predictive model on a completely new validation set without a costly complete refit.
As stated in [2], gains and costs are generally not constant in real-world problems. In this section, we use an example similar to the one in [2] for the problem of detecting fraud in credit card transaction records.
The dataset contains information about credit card records, some of which are fraudulent and others legitimate. The goal is therefore to predict whether or not a credit card record is fraudulent.
The dataset is highly imbalanced, with fraudulent transactions representing only 0.17% of the data. Since we are interested in training a machine learning model, we should also make sure that we have enough samples in the minority class to train the model.
We observe that we have around 500 samples, which is on the low end of the number of samples required to train a machine learning model. In addition to the target distribution, we check the distribution of the amount of the fraudulent transactions.
Now, we create the business metric that depends on the amount of each transaction. We define the cost matrix similarly to [2]. Accepting a legitimate transaction provides a gain of 2% of the amount of the transaction. However, accepting a fraudulent transaction results in a loss of the amount of the transaction. As stated in [2], the gain and loss related to refusals (of fraudulent and legitimate transactions) are not trivial to define. Here, we estimate the refusal of a legitimate transaction as a loss of 5€, and the refusal of a fraudulent transaction as a gain of 50€ plus the amount of the transaction. Therefore, we define the following function to compute the total benefit of a given decision:
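The function itself is not reproduced in these notes. The sketch below is a reconstruction that follows the rules stated above literally. The encoding of fraud as class 1, the variable names and the toy data are assumptions.

```python
import numpy as np


def business_metric(y_true, y_pred, amount):
    """Total benefit in euros; class 1 is assumed to mean "fraudulent"."""
    y_true, y_pred, amount = map(np.asarray, (y_true, y_pred, amount))
    refused_fraud = (y_true == 1) & (y_pred == 1)
    accepted_legit = (y_true == 0) & (y_pred == 0)
    refused_legit = (y_true == 0) & (y_pred == 1)
    accepted_fraud = (y_true == 1) & (y_pred == 0)

    # Refusing a fraud: gain of 50 euros plus the transaction amount.
    gain_refused_fraud = 50 * refused_fraud.sum() + amount[refused_fraud].sum()
    # Accepting a legitimate transaction: gain of 2% of the amount.
    gain_accepted_legit = 0.02 * amount[accepted_legit].sum()
    # Refusing a legitimate transaction: loss of 5 euros.
    loss_refused_legit = 5 * refused_legit.sum()
    # Accepting a fraud: loss of the transaction amount.
    loss_accepted_fraud = amount[accepted_fraud].sum()
    return (
        gain_refused_fraud
        + gain_accepted_legit
        - loss_refused_legit
        - loss_accepted_fraud
    )


# Four toy transactions, one for each possible outcome.
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 0])
amount = np.array([100.0, 20.0, 300.0, 40.0])
total = business_metric(y_true, y_pred, amount)
print(total)  # (50 + 300) + 0.02 * 100 - 5 - 40
```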
From this business metric, we create a scikit-learn scorer that, given a fitted classifier and a test set, computes the business metric. For this purpose, we use the make_scorer factory. The variable amount is additional metadata to be passed to the scorer, and we need to use metadata routing to take this information into account.
At this stage, we observe that the amount of the transaction is used twice: once as a feature to train our predictive model and once as metadata to compute the business metric and thus the statistical performance of our model. When used as a feature, we are only required to have a column in data that contains the amount of each transaction. To use this information as metadata, we need to have an external variable that we can pass to the scorer or to the model that internally routes this metadata to the scorer. So let’s create this variable.
It is not a surprise that the balanced accuracy is at 0.5 for both classifiers. However, we need to be careful in the rest of the evaluation: we can potentially obtain a model with a decent balanced accuracy that does not make any profit. In this case, the model would be harmful for our business.
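To see why 0.5 is expected, assume the two classifiers are constant baselines, one always predicting the legitimate class and one always predicting the fraudulent class. The 98/2 class split below is hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced target: 98 legitimate (0) and 2 fraudulent (1).
y_true = np.array([0] * 98 + [1] * 2)
always_legit = np.zeros_like(y_true)
always_fraud = np.ones_like(y_true)

# Balanced accuracy averages the recall of each class, so a constant
# prediction scores (1 + 0) / 2 whichever class it always predicts.
score_legit = balanced_accuracy_score(y_true, always_legit)
score_fraud = balanced_accuracy_score(y_true, always_fraud)

# Plain accuracy, in contrast, rewards predicting the majority class.
plain_accuracy = accuracy_score(y_true, always_legit)
print(score_legit, score_fraud, plain_accuracy)
```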
By observing the balanced accuracy, we see that our predictive model is learning some associations between the features and the target. The business metric also shows that our model is beating the baseline in terms of profit and that it would already be beneficial to use it instead of ignoring the fraud detection problem.
Now the question is: is our model optimal for the type of decision that we want to make? Up to now, we did not optimize the decision threshold. We use the TunedThresholdClassifierCV to optimize the decision given our business scorer. To avoid a nested cross-validation, we use the best estimator found during the previous grid-search.
Since our business scorer requires the amount of each transaction, we need to pass this information in the fit method. The TunedThresholdClassifierCV is in charge of automatically dispatching this metadata to the underlying scorer.
We observe that tuning the decision threshold increases the expected profit of deploying our model as estimated by the business metric. As it turns out, the balanced accuracy also increased. Note that this might not always be the case because the statistical metric is not necessarily a surrogate of the business metric. It is therefore important, whenever possible, to optimize the decision threshold with respect to the business metric.
Finally, the estimate of the business metric itself can be unreliable, in particular when the number of data points in the minority class is so small. Any business impact estimated by cross-validation of a business metric on historical data (offline evaluation) should ideally be confirmed by A/B testing on live data (online evaluation). Note however that A/B testing models is beyond the scope of the scikit-learn library itself.
In the previous example, we used the TunedThresholdClassifierCV to find the optimal decision threshold. However, in some cases, we might have some prior knowledge about the problem at hand and we might be happy to set the decision threshold manually.
The class FixedThresholdClassifier allows us to manually set the decision threshold. At prediction time, it behaves like the previous tuned model, but no search is performed during the fitting process.
We observe that we obtained the exact same results but the fitting process was much faster since we did not perform any search.