
Tune In: Decision Threshold Optimization with scikit-learn's TunedThresholdClassifierCV | by Kevin Arvai | May 2024


Use cases and code to explore the new class that helps tune decision thresholds in scikit-learn

Towards Data Science

The 1.5 release of scikit-learn includes a new class, TunedThresholdClassifierCV, which makes optimizing decision thresholds from scikit-learn classifiers easier. A decision threshold is a cut-off point that converts predicted probabilities output by a machine learning model into discrete classes. The default decision threshold of the .predict() method from scikit-learn classifiers in a binary classification setting is 0.5. Although this is a sensible default, it is rarely the best choice for classification tasks.
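To make that default concrete, here is a minimal sketch (synthetic data, not this post's datasets) showing that a binary classifier's .predict() is equivalent to thresholding predict_proba at 0.5, and that moving the threshold changes which samples are flagged positive:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]       # P(class == 1)
manual = (proba >= 0.5).astype(int)      # the implicit default threshold
assert np.array_equal(manual, clf.predict(X))

# Lowering the threshold flags at least as many samples as positive
aggressive = (proba >= 0.2).astype(int)
print(aggressive.sum() >= manual.sum())
```

Everything this post does boils down to choosing that cut-off deliberately instead of accepting 0.5.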

This post introduces the TunedThresholdClassifierCV class and demonstrates how it can optimize decision thresholds for various binary classification tasks. This new class will help bridge the gap between data scientists who build models and business stakeholders who make decisions based on the model's output. By fine-tuning decision thresholds, data scientists can enhance model performance and better align with business objectives.

This post will cover the following situations where tuning decision thresholds is beneficial:

  1. Maximizing a metric: Use this when choosing a threshold that maximizes a scoring metric, like the F1 score.
  2. Cost-sensitive learning: Adjust the threshold when the cost of misclassifying a false positive is not equal to the cost of misclassifying a false negative, and you have an estimate of the costs.
  3. Tuning under constraints: Optimize the operating point on the ROC or precision-recall curve to meet specific performance constraints.

The code used in this post and links to the datasets are available on GitHub.

Let's get started! First, import the necessary libraries, read the data, and split it into training and test sets.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    RocCurveDisplay,
    confusion_matrix,
    f1_score,
    make_scorer,
    recall_score,
    roc_curve,
)
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

RANDOM_STATE = 42  # fixed seed for reproducibility (any value works)


Maximizing a metric

Before starting the model-building process in any machine learning project, it is crucial to work with stakeholders to determine which metric(s) to optimize. Making this decision early ensures that the project aligns with its intended goals.

Using an accuracy metric in fraud detection use cases to evaluate model performance is not ideal because the data is often imbalanced, with most transactions being non-fraudulent. The F1 score is the harmonic mean of precision and recall and is a better metric for imbalanced datasets like fraud detection. Let's use the TunedThresholdClassifierCV class to optimize the decision threshold of a logistic regression model to maximize the F1 score.
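As a quick numeric check of that harmonic-mean relationship (toy labels here, not the fraud data), F1 computed by hand matches scikit-learn's f1_score:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

p = precision_score(y_true, y_pred)  # 2 TP out of 3 predicted positives
r = recall_score(y_true, y_pred)     # 2 TP out of 3 actual positives
harmonic = 2 * p * r / (p + r)       # harmonic mean of precision and recall
assert np.isclose(harmonic, f1_score(y_true, y_pred))
print(round(harmonic, 3))  # 0.667
```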

We'll use the Kaggle Credit Card Fraud Detection dataset to introduce the first situation where we need to tune a decision threshold. First, split the data into train and test sets, then create a scikit-learn pipeline to scale the data and train a logistic regression model. Fit the pipeline on the training data so we can compare the original model's performance with the tuned model's performance.

creditcard = pd.read_csv("data/creditcard.csv")
y = creditcard["Class"]
X = creditcard.drop(columns=["Class"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

# Only Time and Amount need to be scaled
original_fraud_model = make_pipeline(
    ColumnTransformer(
        [("scaler", StandardScaler(), ["Time", "Amount"])],
        remainder="passthrough",
    ),
    LogisticRegression(),
)
original_fraud_model.fit(X_train, y_train)

No tuning has happened yet, but it's coming in the next code block. The arguments for TunedThresholdClassifierCV are similar to other CV classes in scikit-learn, such as GridSearchCV. At a minimum, the user only needs to pass the original estimator, and TunedThresholdClassifierCV will store the decision threshold that maximizes balanced accuracy (default) using 5-fold stratified K-fold cross-validation (default). It also uses this threshold when calling .predict(). However, any scikit-learn metric (or callable) can be used as the scoring metric. Additionally, the user can pass the familiar cv argument to customize the cross-validation strategy.

Create the TunedThresholdClassifierCV instance and fit the model on the training data. Pass the original model and set the scoring to "f1". We'll also want to set store_cv_results=True to access the thresholds evaluated during cross-validation for visualization.

tuned_fraud_model = TunedThresholdClassifierCV(
    original_fraud_model,
    scoring="f1",
    store_cv_results=True,
)
tuned_fraud_model.fit(X_train, y_train)

# Average F1 across folds
avg_f1_train = tuned_fraud_model.best_score_
# Compare F1 on the test set for the tuned model and the original model
f1_test = f1_score(y_test, tuned_fraud_model.predict(X_test))
f1_test_original = f1_score(y_test, original_fraud_model.predict(X_test))

print(f"Average F1 on the training set: {avg_f1_train:.3f}")
print(f"F1 on the test set: {f1_test:.3f}")
print(f"F1 on the test set (original model): {f1_test_original:.3f}")
print(f"Threshold: {tuned_fraud_model.best_threshold_:.3f}")

Average F1 on the training set: 0.784
F1 on the test set: 0.796
F1 on the test set (original model): 0.733
Threshold: 0.071

Now that we've found the threshold that maximizes the F1 score, check tuned_fraud_model.best_score_ to see the best average F1 score across folds in cross-validation. We can also see which threshold generated that result using tuned_fraud_model.best_threshold_. Because we set store_cv_results=True, you can visualize the metric scores across the candidate decision thresholds using the cv_results_ attribute:

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(tuned_fraud_model.cv_results_["thresholds"],
        tuned_fraud_model.cv_results_["scores"], color="#c0c0c0")
ax.axvline(tuned_fraud_model.best_threshold_, color="#ff6700", linestyle="--",
           label=f"Optimal cut-off point = {tuned_fraud_model.best_threshold_:.2f}")
ax.axvline(0.5, color="black", linestyle=":", label="Default threshold: 0.5")
ax.legend(fontsize=8, loc="lower center")
ax.set_xlabel("Decision threshold", fontsize=10)
ax.set_ylabel("F1 score", fontsize=10)
ax.set_title("F1 score vs. Decision threshold -- Cross-validation", fontsize=12)
Image created by the author.
# Check that the coefficients from the original model and the tuned model are the same
assert (tuned_fraud_model.estimator_[-1].coef_ ==
        original_fraud_model[-1].coef_).all()

We've used the same underlying logistic regression model to evaluate two different decision thresholds. The underlying models are identical, as evidenced by the coefficient equality in the assert statement above. Optimization in TunedThresholdClassifierCV is achieved using post-processing techniques, which are applied directly to the predicted probabilities output by the model. However, it is important to note that TunedThresholdClassifierCV uses cross-validation by default to find the decision threshold, to avoid overfitting to the training data.

Cost-sensitive learning

Cost-sensitive learning is a type of machine learning that assigns a cost to each type of misclassification. This translates model performance into units that stakeholders understand, like dollars saved.
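The arithmetic is simple: element-wise multiply a confusion matrix by a cost/gain matrix and sum. A small numeric sketch (illustrative counts, not from this post's data):

```python
import numpy as np

# rows: actual [negative, positive]; columns: predicted [negative, positive]
cm = np.array([[900, 50],     # 50 false positives
               [ 20, 30]])    # 30 true positives

gain_matrix = np.array([[0, -80],    # each false positive costs $80
                        [0, 200]])   # each caught positive is worth $200

# Total dollar value of this set of predictions
print(np.sum(cm * gain_matrix))  # 50*-80 + 30*200 = 2000
```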

We will use the TELCO customer churn dataset, a binary classification dataset, to demonstrate the value of cost-sensitive learning. The goal is to predict whether a customer will churn, given features about the customer's demographics, contract details, and other technical information about the customer's account. The motivation to use this dataset (and some of the code) comes from Dan Becker's course on decision threshold optimization.

data = pd.read_excel("data/Telco_customer_churn.xlsx")
drop_cols = [
    "Count", "Country", "State", "Lat Long", "Latitude", "Longitude",
    "Zip Code", "Churn Value", "Churn Score", "CLTV", "Churn Reason"
]
data.drop(columns=drop_cols, inplace=True)

# Preprocess the data
data["Churn Label"] = data["Churn Label"].map({"Yes": 1, "No": 0})
data.drop(columns=["Total Charges"], inplace=True)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=["Churn Label"]),
    data["Churn Label"],
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=data["Churn Label"],
)

Set up a basic pipeline for processing the data and generating predicted probabilities with a random forest model. This will serve as a baseline to compare against the TunedThresholdClassifierCV.

preprocessor = ColumnTransformer(
    transformers=[("one_hot", OneHotEncoder(),
                   selector(dtype_include="object"))],
    remainder="passthrough",
)

original_churn_model = make_pipeline(
    preprocessor, RandomForestClassifier(random_state=RANDOM_STATE)
)
original_churn_model.fit(X_train.drop(columns=["CustomerID"]), y_train);

The choice of preprocessing and model type is not critical for this tutorial. The company wants to offer discounts to customers who are predicted to churn. During collaboration with stakeholders, you learn that giving a discount to a customer who will not churn (a false positive) costs $80. You also learn that it is worth $200 to offer a discount to a customer who would have churned. You can represent this relationship in a cost matrix:

def cost_function(y, y_pred, neg_label, pos_label):
    cm = confusion_matrix(y, y_pred, labels=[neg_label, pos_label])
    cost_matrix = np.array([[0, -80], [0, 200]])
    return np.sum(cm * cost_matrix)


cost_scorer = make_scorer(cost_function, neg_label=0, pos_label=1)

We also wrapped the cost function in a scikit-learn custom scorer. This scorer will be used as the scoring argument in the TunedThresholdClassifierCV and to evaluate profit on the test set.

tuned_churn_model = TunedThresholdClassifierCV(
    original_churn_model,
    scoring=cost_scorer,
    store_cv_results=True,
)
tuned_churn_model.fit(X_train.drop(columns=["CustomerID"]), y_train)

# Calculate the profit on the test set
original_model_profit = cost_scorer(
    original_churn_model, X_test.drop(columns=["CustomerID"]), y_test
)
tuned_model_profit = cost_scorer(
    tuned_churn_model, X_test.drop(columns=["CustomerID"]), y_test
)

print(f"Original model profit: {original_model_profit}")
print(f"Tuned model profit: {tuned_model_profit}")

Original model profit: 29640
Tuned model profit: 35600

The profit is higher with the tuned model than with the original. Again, we can plot the objective metric against the decision thresholds to visualize how the decision threshold was selected on the training data during cross-validation:

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(tuned_churn_model.cv_results_["thresholds"],
        tuned_churn_model.cv_results_["scores"], color="#c0c0c0",
        label="Objective score (using cost-matrix)")
ax.axvline(tuned_churn_model.best_threshold_, color="#ff6700", linestyle="--",
           label="Optimal cut-off point for the business metric")
ax.legend()
ax.set_xlabel("Decision threshold (probability)")
ax.set_ylabel("Objective score (using cost-matrix)")
ax.set_title("Objective score as a function of the decision threshold")

Image created by the author.

In reality, assigning a static cost to all instances that are misclassified in the same way is not realistic from a business perspective. There are more advanced methods that tune the threshold by assigning a weight to each instance in the dataset. This is covered in scikit-learn's cost-sensitive learning example.
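To give a flavor of that per-instance idea, here is a hedged sketch (hypothetical numbers and a made-up `variable_profit` helper, not scikit-learn's example): instead of one flat cost matrix, each caught churner is credited with that customer's own value, computed directly with NumPy.

```python
import numpy as np

def variable_profit(y_true, y_pred, amount):
    """Per-instance gain: catching a churner is worth that customer's value;
    each false positive costs a flat $80 discount."""
    y_true, y_pred, amount = map(np.asarray, (y_true, y_pred, amount))
    true_positive_gain = ((y_true == 1) & (y_pred == 1)) * amount
    false_positive_cost = ((y_true == 0) & (y_pred == 1)) * -80
    return (true_positive_gain + false_positive_cost).sum()

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0])
amount = np.array([300, 120, 250, 90])  # hypothetical per-customer values
print(variable_profit(y_true, y_pred, amount))  # 300 - 80 = 220
```

A function like this could be wrapped with make_scorer just as the flat cost function was, provided the per-instance amounts are routed to it.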

Tuning under constraints

This method is not currently covered in the scikit-learn documentation, but it is a common business case for binary classification. The tuning-under-constraint method finds a decision threshold by identifying a point on either the ROC or precision-recall curve: the maximum value of one axis while constraining the other. For this walkthrough, we'll use the Pima Indians diabetes dataset, a binary classification task to predict whether an individual has diabetes.

Imagine that your model will be used as a screening test for an average-risk population, applied to millions of people. There are an estimated 38 million people with diabetes in the US, roughly 11.6% of the population, so the model's specificity needs to be high so it doesn't misdiagnose millions of people and refer them for unnecessary confirmatory testing. Suppose your imaginary CEO has communicated that they will not tolerate more than a 2% false positive rate. Let's build a model that achieves this using TunedThresholdClassifierCV.
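Some back-of-the-envelope arithmetic (the US population figure here is an assumption for illustration) shows why even a 2% false positive rate matters at screening scale:

```python
# Rough scale of false positives at a national screening program
population = 333_000_000       # assumed US population
prevalence = 0.116             # ~11.6% have diabetes
negatives = population * (1 - prevalence)

fpr = 0.02                     # the CEO's 2% false positive cap
print(f"{negatives * fpr / 1e6:.1f} million false positives")  # ~5.9 million
```

Even under the cap, millions of people would be referred for confirmatory testing, which is why the threshold is a business decision and not just a modeling detail.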

For this part of the tutorial, we'll define a constraint function that will be used to find the maximum true positive rate at a 2% false positive rate.

def max_tpr_at_tnr_constraint_score(y_true, y_pred, max_tnr=0.5):
    fpr, tpr, thresholds = roc_curve(y_true, y_pred, drop_intermediate=False)
    tnr = 1 - fpr
    tpr_at_tnr_constraint = tpr[tnr >= max_tnr].max()
    return tpr_at_tnr_constraint


max_tpr_at_tnr_scorer = make_scorer(
    max_tpr_at_tnr_constraint_score, max_tnr=0.98)

data = pd.read_csv("data/diabetes.csv")

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(columns=["Outcome"]),
    data["Outcome"],
    stratify=data["Outcome"],
    test_size=0.2,
    random_state=RANDOM_STATE,
)
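As a quick sanity check of the constraint function (toy scores, not the diabetes data; the function is redefined here so the snippet stands alone): tightening the TNR floor restricts which points on the ROC curve are eligible, lowering the achievable TPR.

```python
import numpy as np
from sklearn.metrics import roc_curve

def max_tpr_at_tnr_constraint_score(y_true, y_pred, max_tnr=0.5):
    fpr, tpr, thresholds = roc_curve(y_true, y_pred, drop_intermediate=False)
    tnr = 1 - fpr
    return tpr[tnr >= max_tnr].max()

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.8, 0.4, 0.6, 0.7, 0.9])

# Unconstrained, the best reachable TPR is 1.0; requiring TNR >= 0.9
# leaves only the high-specificity end of the curve, where TPR is 0.25.
print(max_tpr_at_tnr_constraint_score(y_true, y_score, max_tnr=0.0))  # 1.0
print(max_tpr_at_tnr_constraint_score(y_true, y_score, max_tnr=0.9))  # 0.25
```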

Build two models: a logistic regression to serve as a baseline, and a TunedThresholdClassifierCV that wraps the baseline logistic regression to achieve the goal outlined by the CEO. In the tuned model, set scoring=max_tpr_at_tnr_scorer. Again, the choice of model and preprocessing is not critical for this tutorial.

# A baseline model
original_model = make_pipeline(
    StandardScaler(), LogisticRegression(random_state=RANDOM_STATE)
)
original_model.fit(X_train, y_train)

# A tuned model
tuned_model = TunedThresholdClassifierCV(
    original_model,
    scoring=max_tpr_at_tnr_scorer,
    thresholds=np.linspace(0, 1, 150),
)
tuned_model.fit(X_train, y_train)

Compare the default decision threshold from scikit-learn estimators, 0.5, with the one found using the tuning-under-constraint approach on the ROC curve.

# Get the fpr and tpr of the original model
original_model_proba = original_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, original_model_proba)
closest_threshold_to_05 = (np.abs(thresholds - 0.5)).argmin()
fpr_orig = fpr[closest_threshold_to_05]
tpr_orig = tpr[closest_threshold_to_05]

# Get the tnr and tpr of the tuned model
max_tpr = tuned_model.best_score_
constrained_tnr = 0.98

# Plot the ROC curve and compare the default threshold to the tuned threshold
fig, ax = plt.subplots(figsize=(5, 5))
# Note that the curve will be the same for both models
disp = RocCurveDisplay.from_estimator(
    original_model, X_test, y_test, name="Logistic Regression", ax=ax
)
ax.plot(1 - constrained_tnr, max_tpr, marker="^", markersize=10,
        label=f"Tuned threshold: {tuned_model.best_threshold_:.2f}")
ax.plot(fpr_orig, tpr_orig, marker="o", markersize=10,
        label="Default threshold: 0.5")
ax.legend(fontsize=8)
disp.ax_.set_ylabel("True Positive Rate", fontsize=8)
disp.ax_.set_xlabel("False Positive Rate", fontsize=8)

Image created by the author.

The tuning-under-constraint method found a threshold of 0.80, which resulted in an average sensitivity of 19.2% during cross-validation on the training data. Compare the sensitivity and specificity to see how the threshold holds up on the test set. Did the model meet the CEO's specificity requirement?

# Average sensitivity and specificity on the training set
avg_sensitivity_train = tuned_model.best_score_

# Call predict from tuned_model to calculate sensitivity and specificity on the test set
specificity_test = recall_score(
    y_test, tuned_model.predict(X_test), pos_label=0)
sensitivity_test = recall_score(y_test, tuned_model.predict(X_test))

print(f"Average sensitivity on the training set: {avg_sensitivity_train:.3f}")
print(f"Sensitivity on the test set: {sensitivity_test:.3f}")
print(f"Specificity on the test set: {specificity_test:.3f}")

Average sensitivity on the training set: 0.192
Sensitivity on the test set: 0.148
Specificity on the test set: 0.990


The new TunedThresholdClassifierCV class is a powerful tool that can help you become a better data scientist by letting you show business leaders how you arrived at a decision threshold. You learned how to use the new scikit-learn TunedThresholdClassifierCV class to maximize a metric, perform cost-sensitive learning, and tune a metric under a constraint. This tutorial was not intended to be comprehensive or advanced; I wanted to introduce the new feature and highlight its power and flexibility for solving binary classification problems. Please check out the scikit-learn documentation, user guide, and examples for thorough usage examples.

A big shoutout to Guillaume Lemaitre for his work on this feature.

Thanks for reading. Happy tuning.

Data Licenses:
Credit card fraud: DbCL
Pima Indians diabetes: CC0
TELCO churn: commercial use OK

