
Logistic Regression, Explained: A Visual Guide with Code Examples for Beginners | by Samy Baladram | Sep, 2024



CLASSIFICATION ALGORITHM

Finding the right weights to fit the data

Towards Data Science

While some probability-based machine learning models (like Naive Bayes) make bold assumptions about feature independence, logistic regression takes a more measured approach. Think of it as drawing a line (or plane) that separates two outcomes, allowing us to predict probabilities with a bit more flexibility.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Logistic regression is a statistical method used for predicting binary outcomes. Despite its name, it is used for classification rather than regression. It estimates the probability that an instance belongs to a particular class. If the estimated probability is greater than 50%, the model predicts that the instance belongs to that class; otherwise, it predicts the other class.

Throughout this article, we'll use this artificial golf dataset (inspired by [1]) as an example. This dataset predicts whether a person will play golf based on weather conditions.

Just like in KNN, logistic regression requires the data to be prepared first. Convert categorical columns into 0 & 1 and also scale the numerical features so that no single feature dominates the training process.

Columns: ‘Outlook’, ‘Temperature’, ‘Humidity’, ‘Wind’ and ‘Play’ (target feature). The categorical columns (Outlook & Wind) are encoded using one-hot encoding while the numerical columns are scaled using standard scaling (z-normalization).
# Import required libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Create dataset from dictionary
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data: encode categorical variables
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rainy', 'Temperature', 'Humidity', 'Wind', 'Play']
df = df[column_order]

# Split data into features and target
X, y = df.drop(columns='Play'), df['Play']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
scaler = StandardScaler()
X_train[['Temperature', 'Humidity']] = scaler.fit_transform(X_train[['Temperature', 'Humidity']])
X_test[['Temperature', 'Humidity']] = scaler.transform(X_test[['Temperature', 'Humidity']])

# Print results
print("Training set:")
print(pd.concat([X_train, y_train], axis=1), '\n')
print("Test set:")
print(pd.concat([X_test, y_test], axis=1))

Logistic regression works by applying the logistic function to a linear combination of the input features. Here's how it operates:

  1. Calculate a weighted sum of the input features (similar to linear regression).
  2. Apply the logistic function (also called the sigmoid function) to this sum, which maps any real number to a value between 0 and 1.
  3. Interpret this value as the probability of belonging to the positive class.
  4. Use a threshold (typically 0.5) to make the final classification decision.
For our golf dataset, logistic regression might combine the weather factors into a single score, then transform this score into a probability of playing golf.
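
Here is a minimal sketch of that pipeline in plain NumPy. The weights, bias, and input values below are made up purely for illustration; the real values are learned during training, as described next.

import numpy as np

# Hypothetical weights for [sunny, overcast, rainy, Temperature, Humidity, Wind]
weights = np.array([-0.3, 0.8, -0.2, -0.4, -0.6, -0.5])
bias = 0.1

# One (already scaled) example: a sunny, hot, humid, windless day
x = np.array([1, 0, 0, 1.2, 0.9, 0])

z = bias + np.dot(weights, x)   # Step 1: weighted sum
p = 1 / (1 + np.exp(-z))        # Steps 2-3: sigmoid gives the probability of playing
prediction = int(p >= 0.5)      # Step 4: threshold at 0.5
print(f"score z = {z:.2f}, probability = {p:.2f}, prediction = {prediction}")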

The training process for logistic regression involves finding the best weights for the input features. Here's the general outline:

  1. Initialize the weights (typically to small random values).
# Initialize weights (including bias) to 0.1
# (X_train_np, with the bias column added, is built in the next code block)
initial_weights = np.full(X_train_np.shape[1], 0.1)

# Create and display the initial weights
print(f"Initial Weights: {initial_weights}")

2. For each training example:
a. Calculate the predicted probability using the current weights.

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def calculate_probabilities(X, weights):
    z = np.dot(X, weights)
    return sigmoid(z)

def calculate_log_loss(probabilities, y):
    return -y * np.log(probabilities) - (1 - y) * np.log(1 - probabilities)

def create_output_dataframe(X, y, weights):
    probabilities = calculate_probabilities(X, weights)
    log_losses = calculate_log_loss(probabilities, y)

    df = pd.DataFrame({
        'Probability': probabilities,
        'Label': y,
        'Log Loss': log_losses
    })

    return df

def calculate_average_log_loss(X, y, weights):
    probabilities = calculate_probabilities(X, weights)
    log_losses = calculate_log_loss(probabilities, y)
    return np.mean(log_losses)

# Convert X_train and y_train to numpy arrays for easier computation
X_train_np = X_train.to_numpy()
y_train_np = y_train.to_numpy()

# Add a column of 1s to X_train_np for the bias term
X_train_np = np.column_stack((np.ones(X_train_np.shape[0]), X_train_np))

# Create and display a DataFrame of probabilities and log loss under the initial weights
initial_df = create_output_dataframe(X_train_np, y_train_np, initial_weights)
print(initial_df.to_string(index=False, float_format=lambda x: f"{x:.6f}"))
print(f"\nAverage Log Loss: {calculate_average_log_loss(X_train_np, y_train_np, initial_weights):.6f}")

b. Compare this probability to the actual class label by calculating its log loss.

3. Update the weights to minimize the loss, usually with an optimization algorithm such as gradient descent. This means repeating Step 2 and adjusting the weights until the log loss can no longer be reduced.

def gradient_descent_step(X, y, weights, learning_rate):
    m = len(y)
    probabilities = calculate_probabilities(X, weights)
    gradient = np.dot(X.T, (probabilities - y)) / m
    new_weights = weights - learning_rate * gradient  # Create a new array for the updated weights
    return new_weights

# Perform one step of gradient descent (one of the simplest optimization algorithms)
learning_rate = 0.1
updated_weights = gradient_descent_step(X_train_np, y_train_np, initial_weights, learning_rate)

# Print initial and updated weights
print("\nInitial weights:")
for feature, weight in zip(['Bias'] + list(X_train.columns), initial_weights):
    print(f"{feature:11}: {weight:.2f}")

print("\nUpdated weights after one iteration:")
for feature, weight in zip(['Bias'] + list(X_train.columns), updated_weights):
    print(f"{feature:11}: {weight:.2f}")

# With sklearn, you can get the final weights (coefficients)
# and final bias (intercept) easily.
# The result is almost the same as doing it manually above.

from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(penalty=None, solver='saga')
lr_clf.fit(X_train, y_train)

coefficients = lr_clf.coef_
intercept = lr_clf.intercept_

y_train_prob = lr_clf.predict_proba(X_train)[:, 1]
loss = -np.mean(y_train * np.log(y_train_prob) + (1 - y_train) * np.log(1 - y_train_prob))

print(f"Final Weights & Bias: {coefficients[0].round(2)}, {round(intercept[0], 2)}")
print("Final Loss:", loss.round(3))

Once the model is trained:
1. For a new instance, calculate the probability with the final weights (also called coefficients), just as in the training step.

2. Interpret the output by looking at the probability: if p ≥ 0.5, predict class 1; otherwise, predict class 0.

# Calculate prediction probabilities
predicted_probs = lr_clf.predict_proba(X_test)[:, 1]

z_values = np.log(predicted_probs / (1 - predicted_probs))

result_df = pd.DataFrame({
    'ID': X_test.index,
    'Z-Values': z_values.round(3),
    'Probabilities': predicted_probs.round(3)
}).set_index('ID')

print(result_df)

# Make predictions
y_pred = lr_clf.predict(X_test)
print(y_pred)

Evaluation Step

result_df = pd.DataFrame({
    'ID': X_test.index,
    'Label': y_test,
    'Probabilities': predicted_probs.round(2),
    'Prediction': y_pred,
}).set_index('ID')

print(result_df)
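
To put a single number on test-set performance, the predictions can be compared with the true labels; accuracy is used here for simplicity, though other metrics could be substituted.

from sklearn.metrics import accuracy_score

# Fraction of test-set predictions that match the true labels
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")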

Logistic regression has several important parameters that control its behavior:

1. Penalty: The type of regularization to use (‘l1’, ‘l2’, ‘elasticnet’, or ‘none’). Regularization in logistic regression prevents overfitting by adding a penalty term to the model’s loss function, which encourages simpler models.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

regs = [None, 'l1', 'l2']
coeff_dict = {}

for reg in regs:
    lr_clf = LogisticRegression(penalty=reg, solver='saga')
    lr_clf.fit(X_train, y_train)
    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_
    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[reg] = {
        'Coefficients': coefficients,
        'Intercept': intercept,
        'Loss': loss,
        'Accuracy': accuracy
    }

for reg, vals in coeff_dict.items():
    print(f"{reg}: Coeff: {vals['Coefficients'][0].round(2)}, Intercept: {vals['Intercept'].round(2)}, Loss: {vals['Loss'].round(3)}, Accuracy: {round(vals['Accuracy'], 3)}")

2. Regularization Strength (C): Controls the trade-off between fitting the training data and keeping the model simple. A smaller C means stronger regularization.

# List of regularization strengths to try for L1
strengths = [0.001, 0.01, 0.1, 1, 10, 100]

coeff_dict = {}

for strength in strengths:
    lr_clf = LogisticRegression(penalty='l1', C=strength, solver='saga')
    lr_clf.fit(X_train, y_train)

    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_

    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)

    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[f'L1_{strength}'] = {
        'Coefficients': coefficients[0].round(2),
        'Intercept': round(intercept[0], 2),
        'Loss': round(loss, 3),
        'Accuracy': round(accuracy * 100, 2)
    }

print(pd.DataFrame(coeff_dict).T)

# List of regularization strengths to try for L2
strengths = [0.001, 0.01, 0.1, 1, 10, 100]

coeff_dict = {}

for strength in strengths:
    lr_clf = LogisticRegression(penalty='l2', C=strength, solver='saga')
    lr_clf.fit(X_train, y_train)

    coefficients = lr_clf.coef_
    intercept = lr_clf.intercept_

    predicted_probs = lr_clf.predict_proba(X_train)[:, 1]
    loss = -np.mean(y_train * np.log(predicted_probs) + (1 - y_train) * np.log(1 - predicted_probs))
    predictions = lr_clf.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    coeff_dict[f'L2_{strength}'] = {
        'Coefficients': coefficients[0].round(2),
        'Intercept': round(intercept[0], 2),
        'Loss': round(loss, 3),
        'Accuracy': round(accuracy * 100, 2)
    }

print(pd.DataFrame(coeff_dict).T)

3. Solver: The algorithm to use for optimization (‘liblinear’, ‘newton-cg’, ‘lbfgs’, ‘sag’, ‘saga’). Some regularization types require a particular solver.
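
As a quick sketch of how penalty and solver interact (the pairings below follow scikit-learn's constraints: 'saga' accepts every penalty type, while 'lbfgs' only accepts 'l2' or no penalty):

from sklearn.linear_model import LogisticRegression

# 'saga' supports l1, l2, elasticnet, and no penalty; 'lbfgs' supports only l2 or no penalty
lr_l1 = LogisticRegression(penalty='l1', solver='saga', C=1.0)
lr_en = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, C=1.0)
lr_l2 = LogisticRegression(penalty='l2', solver='lbfgs', C=1.0)

for clf in (lr_l1, lr_en, lr_l2):
    clf.fit(X_train, y_train)
    print(clf.penalty, clf.solver, clf.coef_[0].round(2))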

4. Max Iterations: The maximum number of iterations for the solver to converge.

For our golf dataset, we might start with an ‘l2’ penalty, the ‘liblinear’ solver, and C=1.0 as a baseline.
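
A sketch of that baseline is below; max_iter is set generously here so the solver has room to converge, and the exact value is a judgment call rather than something prescribed above.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Baseline: l2 penalty, liblinear solver, C=1.0
baseline_clf = LogisticRegression(penalty='l2', C=1.0, solver='liblinear', max_iter=1000)
baseline_clf.fit(X_train, y_train)

print(f"Baseline accuracy: {accuracy_score(y_test, baseline_clf.predict(X_test)):.3f}")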

Like any machine learning algorithm, logistic regression has its strengths and limitations.

Pros:

  1. Simplicity: Easy to implement and understand.
  2. Interpretability: The weights directly show the importance of each feature.
  3. Efficiency: Doesn't require too much computational power.
  4. Probabilistic Output: Provides probabilities rather than just classifications.

Cons:

  1. Linearity Assumption: Assumes a linear relationship between the features and the log-odds of the outcome.
  2. Feature Independence: Assumes features are not highly correlated.
  3. Limited Complexity: May underfit in cases where the decision boundary is highly non-linear.
  4. Requires More Data: Needs a relatively large sample size for stable results.

In our golf example, logistic regression might provide a clear, interpretable model of how each weather factor influences the decision to play golf. However, it could struggle if the decision involves complex interactions between weather conditions that can't be captured by a linear model.

Logistic regression shines as a powerful yet simple classification tool. It stands out for its ability to handle complex data while remaining easy to interpret. Unlike some other basic models, it provides smooth probability estimates and works well with many features. In the real world, from predicting customer behavior to medical diagnosis, logistic regression often performs surprisingly well. It's not just a stepping stone; it's a reliable model that can match more complex models in many situations.

# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Prepare data: encode categorical variables
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Split data into training and testing sets
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Scale numerical features
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])

# Train the model
lr_clf = LogisticRegression(penalty='l2', C=1, solver='saga')
lr_clf.fit(X_train, y_train)

# Make predictions
y_pred = lr_clf.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")


