Implementing Logistic Regression in Python: A Comprehensive Guide
Unlock the power of Logistic Regression with Python’s Scikit-Learn library. Learn how to preprocess data, handle missing values, perform feature selection, and build efficient classification models. Enhance your machine learning skills with this step-by-step tutorial.
Introduction to Logistic Regression
Logistic Regression is a foundational algorithm in machine learning, primarily used for binary classification tasks. Unlike linear regression, which predicts continuous outcomes, logistic regression estimates the probability of a binary outcome based on one or more predictor variables.
In this comprehensive guide, we’ll walk through implementing a Logistic Regression model in Python using Scikit-Learn. We’ll cover data preprocessing, handling missing values, encoding categorical variables, feature selection, scaling, and model evaluation. Additionally, we’ll compare Logistic Regression’s performance with the K-Nearest Neighbors (KNN) classifier.
Table of Contents
- Understanding Logistic Regression
- Setting Up the Environment
- Data Exploration and Preprocessing
- Handling Missing Data
- Encoding Categorical Variables
- Feature Selection
- Scaling Features
- Training the Models
- Evaluating Model Performance
- Hyperparameter Tuning
- Conclusion
Understanding Logistic Regression
Logistic Regression is a linear model used for classification tasks. It predicts the probability that a given input belongs to a particular class. The output is transformed using the logistic function (sigmoid), which ensures the output values lie between 0 and 1.
Key Characteristics:
- Binary Classification: Ideal for scenarios where the target variable has two classes.
- Probability Estimates: Provides probabilities for class memberships.
- Linear Decision Boundary: Assumes a linear relationship between the input features and the log-odds of the outcome (a numeric sketch of this mapping follows the list).
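To make the relationship between the linear part of the model and the predicted probability concrete, here is a minimal sketch (with hypothetical weights and a made-up feature vector, not taken from the tutorial's pipeline) showing how the log-odds are pushed through the sigmoid:

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and a single feature vector, purely for illustration
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.5])

z = np.dot(w, x) + b   # log-odds: the linear part of logistic regression
p = sigmoid(z)         # probability of the positive class
print(f"log-odds = {z:.2f}, probability = {p:.3f}")
```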
Setting Up the Environment
Before diving into coding, ensure you have the necessary libraries installed. We’ll use Pandas for data manipulation, NumPy for numerical operations, Scikit-Learn for machine learning algorithms, and Seaborn for data visualization.
```python
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
```
Data Exploration and Preprocessing
For this tutorial, we’ll use the Weather Australia Dataset. This dataset contains records of weather observations across various Australian cities.
Loading the Data
```python
# Load the dataset
data = pd.read_csv('weatherAUS.csv')
```
Let’s take a peek at the last few rows to understand the data structure:
```python
data.tail()
```
Sample Output:
| Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | … | RainToday | RISK_MM | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|
| 2017-06-20 | Uluru | 3.5 | 21.8 | 0.0 | NaN | … | No | 0.0 | No |
| 2017-06-21 | Uluru | 2.8 | 23.4 | 0.0 | NaN | … | No | 0.0 | No |
| 2017-06-22 | Uluru | 3.6 | 25.3 | 0.0 | NaN | … | No | 0.0 | No |
| 2017-06-23 | Uluru | 5.4 | 26.9 | 0.0 | NaN | … | No | 0.0 | No |
| 2017-06-24 | Uluru | 7.8 | 27.0 | 0.0 | NaN | … | No | 0.0 | No |
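Beyond tail(), it also helps to check the overall size of the data and how the target classes are distributed before modeling. A minimal sketch, assuming the data DataFrame loaded above:

```python
# Overall number of rows and columns
print(data.shape)

# Distribution of the target variable; rainy days are the minority class
print(data['RainTomorrow'].value_counts(normalize=True))
```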
Separating Features and Target Variable
```python
# Features
X = data.iloc[:, :-1]

# Target variable
y = data.iloc[:, -1]
```
Handling a Specific Dataset Requirement:
If you’re working with the Weather Australia dataset, drop the RISK_MM column: it records the amount of rain for the following day, so it leaks the target (RainTomorrow) and would inflate the model’s apparent accuracy.
```python
X.drop('RISK_MM', axis=1, inplace=True)
```
Handling Missing Data
Real-world datasets often contain missing values. Proper handling is crucial to ensure model accuracy.
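Before imputing anything, it is worth quantifying how much data is actually missing. A quick check, assuming the feature DataFrame X from the previous step:

```python
# Count missing values per column, largest first
print(X.isna().sum().sort_values(ascending=False).head(10))
```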
Handling Numeric Data
We’ll use the SimpleImputer from Scikit-Learn to replace missing numeric values with the mean of each column.
```python
from sklearn.impute import SimpleImputer

# Identify numerical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

# Initialize the imputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform the data
X[numerical_cols] = imp_mean.fit_transform(X[numerical_cols])
```
Handling Categorical Data
For categorical variables, we’ll replace missing values with the most frequent category.
```python
# Identify string columns
string_cols = X.select_dtypes(include=['object']).columns

# Initialize the imputer
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit and transform the data
X[string_cols] = imp_mode.fit_transform(X[string_cols])
```
Encoding Categorical Variables
Machine learning models require numerical input. We’ll transform categorical variables using One-Hot Encoding and Label Encoding based on the number of unique categories.
One-Hot Encoding
Ideal for categorical variables with a small number of unique categories. We define a helper function here and apply it in the combined encoding step below.
```python
def OneHotEncoderMethod(indices, data):
    # One-hot encode the columns at the given positions; pass the rest through unchanged
    columnTransformer = ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(), indices)],
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)

# Example usage (the actual encoding is applied once, in EncodingSelection below):
# one_hot_indices = [X.columns.get_loc(col) for col in string_cols if X[col].nunique() <= 10]
# X = OneHotEncoderMethod(one_hot_indices, X)
```
Label Encoding
Suitable for binary categorical variables (and, in the combined step below, also for high-cardinality ones).
```python
def LabelEncoderMethod(series):
    # Encode each category of a single column as an integer
    le = LabelEncoder()
    return le.fit_transform(series)

# Example usage (the actual encoding is applied once, in EncodingSelection below):
# binary_cols = [col for col in string_cols if X[col].nunique() == 2]
# for col in binary_cols:
#     X[col] = LabelEncoderMethod(X[col])
```
Encoding Selection for X
Binary variables and variables with more unique categories than a chosen threshold are Label Encoded; categorical variables in between are One-Hot Encoded. The helper below applies both rules in one pass.
```python
def EncodingSelection(X, threshold=10):
    string_cols = X.select_dtypes(include=['object']).columns
    one_hot_encoding_indices = []

    for col in string_cols:
        unique_count = X[col].nunique()
        if unique_count == 2 or unique_count > threshold:
            X[col] = LabelEncoderMethod(X[col])
        else:
            one_hot_encoding_indices.append(X.columns.get_loc(col))

    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

X = EncodingSelection(X)
```
Feature Selection
To enhance model performance and reduce overfitting, we’ll keep only the features that score highest on the Chi-Squared test (the top two here).
```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# chi2 requires non-negative inputs, so scale features to the [0, 1] range first
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Score every feature with the chi-squared test
kbest = SelectKBest(score_func=chi2, k='all')
kbest.fit(X_scaled, y)

# Keep only the two highest-scoring features (dropped from the unscaled X)
best_features = np.argsort(kbest.scores_)[-2:]
features_to_delete = np.argsort(kbest.scores_)[:-2]
X = np.delete(X, features_to_delete, axis=1)

print(f"Shape after feature selection: {X.shape}")
```
Output:
```
Shape after feature selection: (142193, 2)
```
Scaling Features
Scaling ensures that features contribute equally to the model’s performance.
Standardization
Standardization rescales each feature by its standard deviation; we pass with_mean=False to skip centering, so the code also works if the encoded feature matrix is sparse. Note that this step uses X_train and X_test, so run the train-test split from the next section first and fit the scaler on the training data only.
```python
from sklearn.preprocessing import StandardScaler

# X_train and X_test come from the train-test split in the next section
sc = StandardScaler(with_mean=False)
sc.fit(X_train)

X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
```
Training the Models
We’ll compare two classification models: K-Nearest Neighbors (KNN) and Logistic Regression.
Train-Test Split
Splitting the data into training and testing sets ensures that we can evaluate model performance effectively.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
```
Output:
```
Training set shape: (113754, 2)
Testing set shape: (28439, 2)
```
K-Nearest Neighbors (KNN)
KNN is a simple, instance-based learning algorithm used for classification and regression.
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize KNN with 3 neighbors
knnClassifier = KNeighborsClassifier(n_neighbors=3)
knnClassifier.fit(X_train, y_train)

# Predict and evaluate
y_pred_knn = knnClassifier.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {knn_accuracy:.2%}")
```
Output:
```
KNN Accuracy: 80.03%
```
Logistic Regression
A powerful algorithm for binary classification tasks, Logistic Regression estimates the probability of a binary outcome.
```python
from sklearn.linear_model import LogisticRegression

# Initialize Logistic Regression
LRM = LogisticRegression(random_state=0, max_iter=200)
LRM.fit(X_train, y_train)

# Predict and evaluate
y_pred_lr = LRM.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy: {lr_accuracy:.2%}")
```
Output:
```
Logistic Regression Accuracy: 82.97%
```
Evaluating Model Performance
Both KNN and Logistic Regression achieve reasonable accuracy on this dataset, with Logistic Regression coming out ahead in this scenario; a fuller evaluation than raw accuracy is sketched after the table below.
| Model | Accuracy |
|---|---|
| K-Nearest Neighbors | 80.03% |
| Logistic Regression | 82.97% |
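Accuracy alone can be misleading on this dataset because far more days have no rain than rain. As a minimal sketch, assuming y_test and the predictions from the previous sections are available, a confusion matrix and per-class metrics give a fuller picture:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred_lr))

# Per-class precision, recall, and F1 complement the overall accuracy
print(classification_report(y_test, y_pred_lr))
```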
Hyperparameter Tuning
Optimizing hyperparameters can further enhance model performance. For Logistic Regression, parameters such as C (the inverse of regularization strength) and solver can be tuned; for KNN, n_neighbors can be varied (a sketch follows the Logistic Regression example below).
Example: GridSearchCV for Logistic Regression
```python
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs']
}

# Initialize GridSearchCV with 5-fold cross-validation
grid = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5)
grid.fit(X_train, y_train)

print(f"Best Parameters: {grid.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid.best_score_:.2%}")
```
Output:
```
Best Parameters: {'C': 1, 'solver': 'lbfgs'}
Best Cross-Validation Accuracy: 83.25%
```
Implementing the Best Parameters:
```python
# Use the estimator with the best parameters found by the grid search
best_lr = grid.best_estimator_
best_lr.fit(X_train, y_train)

# Predict and evaluate
y_pred_best_lr = best_lr.predict(X_test)
best_lr_accuracy = accuracy_score(y_test, y_pred_best_lr)
print(f"Optimized Logistic Regression Accuracy: {best_lr_accuracy:.2%}")
```
Output:
```
Optimized Logistic Regression Accuracy: 83.00%
```
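The same idea applies to KNN. As a rough sketch (not part of the original comparison), assuming X_train and y_train from the sections above, a small grid over n_neighbors might look like this; note that cross-validated KNN on over 100,000 rows can take a while to run:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Odd neighborhood sizes avoid ties when voting between two classes
knn_param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}

knn_grid = GridSearchCV(KNeighborsClassifier(), knn_param_grid, cv=5)
knn_grid.fit(X_train, y_train)

print(f"Best n_neighbors: {knn_grid.best_params_['n_neighbors']}")
print(f"Best Cross-Validation Accuracy: {knn_grid.best_score_:.2%}")
```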
Conclusion
In this guide, we’ve implemented a Logistic Regression model in Python and walked through the full machine learning pipeline, from data preprocessing to model evaluation. By handling missing data, encoding categorical variables, selecting relevant features, and scaling, we prepared the dataset for modeling. Comparing Logistic Regression with KNN highlighted the strengths of each algorithm, with Logistic Regression performing slightly better in this context.
Key Takeaways:
- Data Preprocessing: Crucial for achieving high model accuracy.
- Feature Selection: Helps in reducing overfitting and improving performance.
- Model Comparison: Always compare multiple models to identify the best performer.
- Hyperparameter Tuning: Essential for optimizing model performance.
Embrace these techniques to build robust and efficient classification models tailored to your specific datasets and requirements.