Implementing Logistic Regression in Python: A Comprehensive Guide
Unlock the power of Logistic Regression with Python’s Scikit-Learn library. Learn how to preprocess data, handle missing values, perform feature selection, and build efficient classification models. Enhance your machine learning skills with this step-by-step tutorial.
Introduction to Logistic Regression
Logistic Regression is a foundational algorithm in machine learning, primarily used for binary classification tasks. Unlike linear regression, which predicts continuous outcomes, logistic regression estimates the probability of a binary outcome based on one or more predictor variables.
In this comprehensive guide, we’ll walk through implementing a Logistic Regression model in Python using Scikit-Learn. We’ll cover data preprocessing, handling missing values, encoding categorical variables, feature selection, scaling, and model evaluation. Additionally, we’ll compare Logistic Regression’s performance with the K-Nearest Neighbors (KNN) classifier.
Table of Contents
- Understanding Logistic Regression
- Setting Up the Environment
- Data Exploration and Preprocessing
- Handling Missing Data
- Encoding Categorical Variables
- Feature Selection
- Scaling Features
- Training the Models
- Evaluating Model Performance
- Hyperparameter Tuning
- Conclusion
Understanding Logistic Regression
Logistic Regression is a linear model used for classification tasks. It predicts the probability that a given input belongs to a particular class. The output is transformed using the logistic function (sigmoid), which ensures the output values lie between 0 and 1.
Key Characteristics:
- Binary Classification: Ideal for scenarios where the target variable has two classes.
- Probability Estimates: Provides probabilities for class memberships.
- Linear Decision Boundary: Assumes a linear relationship between the input features and the log-odds of the outcome (a numeric sketch of this mapping follows the list).
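To make the relationship between the linear part of the model and the predicted probability concrete, here is a minimal sketch (with hypothetical weights and a made-up feature vector, not taken from the tutorial's pipeline) showing how the log-odds are pushed through the sigmoid:

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued score to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and a single feature vector, purely for illustration
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([2.0, 1.5])

z = np.dot(w, x) + b   # log-odds: the linear part of logistic regression
p = sigmoid(z)         # probability of the positive class
print(f"log-odds = {z:.2f}, probability = {p:.3f}")
```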
Setting Up the Environment
Before diving into coding, ensure you have the necessary libraries installed. We’ll use Pandas for data manipulation, NumPy for numerical operations, Scikit-Learn for machine learning algorithms, and Seaborn for data visualization.
```python
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
```
Data Exploration and Preprocessing
For this tutorial, we’ll use the Weather Australia Dataset. This dataset contains records of weather observations across various Australian cities.
Loading the Data
```python
# Load the dataset
data = pd.read_csv('weatherAUS.csv')
```
Let’s take a peek at the last few rows to understand the data structure:
```python
data.tail()
```
Sample Output:
| Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | … | RainToday | RISK_MM | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|
| 2017-06-20 | Uluru | 3.5 | 21.8 | 0.0 | NaN | … | No | 0.0 | No |
| 2017-06-21 | Uluru | 2.8 | 23.4 | 0.0 | NaN | … | No | 0.0 | No |
| 2017-06-22 | Uluru | 3.6 | 25.3 | 0.0 | NaN | … | No | 0.0 | No |
| 2017-06-23 | Uluru | 5.4 | 26.9 | 0.0 | NaN | … | No | 0.0 | No |
| 2017-06-24 | Uluru | 7.8 | 27.0 | 0.0 | NaN | … | No | 0.0 | No |
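Beyond tail(), it also helps to check the overall size of the data and how the target classes are distributed before modeling. A minimal sketch, assuming the data DataFrame loaded above:

```python
# Overall number of rows and columns
print(data.shape)

# Distribution of the target variable; rainy days are the minority class
print(data['RainTomorrow'].value_counts(normalize=True))
```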
Separating Features and Target Variable
```python
# Features
X = data.iloc[:, :-1]

# Target variable
y = data.iloc[:, -1]
```
Handling a Specific Dataset Requirement:
If you’re working with the Weather Australia dataset, drop the RISK_MM column: it records the amount of rain for the following day, so it leaks the target (RainTomorrow) and would inflate the model’s apparent accuracy.
```python
X.drop('RISK_MM', axis=1, inplace=True)
```
Handling Missing Data
Real-world datasets often contain missing values. Proper handling is crucial to ensure model accuracy.
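Before imputing anything, it is worth quantifying how much data is actually missing. A quick check, assuming the feature DataFrame X from the previous step:

```python
# Count missing values per column, largest first
print(X.isna().sum().sort_values(ascending=False).head(10))
```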
Handling Numeric Data
We’ll use the SimpleImputer from Scikit-Learn to replace missing numeric values with the mean of each column.
```python
from sklearn.impute import SimpleImputer

# Identify numerical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

# Initialize the imputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform the data
X[numerical_cols] = imp_mean.fit_transform(X[numerical_cols])
```
Handling Categorical Data
For categorical variables, we’ll replace missing values with the most frequent category.
```python
# Identify string columns
string_cols = X.select_dtypes(include=['object']).columns

# Initialize the imputer
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit and transform the data
X[string_cols] = imp_mode.fit_transform(X[string_cols])
```
Encoding Categorical Variables
Machine learning models require numerical input. We’ll transform categorical variables using One-Hot Encoding and Label Encoding based on the number of unique categories.
One-Hot Encoding
Ideal for categorical variables with a small number of unique categories. We define a helper function here and apply it in the combined encoding step below.
```python
def OneHotEncoderMethod(indices, data):
    # One-hot encode the columns at the given positions; pass the rest through unchanged
    columnTransformer = ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(), indices)],
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)

# Example usage (the actual encoding is applied once, in EncodingSelection below):
# one_hot_indices = [X.columns.get_loc(col) for col in string_cols if X[col].nunique() <= 10]
# X = OneHotEncoderMethod(one_hot_indices, X)
```
Label Encoding
Suitable for binary categorical variables (and, in the combined step below, also for high-cardinality ones).
```python
def LabelEncoderMethod(series):
    # Encode each category of a single column as an integer
    le = LabelEncoder()
    return le.fit_transform(series)

# Example usage (the actual encoding is applied once, in EncodingSelection below):
# binary_cols = [col for col in string_cols if X[col].nunique() == 2]
# for col in binary_cols:
#     X[col] = LabelEncoderMethod(X[col])
```
Encoding Selection for X
Binary variables and variables with more unique categories than a chosen threshold are Label Encoded; categorical variables in between are One-Hot Encoded. The helper below applies both rules in one pass.
```python
def EncodingSelection(X, threshold=10):
    string_cols = X.select_dtypes(include=['object']).columns
    one_hot_encoding_indices = []

    for col in string_cols:
        unique_count = X[col].nunique()
        if unique_count == 2 or unique_count > threshold:
            X[col] = LabelEncoderMethod(X[col])
        else:
            one_hot_encoding_indices.append(X.columns.get_loc(col))

    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

X = EncodingSelection(X)
```
Feature Selection
To enhance model performance and reduce overfitting, we’ll keep only the features that score highest on the Chi-Squared test (the top two here).
```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# chi2 requires non-negative inputs, so scale features to the [0, 1] range first
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Score every feature with the chi-squared test
kbest = SelectKBest(score_func=chi2, k='all')
kbest.fit(X_scaled, y)

# Keep only the two highest-scoring features (dropped from the unscaled X)
best_features = np.argsort(kbest.scores_)[-2:]
features_to_delete = np.argsort(kbest.scores_)[:-2]
X = np.delete(X, features_to_delete, axis=1)

print(f"Shape after feature selection: {X.shape}")
```
Output:
```
Shape after feature selection: (142193, 2)
```
Scaling Features
Scaling ensures that features contribute equally to the model’s performance.
Standardization
Standardization rescales each feature by its standard deviation; we pass with_mean=False to skip centering, so the code also works if the encoded feature matrix is sparse. Note that this step uses X_train and X_test, so run the train-test split from the next section first and fit the scaler on the training data only.
```python
from sklearn.preprocessing import StandardScaler

# X_train and X_test come from the train-test split in the next section
sc = StandardScaler(with_mean=False)
sc.fit(X_train)

X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
```
Training the Models
We’ll compare two classification models: K-Nearest Neighbors (KNN) and Logistic Regression.
Train-Test Split
Splitting the data into training and testing sets ensures that we can evaluate model performance effectively.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
```
Output:
```
Training set shape: (113754, 2)
Testing set shape: (28439, 2)
```
K-Nearest Neighbors (KNN)
KNN is a simple, instance-based learning algorithm used for classification and regression.
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize KNN with 3 neighbors
knnClassifier = KNeighborsClassifier(n_neighbors=3)
knnClassifier.fit(X_train, y_train)

# Predict and evaluate
y_pred_knn = knnClassifier.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {knn_accuracy:.2%}")
```
Output:
```
KNN Accuracy: 80.03%
```
Logistic Regression
A powerful algorithm for binary classification tasks, Logistic Regression estimates the probability of a binary outcome.
```python
from sklearn.linear_model import LogisticRegression

# Initialize Logistic Regression
LRM = LogisticRegression(random_state=0, max_iter=200)
LRM.fit(X_train, y_train)

# Predict and evaluate
y_pred_lr = LRM.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy: {lr_accuracy:.2%}")
```
Output:
```
Logistic Regression Accuracy: 82.97%
```
Evaluating Model Performance
Both KNN and Logistic Regression achieve reasonable accuracy on this dataset, with Logistic Regression coming out ahead in this scenario; a fuller evaluation than raw accuracy is sketched after the table below.
| Model | Accuracy |
|---|---|
| K-Nearest Neighbors | 80.03% |
| Logistic Regression | 82.97% |
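Accuracy alone can be misleading on this dataset because far more days have no rain than rain. As a minimal sketch, assuming y_test and the predictions from the previous sections are available, a confusion matrix and per-class metrics give a fuller picture:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred_lr))

# Per-class precision, recall, and F1 complement the overall accuracy
print(classification_report(y_test, y_pred_lr))
```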
Hyperparameter Tuning
Optimizing hyperparameters can further enhance model performance. For Logistic Regression, parameters such as C (the inverse of regularization strength) and solver can be tuned; for KNN, n_neighbors can be varied (a sketch follows the Logistic Regression example below).
Example: GridSearchCV for Logistic Regression
```python
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs']
}

# Initialize GridSearchCV with 5-fold cross-validation
grid = GridSearchCV(LogisticRegression(max_iter=200), param_grid, cv=5)
grid.fit(X_train, y_train)

print(f"Best Parameters: {grid.best_params_}")
print(f"Best Cross-Validation Accuracy: {grid.best_score_:.2%}")
```
Output:
```
Best Parameters: {'C': 1, 'solver': 'lbfgs'}
Best Cross-Validation Accuracy: 83.25%
```
Implementing the Best Parameters:
```python
# Use the estimator with the best parameters found by the grid search
best_lr = grid.best_estimator_
best_lr.fit(X_train, y_train)

# Predict and evaluate
y_pred_best_lr = best_lr.predict(X_test)
best_lr_accuracy = accuracy_score(y_test, y_pred_best_lr)
print(f"Optimized Logistic Regression Accuracy: {best_lr_accuracy:.2%}")
```
Output:
```
Optimized Logistic Regression Accuracy: 83.00%
```
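The same idea applies to KNN. As a rough sketch (not part of the original comparison), assuming X_train and y_train from the sections above, a small grid over n_neighbors might look like this; note that cross-validated KNN on over 100,000 rows can take a while to run:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Odd neighborhood sizes avoid ties when voting between two classes
knn_param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}

knn_grid = GridSearchCV(KNeighborsClassifier(), knn_param_grid, cv=5)
knn_grid.fit(X_train, y_train)

print(f"Best n_neighbors: {knn_grid.best_params_['n_neighbors']}")
print(f"Best Cross-Validation Accuracy: {knn_grid.best_score_:.2%}")
```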
Conclusion
In this guide, we’ve implemented a Logistic Regression model in Python and walked through the full machine learning pipeline, from data preprocessing to model evaluation. By handling missing data, encoding categorical variables, selecting relevant features, and scaling, we prepared the dataset for modeling. Comparing Logistic Regression with KNN highlighted the strengths of each algorithm, with Logistic Regression performing slightly better in this context.
Key Takeaways:
- Data Preprocessing: Crucial for achieving high model accuracy.
- Feature Selection: Helps in reducing overfitting and improving performance.
- Model Comparison: Always compare multiple models to identify the best performer.
- Hyperparameter Tuning: Essential for optimizing model performance.
Embrace these techniques to build robust and efficient classification models tailored to your specific datasets and requirements.