Mastering Classification Models: A Comprehensive Python Template for Data Science
Table of Contents
- Introduction to Classification Models
- Setting Up Your Environment
- Data Import and Exploration
- Handling Missing Data
- Encoding Categorical Variables
- Feature Selection
- Train-Test Split
- Feature Scaling
- Building and Evaluating Models
- Conclusion
1. Introduction to Classification Models
Classification models are a cornerstone of supervised machine learning, enabling the prediction of discrete labels based on input features. These models are instrumental in various applications, from email spam detection to medical diagnosis. Mastering these models involves understanding data preprocessing, feature engineering, model selection, and evaluation metrics.
2. Setting Up Your Environment
Before diving into model building, ensure that your Python environment is equipped with the necessary libraries. Here’s how you can set up your environment:
```python
# Install necessary libraries
!pip install pandas seaborn scikit-learn xgboost
```
Import the essential libraries:
```python
import pandas as pd
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
import xgboost as xgb
```
3. Data Import and Exploration
For this tutorial, we’ll use the Weather Australia Dataset from Kaggle. It offers a wide range of weather-related features and a binary target, RainTomorrow, which makes it well suited to building classification models.
```python
# Import data
data = pd.read_csv('weatherAUS.csv')  # Ensure the CSV file is in your working directory
print(data.tail())
```
Sample Output:
```
              Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  WindGustSpeed WindDir9am  ...  Humidity3pm  Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  RISK_MM  RainTomorrow
142188  2017-06-20    Uluru      3.5     21.8       0.0          NaN       NaN           E           31.0        ESE  ...         27.0       1024.7       1021.2       NaN       NaN      9.4     20.9         No      0.0            No
142189  2017-06-21    Uluru      2.8     23.4       0.0          NaN       NaN           E           31.0         SE  ...         24.0       1024.6       1020.3       NaN       NaN     10.1     22.4         No      0.0            No
142190  2017-06-22    Uluru      3.6     25.3       0.0          NaN       NaN         NNW           22.0         SE  ...         21.0       1023.5       1019.1       NaN       NaN     10.9     24.5         No      0.0            No
142191  2017-06-23    Uluru      5.4     26.9       0.0          NaN       NaN           N           37.0         SE  ...         24.0       1021.0       1016.8       NaN       NaN     12.5     26.1         No      0.0            No
142192  2017-06-24    Uluru      7.8     27.0       0.0          NaN       NaN          SE           28.0        SSE  ...         24.0       1019.4       1016.5       3.0       2.0     15.1     26.0         No      0.0            No
```
4. Handling Missing Data
Data integrity is crucial for building reliable models. Let’s address missing values in both numeric and categorical features.
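Before imputing anything, it helps to quantify how much data is actually missing. A minimal check with pandas, using the `data` DataFrame loaded above:

```python
# Count missing values per column, worst offenders first
missing_counts = data.isnull().sum().sort_values(ascending=False)
print(missing_counts.head(10))

# Same information as a percentage of all rows
print((data.isnull().mean() * 100).round(2).head(10))
```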
Handling Missing Numeric Data
Use the SimpleImputer from Scikit-learn to fill missing numeric values with the mean of each column.
```python
from sklearn.impute import SimpleImputer

# Separate features and target
X = data.iloc[:, :-1]  # All columns except the last one
y = data.iloc[:, -1]   # Target column

# Identify numeric columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

# Impute missing numeric values with the column mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X[numerical_cols] = imp_mean.fit_transform(X[numerical_cols])
```
Handling Missing Categorical Data
For categorical variables, impute missing values with the most frequent (mode) value.
```python
# Identify categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns

# Impute missing categorical values with the most frequent value
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X[categorical_cols] = imp_freq.fit_transform(X[categorical_cols])
```
5. Encoding Categorical Variables
Machine learning models require numerical input, so the categorical variables need to be encoded. We’ll use label encoding for binary and high-cardinality columns, and one-hot encoding for the remaining multi-class columns.
Label Encoding
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)  # Encoding the target variable
```
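To double-check what the encoder did, you can inspect its learned classes: `le.classes_` holds the original labels in the order of their integer codes, and `inverse_transform` maps codes back to labels. A quick sketch:

```python
# Original labels, indexed by their encoded integer values
print(le.classes_)  # e.g. array(['No', 'Yes'], dtype=object)

# Map encoded values back to the original labels when reporting results
print(le.inverse_transform([0, 1]))
```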
One-Hot Encoding
First, implement a helper that one-hot encodes a given list of columns using a ColumnTransformer.
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode(columns, data):
    ct = ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(), columns)],
        remainder='passthrough'
    )
    return ct.fit_transform(data)

# Example usage:
# X = one_hot_encode(['WindGustDir', 'WindDir9am'], X)
```
Then automate the choice of encoding based on the number of unique categories in each column.
```python
def encoding_selection(X, threshold=10):
    # Identify string (categorical) columns
    string_cols = X.select_dtypes(include=['object']).columns
    one_hot_encoding_cols = []

    for col in string_cols:
        unique_count = X[col].nunique()
        # Label-encode binary and high-cardinality columns; one-hot encode
        # the rest to avoid creating a huge number of dummy columns
        if unique_count == 2 or unique_count > threshold:
            X[col] = le.fit_transform(X[col])
        else:
            one_hot_encoding_cols.append(col)

    if one_hot_encoding_cols:
        # Note: one_hot_encode returns a NumPy array, so column names are lost
        X = one_hot_encode(one_hot_encoding_cols, X)

    return X

X = encoding_selection(X)
```
6. Feature Selection
Reducing the number of features can enhance model performance and reduce computational cost. We’ll use SelectKBest with the chi-squared test to select the top features. Because the chi-squared test only accepts non-negative values, we first scale the features to the [0, 1] range with MinMaxScaler.
```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Scale features to [0, 1]; chi2 requires non-negative values
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Select the top K features
k = 10  # Adjust based on your requirements
selector = SelectKBest(score_func=chi2, k=k)
X_selected = selector.fit_transform(X_scaled, y)

# Get the indices and names of the selected features
selected_indices = selector.get_support(indices=True)
selected_features = X.columns[selected_indices]
print("Selected Features:", selected_features)
```
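If you want to see why those features were chosen, the fitted selector exposes each feature’s chi-squared score via `scores_`. A small sketch that ranks them, assuming `X` is still a DataFrame at this point:

```python
# Rank all features by their chi-squared score against the target
feature_scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(feature_scores.head(k))
```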
7. Train-Test Split
Splitting the dataset into training and testing sets is essential to evaluate the model’s performance on unseen data.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.20, random_state=1
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
```
Output:
```
Training set shape: (113754, 10)
Test set shape: (28439, 10)
```
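The split above is purely random. If the classes are imbalanced (in this dataset, rainy days are a minority), a stratified split keeps the class ratio consistent across the two sets. This is an optional variant, not what produced the shapes shown above:

```python
# Stratified variant: preserve the proportion of each class in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.20, random_state=1, stratify=y
)
```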
8. Feature Scaling
Standardizing features keeps variables with large numeric ranges from dominating distance-based algorithms such as KNN and SVM. Note that the scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into the transformation.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit on the training set only
X_test = scaler.transform(X_test)        # Reuse the same parameters for the test set

print("Scaled Training set shape:", X_train.shape)
print("Scaled Test set shape:", X_test.shape)
```
Output:
```
Scaled Training set shape: (113754, 10)
Scaled Test set shape: (28439, 10)
```
9. Building and Evaluating Models
With the data preprocessed, we can now build and evaluate various classification models. We’ll assess models based on their accuracy scores.
K-Nearest Neighbors (KNN)
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("KNN Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
KNN Accuracy: 1.0
```
Logistic Regression
```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=0, max_iter=200)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
Logistic Regression Accuracy: 0.99996
```
Gaussian Naive Bayes
```python
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("GaussianNB Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
GaussianNB Accuracy: 0.97437
```
Support Vector Machine (SVM)
```python
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print("SVM Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
SVM Accuracy: 0.99996
```
Decision Tree Classifier
```python
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
Decision Tree Accuracy: 1.0
```
Random Forest Classifier
```python
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=500, max_depth=5)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
Random Forest Accuracy: 1.0
```
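A fitted random forest also reports how much each feature contributed to its splits. Pairing `feature_importances_` with the names saved during feature selection gives a quick sanity check; this sketch assumes `selected_features` from Section 6 is still in scope:

```python
# Rank the selected features by their importance in the fitted forest
importances = pd.Series(rfc.feature_importances_, index=selected_features).sort_values(ascending=False)
print(importances)
```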
AdaBoost Classifier
```python
from sklearn.ensemble import AdaBoostClassifier

abc = AdaBoostClassifier()
abc.fit(X_train, y_train)
y_pred = abc.predict(X_test)
print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
AdaBoost Accuracy: 1.0
```
XGBoost Classifier
```python
import xgboost as xgb

xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
XGBoost Accuracy: 1.0
```
Note: The warning regarding the evaluation metric in XGBoost can be suppressed by explicitly setting the eval_metric parameter, as shown above.
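To compare everything at a glance, you can loop over the fitted estimators and collect their test accuracies in a single table. A convenience sketch that reuses the models trained above:

```python
# Collect test accuracy for every model trained above
models = {
    'KNN': knn,
    'Logistic Regression': log_reg,
    'GaussianNB': gnb,
    'SVM': svm,
    'Decision Tree': dtc,
    'Random Forest': rfc,
    'AdaBoost': abc,
    'XGBoost': xgb_model,
}

results = {name: accuracy_score(y_test, model.predict(X_test)) for name, model in models.items()}
print(pd.Series(results).sort_values(ascending=False))
```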
10. Conclusion
Building classification models doesn’t have to be daunting. With a structured approach to data preprocessing, encoding, feature selection, and model evaluation, you can efficiently develop robust models tailored to your specific needs. The master template illustrated in this article serves as a comprehensive guide, streamlining the workflow from data ingestion to model evaluation. Whether you’re a beginner or an experienced data scientist, leveraging such templates can enhance productivity and model performance.
Key Takeaways:
- Data Preprocessing: Clean and prepare your data meticulously to ensure model accuracy.
- Encoding Techniques: Appropriately encode categorical variables to suit different algorithms.
- Feature Selection: Utilize feature selection methods to enhance model efficiency and performance.
- Model Diversity: Experiment with various models to identify the best performer for your dataset.
- Evaluation Metrics: Go beyond accuracy; consider other metrics like precision, recall, and F1-score for a holistic evaluation (see the sketch after this list).
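As a starting point for that broader evaluation, scikit-learn’s `classification_report` and `confusion_matrix` summarize precision, recall, and F1-score per class. A minimal sketch, shown here for the random forest predictions but applicable to any of the models above:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1-score for one of the fitted models
y_pred = rfc.predict(X_test)
print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))
```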
Embrace these practices, and empower your data science projects with clarity and precision!