Comprehensive Guide to Data Preprocessing for Classification Problems in Machine Learning
Table of Contents
- Introduction to Classification Problems
- Data Import and Overview
- Handling Missing Data
- Encoding Categorical Variables
- Feature Selection
- Train-Test Split
- Feature Scaling
- Conclusion
Introduction to Classification Problems
Classification is a supervised learning technique used to predict categorical labels. It involves assigning input data into predefined categories based on historical data. Classification models range from simple algorithms like Logistic Regression to more complex ones like Random Forests and Neural Networks. The success of these models hinges not just on the algorithm chosen but significantly on how the data is prepared and preprocessed.
Data Import and Overview
Before diving into preprocessing, it’s essential to understand and import the dataset. For this guide, we’ll use the WeatherAUS dataset from Kaggle, which contains daily weather observations across Australia.
# Importing necessary libraries
import pandas as pd
import seaborn as sns

# Loading the dataset
data = pd.read_csv('weatherAUS.csv')

# Displaying the last five rows of the dataset
data.tail()
Output:
              Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  WindGustSpeed WindDir9am  ...  Humidity3pm  Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  RISK_MM  RainTomorrow
142188  2017-06-20    Uluru      3.5     21.8       0.0          NaN       NaN           E           31.0        ESE  ...         27.0       1024.7       1021.2       NaN       NaN      9.4     20.9         No      0.0            No
142189  2017-06-21    Uluru      2.8     23.4       0.0          NaN       NaN           E           31.0         SE  ...         24.0       1024.6       1020.3       NaN       NaN     10.1     22.4         No      0.0            No
142190  2017-06-22    Uluru      3.6     25.3       0.0          NaN       NaN         NNW           22.0         SE  ...         21.0       1023.5       1019.1       NaN       NaN     10.9     24.5         No      0.0            No
142191  2017-06-23    Uluru      5.4     26.9       0.0          NaN       NaN           N           37.0         SE  ...         24.0       1021.0       1016.8       NaN       NaN     12.5     26.1         No      0.0            No
142192  2017-06-24    Uluru      7.8     27.0       0.0          NaN       NaN          SE           28.0        SSE  ...         24.0       1019.4       1016.5       3.0       2.0     15.1     26.0         No      0.0            No

[5 rows x 24 columns]
The dataset comprises various features like temperature, rainfall, humidity, wind speed, and more, which are vital for predicting whether it will rain tomorrow (RainTomorrow).
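Before treating the data, it helps to see how large the dataset is and how much of it is missing. The quick overview below is not part of the original listing; it simply prints the shape and the ten columns with the most missing values.
# Dataset dimensions and a quick look at how much data is missing per column
print(data.shape)
print(data.isnull().sum().sort_values(ascending=False).head(10))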
Handling Missing Data
Real-world datasets often come with missing or incomplete data. Handling these gaps is crucial to ensure the reliability of the model. We’ll approach missing data in two categories: Numeric and Categorical.
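The snippets that follow operate on a feature matrix X and a target vector y, which the original code assumes have already been created. A minimal sketch of that step, inferred from the 23-column shape printed later in this guide, could be:
# Separating the target from the features (inferred step; the original listing does not show it).
# Dropping only RainTomorrow leaves 23 feature columns, matching the shapes printed later.
X = data.drop(['RainTomorrow'], axis=1)
y = data['RainTomorrow']
Note that RISK_MM records the amount of next-day rain and therefore leaks the target; dropping it as well is common practice, though doing so changes the column counts shown below.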
A. Numeric Data
For numerical features, a common strategy is to replace missing values with statistical measures like the mean, median, or mode. Here, we’ll use the mean to impute missing values.
import numpy as np
from sklearn.impute import SimpleImputer

# Identifying numerical columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Initializing the imputer with mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fitting the imputer on numerical columns
imp_mean.fit(X.iloc[:, numerical_cols])

# Transforming the data
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
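As a quick sanity check (not part of the original listing), the imputed numeric columns should now contain no missing values:
# Verifying that no NaNs remain in the numeric columns after imputation
print(X.iloc[:, numerical_cols].isnull().sum().sum())  # expected: 0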
B. Categorical Data
For categorical features, the most frequent value (mode) is a suitable replacement for missing data.
# Identifying categorical columns
string_cols = list(np.where(X.dtypes == object)[0])

# Initializing the imputer with most frequent strategy
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fitting the imputer on categorical columns
imp_mode.fit(X.iloc[:, string_cols])

# Transforming the data
X.iloc[:, string_cols] = imp_mode.transform(X.iloc[:, string_cols])
Encoding Categorical Variables
Machine learning models require numerical input. Therefore, it’s essential to convert categorical variables into numerical formats. We can achieve this using Label Encoding and One-Hot Encoding.
A. Label Encoding
Label Encoding assigns a unique integer to each unique category in a feature. It’s simple but may introduce ordinal relationships where there are none.
from sklearn import preprocessing

def LabelEncoderMethod(series):
    le = preprocessing.LabelEncoder()
    return le.fit_transform(series)

# Encoding the target variable
y = LabelEncoderMethod(y)
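To see what the encoder produced, the encoded target can be inspected. This check is not in the original article; LabelEncoder sorts categories alphabetically, so 'No' typically maps to 0 and 'Yes' to 1.
# Inspecting the encoded target classes and their counts
print(np.unique(y, return_counts=True))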
B. One-Hot Encoding
One-Hot Encoding creates binary columns for each category, eliminating ordinal relationships and ensuring each category is treated distinctly.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), indices)],
                                          remainder='passthrough')
    return columnTransformer.fit_transform(data)
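As a small illustration of what One-Hot Encoding does (a made-up example, not from the original article), a single column with three categories expands into three binary columns:
# Hypothetical example: a wind-direction-like column with three categories
demo = pd.DataFrame({'direction': ['N', 'SE', 'N', 'W']})
print(OneHotEncoderMethod([0], demo))  # one binary column per category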
Encoding Selection for Features
Depending on the number of unique categories in a column, it is more efficient to choose between Label Encoding and One-Hot Encoding: binary and high-cardinality columns are label encoded, while the remaining columns are one-hot encoded.
def EncodingSelection(X, threshold=10):
    # Step 1: Select the string columns
    string_cols = list(np.where(X.dtypes == object)[0])
    one_hot_encoding_indices = []

    # Step 2: Apply Label Encoding or mark for One-Hot Encoding based on category count
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    # Step 3: Apply One-Hot Encoding where necessary
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Applying encoding selection
X = EncodingSelection(X)

# Verifying the new shape
print(X.shape)
Output:
(142193, 23)
After this step, all 23 feature columns are numeric, which is a prerequisite for feature selection and model training.
Feature Selection
Not all features contribute equally to the prediction task. Feature selection helps in identifying and retaining the most informative features, enhancing model performance and reducing computational overhead.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

# Initializing SelectKBest with the chi-squared statistic
# (k is not used below; the final selection is done manually from the scores)
kbest = SelectKBest(score_func=chi2, k=10)

# Scaling features to [0, 1] with MinMaxScaler, since chi2 requires non-negative values
MMS = preprocessing.MinMaxScaler()
x_temp = MMS.fit_transform(X)

# Fitting SelectKBest to compute per-feature chi-squared scores
kbest.fit(x_temp, y)

# Keeping the 13 highest-scoring features and marking the rest for deletion
best_features = np.argsort(kbest.scores_)[-13:]
features_to_delete = np.argsort(kbest.scores_)[:-13]

# Dropping the least important features
X = np.delete(X, features_to_delete, axis=1)

# Verifying the new shape
print(X.shape)
Output:
(142193, 13)
This process reduces the feature set from 23 to 13, focusing on the most impactful features for our classification task.
Train-Test Split
To evaluate the performance of our classification model, we need to split the dataset into training and testing subsets.
from sklearn.model_selection import train_test_split

# Splitting the data: 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Displaying the shape of training data
print(X_train.shape)
Output:
(113754, 13)
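Rain and no-rain days are typically imbalanced in this dataset, so a stratified split that preserves the class proportions in both subsets is often preferable. This variant is an optional refinement, not part of the original listing:
# Optional: stratify on y so train and test keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y)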
Feature Scaling
Feature scaling brings all features onto a comparable scale so that no single feature dominates simply because of its magnitude. This is especially important for algorithms sensitive to feature magnitudes, such as Support Vector Machines and K-Nearest Neighbors.
Standardization
Standardization rescales each feature to have a mean of zero and a standard deviation of one, i.e. z = (x − μ) / σ.
from sklearn import preprocessing

# Initializing the StandardScaler
sc = preprocessing.StandardScaler(with_mean=False)

# Fitting the scaler on training data
sc.fit(X_train)

# Transforming both training and testing data
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

# Verifying the shape after scaling
print(X_train.shape)
print(X_test.shape)
Output:
(113754, 13)
(28439, 13)
Note: The parameter with_mean=False is used to avoid issues with the sparse matrices that One-Hot Encoding can produce, since centering a sparse matrix would destroy its sparsity.
Conclusion
Data preprocessing is a critical step in building robust and accurate classification models. By methodically handling missing data, encoding categorical variables, selecting relevant features, and scaling the result, we set a strong foundation for any machine learning model. This guide provided a hands-on approach using Python and its powerful libraries, ensuring that your classification problems are well-prepared for model training and evaluation. Remember, the adage “garbage in, garbage out” holds true in machine learning; investing time in data preprocessing pays dividends in model performance.
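As a closing, optional sketch (the column groups and the Logistic Regression model here are assumptions for illustration, not part of the original article), the same preprocessing steps can also be packaged into a single scikit-learn Pipeline. Fitting the transformers inside a pipeline on the training data only avoids leaking test-set statistics into the preprocessing.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups; adjust to the columns actually used
numeric_features = ['MinTemp', 'MaxTemp', 'Rainfall', 'Humidity3pm']
categorical_features = ['Location', 'WindGustDir', 'RainToday']

# Mean imputation followed by standardization for numeric columns
numeric_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

# Mode imputation followed by one-hot encoding for categorical columns
categorical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocess = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features),
])

# Preprocessing and classifier combined into one estimator
model = Pipeline([('preprocess', preprocess), ('clf', LogisticRegression(max_iter=1000))])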
Keywords: Classification Problems, Data Preprocessing, Machine Learning, Data Cleaning, Feature Selection, Label Encoding, One-Hot Encoding, Feature Scaling, Python, Pandas, Scikit-learn, Classification Models