Preparing Data for Machine Learning: Handling Missing Values, Encoding, and Balancing
Table of Contents
- Recap: One-Hot Encoding Basics
- Handling Missing Values
- Addressing the Date Feature
- One-Hot Encoding Revisited
- Handling Imbalanced Data
- Splitting the Data
- Feature Scaling
- Conclusion
Recap: One-Hot Encoding Basics
In our previous session, we introduced one-hot encoding, a method for converting categorical variables into a format suitable for machine learning algorithms. We added the necessary statements but paused to take a closer look at the remaining variables and their contents. Today, we'll build upon that foundation.
Handling Missing Values
Identifying Missing Data
Before encoding, it’s crucial to ensure your dataset doesn’t contain missing values, which can lead to errors during model training. Using pandas, we can identify missing values as follows:
import pandas as pd

missing_values = pd.isnull(x).sum()
print(missing_values)
A sum of zero indicates no missing values. However, if certain columns show non-zero values, those columns contain missing data that need to be addressed.
Managing Numerical and Categorical Missing Data
We’ve successfully handled missing values in numerical columns using strategies like mean or median imputation. However, categorical (string) columns require a different approach. For categorical data, the most frequent value is often used for imputation. Here’s how to implement it:
from sklearn.impute import SimpleImputer

# Split the features by dtype (assumes x is a pandas DataFrame)
x_numeric = x.select_dtypes(include='number')
x_categorical = x.select_dtypes(exclude='number')

# For numerical data
num_imputer = SimpleImputer(strategy='mean')
x_numeric = num_imputer.fit_transform(x_numeric)

# For categorical data
cat_imputer = SimpleImputer(strategy='most_frequent')
x_categorical = cat_imputer.fit_transform(x_categorical)
Addressing the Date Feature
Dates can be tricky since they often contain unique values, making them less useful for predictive modeling. Including the entire date can introduce high dimensionality and slow down your model without adding predictive power. Here are some strategies:
- Feature Extraction: Extract meaningful components like day and month while discarding the year (see the sketch after this list).
- Label Encoding: Assign numerical labels to dates, but be cautious as this may introduce unintended ordinal relationships.
- One-Hot Encoding: Not recommended for dates due to the explosion in the number of features.
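As a minimal sketch of the feature-extraction option, assuming the raw column is named date and is in a format pandas can parse:

# assumes x is a DataFrame with a parseable 'date' column
x['date'] = pd.to_datetime(x['date'])
x['day'] = x['date'].dt.day
x['month'] = x['date'].dt.month
x = x.drop(['date'], axis=1)  # the raw date is no longer needed once components are extracted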
Given these challenges, the most straightforward solution is to drop the date feature altogether if it’s not essential for your model:
x = x.drop(['date'], axis=1)
In our case, based on the dataset description from Kaggle's "Rain Prediction in Australia," we've also excluded the risk_mm variable, which the description recommends dropping because it can leak the target.
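A minimal sketch of that drop, assuming the column appears as risk_mm (adjust the name to match your DataFrame):

x = x.drop(['risk_mm'], axis=1)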
One-Hot Encoding Revisited
After handling missing values and removing irrelevant features, we proceed with one-hot encoding:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
x_encoded = encoder.fit_transform(x)
print(x_encoded.shape)  # Example output: (number_of_samples, 115)
As expected, the number of columns increases due to the encoding process, expanding from 23 to 115 in our example.
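Depending on how x is organized at this point, OneHotEncoder applied to the full matrix will also encode any numeric columns it contains. If you want to encode only the categorical columns and pass the numeric ones through unchanged, one possible sketch uses scikit-learn's ColumnTransformer; the way categorical_cols is built here is an assumption about your DataFrame:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# assumed: derive the categorical column names from the DataFrame's dtypes
categorical_cols = x.select_dtypes(exclude='number').columns.tolist()

ct = ColumnTransformer(
    transformers=[('categories', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough'  # keep numeric columns as-is
)
x_encoded = ct.fit_transform(x)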
Handling Imbalanced Data
Imbalanced datasets can bias your model toward the majority class, reducing its ability to predict the minority class accurately. Here’s how to address it:
- Check for Imbalance:
from collections import Counter

counter = Counter(y)
print(counter)  # Example output: {0: 2700, 1: 900}
If one class significantly outnumbers the other (e.g., 75% vs. 25%), balancing is necessary.
- Upsampling the Minority Class:
from sklearn.utils import resample

# Combine into a single DataFrame. Assumes x_encoded is a DataFrame
# (wrap a sparse matrix first, e.g. pd.DataFrame(x_encoded.toarray()))
# and that the target Series is named 'y'.
data = pd.concat([x_encoded, y], axis=1)

# Separate majority and minority classes
majority = data[data.y == 0]
minority = data[data.y == 1]

# Upsample the minority class with replacement until it matches the majority count
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)

# Combine majority with upsampled minority
balanced_data = pd.concat([majority, minority_upsampled])

# Separate features and target
X_balanced = balanced_data.drop('y', axis=1)
y_balanced = balanced_data['y']
- Verification:
print(Counter(y_balanced))
# Output: {0: 2700, 1: 2700}
Splitting the Data
With balanced data, we proceed to split it into training and testing sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced,
                                                     test_size=0.2,
                                                     random_state=42)
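As an optional sanity check, you can confirm that both splits preserve the roughly even class balance (Counter was imported earlier):

print(Counter(y_train))
print(Counter(y_test))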
Feature Scaling
Finally, we standardize the features to ensure that each feature contributes equally to the model’s performance:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=False)  # Avoid centering on sparse matrices
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape)
print(X_test_scaled.shape)
Note: When dealing with sparse matrices resulting from one-hot encoding, setting with_mean=False in StandardScaler prevents errors related to centering.
Conclusion
Data preprocessing is both an art and a science, requiring thoughtful decision-making to prepare your dataset effectively. By handling missing values, encoding categorical variables, managing date features, and balancing your data, you set a solid foundation for building robust machine learning models. Remember, the quality of your data directly influences the performance of your models, so invest the necessary time and effort in these preprocessing steps.
Feel free to revisit this Jupyter notebook for a hands-on experience, and don’t hesitate to reach out if you have any questions. Happy modeling!