Preparing Data for Machine Learning: Handling Missing Values, Encoding, and Balancing
Table of Contents
- Recap: One-Hot Encoding Basics
- Handling Missing Values
- Addressing the Date Feature
- One-Hot Encoding Revisited
- Handling Imbalanced Data
- Splitting the Data
- Feature Scaling
- Conclusion
Recap: One-Hot Encoding Basics
In our previous session, we introduced one-hot encoding, a method for converting categorical variables into a format suitable for machine learning algorithms. We added the necessary statements but paused to take a closer look at the remaining variables and their contents. Today, we'll build upon that foundation.
Handling Missing Values
Identifying Missing Data
Before encoding, it’s crucial to ensure your dataset doesn’t contain missing values, which can lead to errors during model training. Using pandas, we can identify missing values as follows:
import pandas as pd

missing_values = pd.isnull(x).sum()
print(missing_values)
A sum of zero indicates no missing values. However, if certain columns show non-zero values, those columns contain missing data that need to be addressed.
Managing Numerical and Categorical Missing Data
We’ve successfully handled missing values in numerical columns using strategies like mean or median imputation. However, categorical (string) columns require a different approach. For categorical data, the most frequent value is often used for imputation. Here’s how to implement it:
from sklearn.impute import SimpleImputer

# Split the features by dtype (assumes x is a pandas DataFrame)
x_numeric = x.select_dtypes(include='number')
x_categorical = x.select_dtypes(exclude='number')

# For numerical data
num_imputer = SimpleImputer(strategy='mean')
x_numeric = num_imputer.fit_transform(x_numeric)

# For categorical data
cat_imputer = SimpleImputer(strategy='most_frequent')
x_categorical = cat_imputer.fit_transform(x_categorical)
Addressing the Date Feature
Dates can be tricky since they often contain unique values, making them less useful for predictive modeling. Including the entire date can introduce high dimensionality and slow down your model without adding predictive power. Here are some strategies:
- Feature Extraction: Extract meaningful components like day and month while discarding the year (see the sketch after this list).
- Label Encoding: Assign numerical labels to dates, but be cautious as this may introduce unintended ordinal relationships.
- One-Hot Encoding: Not recommended for dates due to the explosion in the number of features.
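As a minimal sketch of the feature-extraction option, assuming the raw column is named date and is in a format pandas can parse:

# assumes x is a DataFrame with a parseable 'date' column
x['date'] = pd.to_datetime(x['date'])
x['day'] = x['date'].dt.day
x['month'] = x['date'].dt.month
x = x.drop(['date'], axis=1)  # the raw date is no longer needed once components are extracted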
Given these challenges, the most straightforward solution is to drop the date feature altogether if it’s not essential for your model:
x = x.drop(['date'], axis=1)
In our case, based on the dataset description from Kaggle's "Rain Prediction in Australia," we've also excluded the risk_mm variable, which the description recommends dropping because it can leak the target.
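A minimal sketch of that drop, assuming the column appears as risk_mm (adjust the name to match your DataFrame):

x = x.drop(['risk_mm'], axis=1)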
One-Hot Encoding Revisited
After handling missing values and removing irrelevant features, we proceed with one-hot encoding:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
x_encoded = encoder.fit_transform(x)
print(x_encoded.shape)  # Example output: (number_of_samples, 115)
As expected, the number of columns increases due to the encoding process, expanding from 23 to 115 in our example.
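Depending on how x is organized at this point, OneHotEncoder applied to the full matrix will also encode any numeric columns it contains. If you want to encode only the categorical columns and pass the numeric ones through unchanged, one possible sketch uses scikit-learn's ColumnTransformer; the way categorical_cols is built here is an assumption about your DataFrame:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# assumed: derive the categorical column names from the DataFrame's dtypes
categorical_cols = x.select_dtypes(exclude='number').columns.tolist()

ct = ColumnTransformer(
    transformers=[('categories', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
    remainder='passthrough'  # keep numeric columns as-is
)
x_encoded = ct.fit_transform(x)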
Handling Imbalanced Data
Imbalanced datasets can bias your model toward the majority class, reducing its ability to predict the minority class accurately. Here’s how to address it:
- Check for Imbalance:
from collections import Counter

counter = Counter(y)
print(counter)  # Example output: {0: 2700, 1: 900}
If one class significantly outnumbers the other (e.g., 75% vs. 25%), balancing is necessary.
- Upsampling the Minority Class:
from sklearn.utils import resample

# Combine into a single DataFrame. Assumes x_encoded is a DataFrame
# (wrap a sparse matrix first, e.g. pd.DataFrame(x_encoded.toarray()))
# and that the target Series is named 'y'.
data = pd.concat([x_encoded, y], axis=1)

# Separate majority and minority classes
majority = data[data.y == 0]
minority = data[data.y == 1]

# Upsample the minority class with replacement until it matches the majority count
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=42)

# Combine majority with upsampled minority
balanced_data = pd.concat([majority, minority_upsampled])

# Separate features and target
X_balanced = balanced_data.drop('y', axis=1)
y_balanced = balanced_data['y']
- Verification:
print(Counter(y_balanced))
# Output: {0: 2700, 1: 2700}
Splitting the Data
With balanced data, we proceed to split it into training and testing sets:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_balanced, y_balanced,
                                                     test_size=0.2,
                                                     random_state=42)
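As an optional sanity check, you can confirm that both splits preserve the roughly even class balance (Counter was imported earlier):

print(Counter(y_train))
print(Counter(y_test))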
Feature Scaling
Finally, we standardize the features to ensure that each feature contributes equally to the model’s performance:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=False)  # Avoid centering on sparse matrices
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape)
print(X_test_scaled.shape)
Note: When dealing with sparse matrices resulting from one-hot encoding, setting with_mean=False in StandardScaler prevents errors related to centering.
Conclusion
Data preprocessing is both an art and a science, requiring thoughtful decision-making to prepare your dataset effectively. By handling missing values, encoding categorical variables, managing date features, and balancing your data, you set a solid foundation for building robust machine learning models. Remember, the quality of your data directly influences the performance of your models, so invest the necessary time and effort in these preprocessing steps.
Feel free to revisit this Jupyter notebook for a hands-on experience, and don’t hesitate to reach out if you have any questions. Happy modeling!