S05L04 – Train and Test Data Split and Feature Scaling

Understanding Data Splitting and Feature Scaling in Machine Learning

Table of Contents

  1. Data Splitting: Training and Testing Sets
    1. What is a Test Set?
    2. What is a Train Set?
    3. Typical Split Ratio
    4. Implementing Data Splitting with scikit-learn
  2. Feature Scaling: Standardization and Normalization
    1. Why Feature Scaling?
    2. Standardization vs. Normalization
    3. Recommended Approach
    4. Implementing Feature Scaling with scikit-learn
  3. Summary of Steps
  4. Conclusion

Data Splitting: Training and Testing Sets

What is a Test Set?

A test set is a subset of your dataset that is reserved for evaluating the performance of your machine learning model. By feeding the model this holdout data, you can assess how accurately it predicts new, unseen data, thereby understanding the model’s real-world performance.

What is a Train Set?

Conversely, a train set is the portion of your data used to train the model. The model learns patterns, relationships, and structures within this data to make predictions or classifications on new data.

Typical Split Ratio

A common practice is to split the data into 80% for training and 20% for testing. This ratio provides a balance between allowing the model sufficient data to learn from and retaining enough data to robustly evaluate its performance.

Implementing Data Splitting with scikit-learn

Here’s a step-by-step guide to splitting your data using scikit-learn’s train_test_split function:

  1. Import Necessary Libraries
  2. Prepare Your Data

    Assume your features are stored in X and your target variable in Y.

  3. Split the Data
    • test_size=0.2: Allocates 20% of the data for testing.
    • random_state=42: Ensures reproducibility by controlling the shuffling process. Using a fixed random_state means you’ll get the same split every time you run the code, which is crucial for consistent model evaluation.
  4. Verify the Split

    You can check the number of records in each set; a minimal code sketch covering all four steps follows this list.
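
Here is a minimal sketch of the four steps above, assuming the data lives in a CSV file with a Price column as the target (the file name and column name are placeholders for illustration):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # 1. Import the necessary libraries (done above) and load the dataset.
    #    'data.csv' and the 'Price' target column are placeholder names.
    data = pd.read_csv('data.csv')

    # 2. Prepare the data: features in X, target in Y.
    X = data.drop(columns=['Price'])
    Y = data['Price']

    # 3. Split into 80% training and 20% testing data.
    #    random_state=42 fixes the shuffle so the split is reproducible.
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.2, random_state=42
    )

    # 4. Verify the split by counting the records in each set.
    print(len(X_train), len(X_test))   # e.g. 800 and 200 for 1,000 rows
    print(len(Y_train), len(Y_test))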

Feature Scaling: Standardization and Normalization

Why Feature Scaling?

Many machine learning algorithms, particularly those based on distance calculations or gradient descent, perform better when numerical input features are on a comparable scale. Features with larger ranges can disproportionately influence the model, leading to suboptimal performance. Feature scaling standardizes the range of the features, improving both the training efficiency and the accuracy of the model.

Standardization vs. Normalization

  1. Standardization:
    • Formula: \( z = \frac{X - \mu}{\sigma} \)
    • Transforms the data to have a mean of 0 and a standard deviation of 1.
    • Suitable for features with a Gaussian (normal) distribution.
    • Widely used and generally effective, even when the data isn’t perfectly normal.
  2. Normalization:
    • Formula: \( X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}} \)
    • Scales the data to a fixed range, typically 0 to 1.
    • Best used when the data has known minimum and maximum bounds or when a fixed output range is required (a small sketch after this list contrasts the two transformations).
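
To make the two formulas concrete, here is a small sketch that applies scikit-learn's StandardScaler and MinMaxScaler to the same toy column (the numbers are chosen only for illustration):

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    # A single toy feature column; the values are made up for illustration.
    X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

    # Standardization: z = (X - mean) / std, giving mean 0 and std 1.
    print(StandardScaler().fit_transform(X).ravel())
    # [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]

    # Normalization: (X - min) / (max - min), scaling values into [0, 1].
    print(MinMaxScaler().fit_transform(X).ravel())
    # [0.   0.25 0.5  0.75 1.  ]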

Recommended Approach

It’s generally advisable to split the data before performing feature scaling. This practice ensures that the scaling parameters (like mean and standard deviation) are derived solely from the training data, preventing data leakage and ensuring that the test data remains a true holdout set.

Implementing Feature Scaling with scikit-learn

  1. Import the StandardScaler
  2. Initialize the Scaler
  3. Fit and Transform the Training Data
  4. Transform the Test Data
    • Important: Only fit the scaler on the training data. The same transformation is then applied to the test data. This ensures that the test data is scaled consistently without introducing information from the test set into the training process.
  5. Handling Categorical Variables

    If your dataset includes categorical variables encoded as numerical values (e.g., 0, 1, 2), avoid applying scaling to these columns, as it can distort their meaning. Ensure that only continuous numerical features undergo scaling; a minimal sketch of these steps follows this list.
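
Here is a minimal sketch of these steps, assuming X_train and X_test come from the earlier split and that the continuous columns are known in advance (the column names are hypothetical):

    from sklearn.preprocessing import StandardScaler

    # Steps 1–2: import and initialize the scaler.
    scaler = StandardScaler()

    # Hypothetical list of continuous columns; encoded categorical columns
    # (e.g. 0/1 dummies) are deliberately excluded from scaling.
    numeric_cols = ['Age', 'Salary']

    # Work on copies so the original split DataFrames are not modified.
    X_train, X_test = X_train.copy(), X_test.copy()

    # Step 3: fit on the training data only, then transform it.
    X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])

    # Step 4: reuse the training-set mean and standard deviation on the test data.
    X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])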

Summary of Steps

  1. Import Data: Load your dataset into a suitable format (e.g., pandas DataFrame).
  2. Split Data: Divide the dataset into features (X) and target (Y), then perform an 80/20 train-test split.
  3. Handle Missing Data: Address any gaps in your data through imputation or removal.
  4. Feature Selection: Remove irrelevant or redundant features to improve model performance.
  5. Encode Data: Convert categorical variables into numerical formats if necessary.
  6. Feature Scaling: Apply standardization or normalization so that all features contribute comparably to the model (a combined sketch of these steps follows this list).
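
Putting the steps together in the order listed above, here is one possible end-to-end sketch; the file name, column names, and the choices of mean imputation and dummy encoding are illustrative assumptions rather than a fixed recipe:

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # 1. Import data (placeholder file and column names throughout).
    data = pd.read_csv('data.csv')
    X = data.drop(columns=['Price'])
    Y = data['Price']

    # 2. Split 80/20 before computing any statistics, to avoid leakage.
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.2, random_state=42
    )
    X_train, X_test = X_train.copy(), X_test.copy()

    # 3. Handle missing data: impute with the training-set mean.
    num_cols = ['Age', 'Salary']
    imputer = SimpleImputer(strategy='mean')
    X_train[num_cols] = imputer.fit_transform(X_train[num_cols])
    X_test[num_cols] = imputer.transform(X_test[num_cols])

    # 4. Feature selection: drop a column judged irrelevant.
    X_train = X_train.drop(columns=['CustomerID'])
    X_test = X_test.drop(columns=['CustomerID'])

    # 5. Encode a categorical column consistently across both sets.
    X_train = pd.get_dummies(X_train, columns=['Country'], drop_first=True)
    X_test = pd.get_dummies(X_test, columns=['Country'], drop_first=True)
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

    # 6. Feature scaling: fit on the training data, reuse on the test data.
    scaler = StandardScaler()
    X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
    X_test[num_cols] = scaler.transform(X_test[num_cols])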

Conclusion

Proper data preparation is a cornerstone of successful machine learning projects. By meticulously splitting your data and applying appropriate feature scaling, you set the stage for building models that are both accurate and reliable. As you continue to explore machine learning, these foundational practices will serve you well in tackling more complex challenges.


Stay tuned for our next article, where we’ll delve deeper into preprocessing techniques and other critical aspects of building robust machine learning models.
