Implementing K-Fold Cross-Validation for Car Price Prediction Without GridSearchCV

Table of Contents

Introduction
Dataset Overview
Data Preprocessing
  1. Handling Missing Data
  2. Feature Selection
Feature Engineering
  1. Encoding Categorical Variables
  2. Feature Scaling
Building Regression Models
Implementing K-Fold Cross-Validation
Evaluating Model Performance
Conclusion

Introduction

Predicting car prices is a classic regression problem: the goal is to forecast a vehicle's price from features such as engine size, horsepower, and fuel type. K-Fold Cross-Validation makes the performance estimate more reliable by testing each model on several held-out subsets instead of a single train-test split. This article demonstrates how to preprocess data, engineer features, build multiple regression models, and evaluate their performance using K-Fold Cross-Validation in Python.

Dataset Overview

We will be using the Car Price Prediction dataset from Kaggle, which contains detailed specifications of different car models along with their prices. The dataset includes features like symboling, CarName, fueltype, aspiration, doornumber, carbody, and many more that influence the car’s price.

Data Preprocessing

Effective data preprocessing is essential to prepare the dataset for modeling. This involves handling missing values, encoding categorical variables, and selecting relevant features.

Handling Missing Data

Numeric Data

Missing values in numeric features can be handled using statistical measures. We’ll use the mean strategy to impute missing values in numeric columns.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

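# Load the dataset
data = pd.read_csv('CarPrice.csv')
# Features are all columns except the last; the last column is the target (price).
# .copy() avoids pandas chained-assignment warnings when columns are overwritten below.
X = data.iloc[:, :-1].copy()
y = data.iloc[:, -1]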

# Identify numerical columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

# Impute missing values with mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X[numerical_cols] = imp_mean.fit_transform(X[numerical_cols])
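
To make the effect of the mean strategy concrete, here is a minimal sketch on a toy frame. The column names and values are hypothetical and are not taken from the Kaggle file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with hypothetical values (not from the Kaggle dataset)
toy = pd.DataFrame({
    'horsepower': [100.0, np.nan, 140.0],
    'enginesize': [120.0, 130.0, np.nan],
})

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
# The missing horsepower becomes (100 + 140) / 2 = 120,
# and the missing enginesize becomes (120 + 130) / 2 = 125.
```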

Categorical Data

For categorical features, the most frequent value strategy is effective in imputing missing values.

from sklearn.impute import SimpleImputer

# Identify categorical columns
string_cols = X.select_dtypes(include=['object']).columns

# Impute missing values with the most frequent value
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X[string_cols] = imp_freq.fit_transform(X[string_cols])
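
The same idea on a toy categorical column, as a small sketch with made-up values: the missing entry is replaced by the value that occurs most often.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column; 'gas' appears twice, 'diesel' once
toy = pd.DataFrame({'fueltype': ['gas', 'gas', np.nan, 'diesel']}, dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
filled = imputer.fit_transform(toy)

print(filled.ravel())  # the NaN at position 2 is replaced by 'gas'
```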

Feature Selection

Selecting relevant features helps in reducing the complexity of the model and improving its performance.

# Drop the 'car_ID' column as it does not contribute to price prediction
X.drop('car_ID', axis=1, inplace=True)

Feature Engineering

Feature engineering involves transforming raw data into meaningful features that better represent the underlying problem to the predictive models.

Encoding Categorical Variables

Machine learning algorithms require numerical input, so categorical variables need to be encoded. We’ll use One-Hot Encoding to convert categorical variables into a binary matrix.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Identify categorical columns by their indices
string_cols = list(np.where((X.dtypes == object))[0])

# Apply One-Hot Encoding
columnTransformer = ColumnTransformer(
    [('encoder', OneHotEncoder(), string_cols)],
    remainder='passthrough'
)
X = columnTransformer.fit_transform(X)
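
A small sketch of what the transformer produces, using a hypothetical two-column frame. The encoded columns come first (one per category, in sorted order), followed by the passthrough columns. Depending on how many zeros the result contains, scikit-learn may return a sparse matrix, so the sketch converts to a dense array before printing.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not the Kaggle file
toy = pd.DataFrame({
    'fueltype': ['gas', 'diesel', 'gas'],
    'horsepower': [111, 154, 102],
})

transformer = ColumnTransformer(
    [('encoder', OneHotEncoder(), ['fueltype'])],
    remainder='passthrough',
)
encoded = transformer.fit_transform(toy)

# Convert to a dense array if the transformer returned a sparse matrix
dense = encoded.toarray() if sparse.issparse(encoded) else encoded
print(dense)
# Columns: fueltype=diesel, fueltype=gas, horsepower
```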

Feature Scaling

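Scaling puts features on a comparable range so that no feature dominates simply because of its units. This matters most for distance-based models such as SVR; tree-based models are largely insensitive to it. Because the one-hot encoded matrix may be sparse, we set with_mean=False, which divides by the standard deviation without centering.

from sklearn.preprocessing import StandardScaler

# Initialize the scaler; with_mean=False allows sparse input
sc = StandardScaler(with_mean=False)

# X_train and X_test are the per-fold splits created in the K-Fold loop later in this article.
# Fit on the training split only, then apply the same scaling to both splits.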
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
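
The following sketch illustrates both points on a tiny hypothetical sparse matrix: centering is rejected for sparse input, and the scaler's statistics come from the training rows only.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Hypothetical sparse training and test matrices
X_train = sparse.csr_matrix(np.array([[0.0, 2.0], [1.0, 4.0], [0.0, 6.0]]))
X_test = sparse.csr_matrix(np.array([[1.0, 8.0]]))

# Centering would destroy sparsity, so scikit-learn refuses it for sparse input
try:
    StandardScaler().fit(X_train)
    centering_failed = False
except ValueError:
    centering_failed = True

# Scaling without centering works and keeps the matrix sparse
scaler = StandardScaler(with_mean=False)
scaler.fit(X_train)                        # statistics from the training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the training statistics

print(centering_failed)
print(X_train_scaled.toarray())
```

Fitting on the training rows alone keeps information from the held-out rows out of the preprocessing step.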

Building Regression Models

We’ll build and evaluate five different regression models to predict car prices:

Decision Tree Regressor
Random Forest Regressor
AdaBoost Regressor
XGBoost Regressor
Support Vector Regressor (SVR)

Decision Tree Regressor

A Decision Tree Regressor splits the data into subsets based on feature values, making it easy to interpret.

from sklearn.tree import DecisionTreeRegressor

# Initialize the model
decisionTreeRegressor = DecisionTreeRegressor(max_depth=4)

Random Forest Regressor

Random Forest aggregates the predictions of multiple Decision Trees, reducing overfitting and improving accuracy.

from sklearn.ensemble import RandomForestRegressor

# Initialize the model
randomForestRegressor = RandomForestRegressor(n_estimators=25, random_state=10)

AdaBoost Regressor

AdaBoost combines multiple weak learners to create a strong predictive model, focusing on instances that were previously mispredicted.

from sklearn.ensemble import AdaBoostRegressor

# Initialize the model
adaBoostRegressor = AdaBoostRegressor(random_state=0, n_estimators=100)

XGBoost Regressor

XGBoost is an optimized distributed gradient boosting library designed for performance and speed.

import xgboost as xgb

# Initialize the model
xgbRegressor = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3,
    learning_rate=0.05
)

Support Vector Regressor (SVR)

SVR uses the principles of Support Vector Machines for regression tasks, effective in high-dimensional spaces.

from sklearn.svm import SVR

# Initialize the model
svr = SVR()
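
SVR is sensitive to feature scale, so one option is to bundle the scaler and the model into a single estimator. The sketch below is an assumption-laden illustration rather than the article's pipeline: it uses synthetic data from make_regression and dense input, so the default StandardScaler is fine here.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data; the article itself uses the encoded car features
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# fit() scales using X_tr only, then trains the SVR on the scaled values;
# score() applies the same scaling to X_te before predicting
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_svr.fit(X_tr, y_tr)

score = scaled_svr.score(X_te, y_te)  # R² on the held-out rows
print(score)
```

Because the pipeline exposes fit and predict like any other estimator, the same object could be passed as the model argument of the build_model function used in the K-Fold section.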

Implementing K-Fold Cross-Validation

K-Fold Cross-Validation partitions the dataset into k subsets and iteratively trains and validates the model k times, each time using a different subset as the validation set.
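
The partitioning can be seen directly on a toy array. This sketch uses 10 hypothetical samples and 5 folds (the article uses 10 folds on the car data) and checks that every sample lands in a validation set exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(-1, 1)  # 10 toy samples
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

seen = []
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo), start=1):
    print(f'Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}')
    seen.extend(test_idx.tolist())
    fold_sizes.append((len(train_idx), len(test_idx)))

# Each sample is used for validation exactly once across the folds
print(sorted(seen) == list(range(10)))
```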

from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

# Define the K-Fold Cross Validator
kf = KFold(n_splits=10, random_state=42, shuffle=True)

# Function to build and evaluate the model
def build_model(X_train, X_test, y_train, y_test, model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return r2_score(y_test, y_pred)

Running K-Fold Cross-Validation

We’ll evaluate each model’s performance across the K-Folds and compute the mean R² score.

# Initialize score lists
decisionTreeRegressor_scores = []
randomForestRegressor_scores = []
adaBoostRegressor_scores = []
xgbRegressor_scores = []
svr_scores = []

# Perform K-Fold Cross-Validation
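for train_index, test_index in kf.split(X):
    # X is the encoded matrix, so rows are selected by position directly
    X_train, X_test = X[train_index], X[test_index]
    # y is a pandas Series; .iloc selects rows by position rather than by label
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]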
    
    # Decision Tree
    decisionTreeRegressor_scores.append(
        build_model(X_train, X_test, y_train, y_test, decisionTreeRegressor)
    )
    
    # Random Forest
    randomForestRegressor_scores.append(
        build_model(X_train, X_test, y_train, y_test, randomForestRegressor)
    )
    
    # AdaBoost
    adaBoostRegressor_scores.append(
        build_model(X_train, X_test, y_train, y_test, adaBoostRegressor)
    )
    
    # XGBoost
    xgbRegressor_scores.append(
        build_model(X_train, X_test, y_train, y_test, xgbRegressor)
    )
    
    # SVR
    svr_scores.append(
        build_model(X_train, X_test, y_train, y_test, svr)
    )
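
As a sanity check, a manual loop like the one above can be compared with scikit-learn's cross_val_score helper. The sketch below uses synthetic data and a single model, both chosen only for illustration, and shows that the two approaches give the same per-fold R² scores when the splitter and the model's random_state are fixed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the car dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
model = DecisionTreeRegressor(max_depth=4, random_state=0)

# Manual loop: fit on the training rows, score on the held-out rows
manual_scores = []
for train_idx, test_idx in kf_demo.split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = model.predict(X_demo[test_idx])
    manual_scores.append(r2_score(y_demo[test_idx], y_pred))

# Built-in helper using the same splitter and metric
helper_scores = cross_val_score(model, X_demo, y_demo, cv=kf_demo, scoring='r2')

print(np.allclose(manual_scores, helper_scores))
```

The manual loop remains useful when per-fold steps such as custom preprocessing or logging are needed.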

Evaluating Model Performance

After running K-Fold Cross-Validation, we’ll calculate the mean R² score for each model to assess their performance.

from statistics import mean

print('Decision Tree Regressor Mean R² Score: ', mean(decisionTreeRegressor_scores))
print('Random Forest Regressor Mean R² Score: ', mean(randomForestRegressor_scores))
print('AdaBoost Regressor Mean R² Score: ', mean(adaBoostRegressor_scores))
print('XGBoost Regressor Mean R² Score: ', mean(xgbRegressor_scores))
print('SVR Mean R² Score: ', mean(svr_scores))

Sample Output:

Decision Tree Regressor Mean R² Score:  0.8786768422448108
Random Forest Regressor Mean R² Score:  0.9070724684428952
AdaBoost Regressor Mean R² Score:  0.894756851083693
XGBoost Regressor Mean R² Score:  0.9049838393114154
SVR Mean R² Score:  -0.1510507928400266

Interpretation:

Random Forest Regressor shows the highest mean R² score (about 0.907), with XGBoost close behind (about 0.905).
SVR yields a negative mean R² score, meaning it predicts worse than a model that always outputs the mean price. A likely cause is that SVR was run with its default hyperparameters and no tuning.
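
To see why a negative R² signals a model that is worse than a constant prediction, consider this small worked example with made-up numbers.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])

# Baseline: always predict the mean of the targets
mean_baseline = np.full_like(y_true, y_true.mean())
# A deliberately bad prediction: the targets in reverse order
reversed_pred = np.array([3.0, 2.0, 1.0])

r2_baseline = r2_score(y_true, mean_baseline)
r2_reversed = r2_score(y_true, reversed_pred)

print(r2_baseline)  # 0.0: squared error equals the variance around the mean
print(r2_reversed)  # -3.0: squared error (8) is four times that variance (2)
```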

Conclusion

K-Fold Cross-Validation provides a robust way to evaluate regression models, because the resulting score does not depend on one particular train-test split. In this guide, we demonstrated how to preprocess data, encode categorical variables, scale features, build multiple regression models, and evaluate their performance using K-Fold Cross-Validation without GridSearchCV.

Key Takeaways:

Data Preprocessing: Proper handling of missing data and feature selection are crucial for model performance.
Feature Engineering: Encoding categorical variables and scaling features can significantly impact the model’s ability to learn patterns.
Model Evaluation: K-Fold Cross-Validation offers a reliable way to assess how well your model generalizes to unseen data.
Model Selection: Among the models tested, ensemble methods like Random Forest and XGBoost outperform the single Decision Tree and the untuned SVR in this particular case.

For further optimization, integrating hyperparameter tuning techniques such as GridSearchCV or RandomizedSearchCV can enhance model performance by finding the best set of parameters for each algorithm.

—

By following this structured approach, you can effectively implement K-Fold Cross-Validation for various regression tasks, ensuring your models are both accurate and robust.
