Mastering Car Price Prediction with Advanced Regression Models: A Comprehensive Guide
Table of Contents
- Introduction
- Dataset Overview
- Data Import and Initial Exploration
- Data Cleaning and Preprocessing
- Feature Selection and Encoding
- Train-Test Split
- Feature Scaling
- Building and Evaluating Regression Models
- Model Performance Comparison
- Conclusion
Introduction
Predictive analytics empowers businesses to anticipate future trends, optimize operations, and enhance decision-making processes. Car price prediction is a quintessential example where machine learning models can forecast vehicle prices based on attributes like brand, engine specifications, fuel type, and more. This guide walks you through building a comprehensive regression model pipeline, from data preprocessing to evaluating multiple regression algorithms.
Dataset Overview
The Car Price Prediction dataset on Kaggle is a rich resource containing 205 entries with 26 features each. These features encompass various aspects of cars, such as the number of doors, engine size, horsepower, fuel type, and more, all of which influence the car’s market price.
Key Features:
- CarName: Name of the car (brand and model)
- FuelType: Type of fuel used (e.g., gas, diesel)
- Aspiration: Engine aspiration type
- Doornumber: Number of doors (two or four)
- Enginesize: Size of the engine
- Horsepower: Engine power
- Price: Market price of the car (target variable)
Data Import and Initial Exploration
First, we import the dataset using pandas and take a preliminary look at the data structure.
```python
import pandas as pd

# Load the dataset
data = pd.read_csv('CarPrice.csv')

# Display the first five rows
print(data.head())
```
Sample Output:
```
   car_ID  symboling                   CarName fueltype aspiration doornumber  \
0       1          3        alfa-romero giulia      gas        std        two
1       2          3       alfa-romero stelvio      gas        std        two
2       3          1  alfa-romero Quadrifoglio      gas        std        two
3       4          2               audi 100 ls      gas        std       four
4       5          2                audi 100ls      gas        std       four

       carbody drivewheel enginelocation  wheelbase  ...  horsepower  peakrpm  citympg  \
0  convertible        rwd          front       88.6  ...       111.0     5000       21
1  convertible        rwd          front       88.6  ...       111.0     5000       21
2    hatchback        rwd          front       94.5  ...       154.0     5000       19
3        sedan        fwd          front       99.8  ...       102.0     5500       24
4        sedan        4wd          front       99.4  ...       115.0     5500       18

   highwaympg    price
0          27  13495.0
1          27  16500.0
2          26  16500.0
3          30  13950.0
4          22  17450.0
```
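Beyond head(), it is worth checking the overall structure and which columns contain missing values before any preprocessing. A minimal sketch of this initial exploration (output omitted; the exact missing-value counts depend on the copy of the dataset you download):

```python
# Inspect dimensions, column dtypes, and missing values
print(data.shape)            # expected: (205, 26)
print(data.dtypes)           # which columns are numerical vs. categorical
print(data.isnull().sum())   # count of missing values per column
```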
Data Cleaning and Preprocessing
Handling Missing Numerical Data
Missing values can significantly skew the performance of machine learning models. We first address missing numerical data by imputing with the mean value.
```python
import numpy as np
from sklearn.impute import SimpleImputer

# Separate the features (X) from the target variable (Y)
X = data.drop('price', axis=1)
Y = data['price']

# Identify numerical columns by position
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Initialize the imputer with a mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X.iloc[:, numerical_cols])

# Impute missing numerical data
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
```
Handling Missing Categorical Data
For categorical variables, missing values are imputed using the most frequent strategy.
```python
# Identify categorical (object-dtype) columns by position
string_cols = list(np.where(X.dtypes == object)[0])

# Initialize imputer for categorical data with a most-frequent strategy
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_freq.fit(X.iloc[:, string_cols])

# Impute missing categorical data
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
```
Feature Selection and Encoding
Dropping Irrelevant Features
The car_ID column is a unique identifier and does not contribute to the predictive power of the model, so it is removed.
```python
# Drop the 'car_ID' column
X.drop('car_ID', axis=1, inplace=True)
```
One-Hot Encoding Categorical Variables
Machine learning algorithms require numerical input. Therefore, categorical variables are transformed using One-Hot Encoding.
```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Re-identify categorical columns after dropping 'car_ID'
string_cols = list(np.where(X.dtypes == object)[0])

# One-hot encode the categorical columns, passing the remaining columns through unchanged
columnTransformer = ColumnTransformer(
    [('encoder', OneHotEncoder(), string_cols)],
    remainder='passthrough'
)
X = columnTransformer.fit_transform(X)
```
Before Encoding:
- Shape: (205, 24)

After Encoding:
- Shape: (205, 199)

Most of this growth comes from the CarName column, which has well over a hundred distinct values and therefore contributes one indicator column per unique name.
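To make the transformation concrete, here is a small standalone illustration (not part of the pipeline above) of what the encoder does to a single categorical column such as fueltype:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy example: encode one categorical column with two categories
fuel = pd.DataFrame({'fueltype': ['gas', 'diesel', 'gas']})
encoded = OneHotEncoder().fit_transform(fuel).toarray()
print(encoded)
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]
# Columns follow the sorted categories: ['diesel', 'gas']
```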
Train-Test Split
Splitting the dataset into training and testing sets is crucial for evaluating model performance.
```python
from sklearn.model_selection import train_test_split

# Perform an 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, random_state=1
)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
```
Output:
```
Training set shape: (164, 199)
Testing set shape: (41, 199)
```
Feature Scaling
Feature scaling puts all features on a comparable scale so that no single feature dominates the model. Here we use standardization; with_mean=False is required because the one-hot encoded feature matrix is sparse and cannot be mean-centered without densifying it.
```python
from sklearn import preprocessing

# Initialize StandardScaler (with_mean=False keeps the sparse matrix sparse)
sc = preprocessing.StandardScaler(with_mean=False)
sc.fit(X_train)

# Transform the training and test data
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
```
Building and Evaluating Regression Models
We will explore several regression models, evaluating each based on the R² score.
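As a refresher, R² compares the model's squared error to that of a constant predictor that always outputs the mean of the test targets: 1 is a perfect fit, 0 matches the mean predictor, and negative values are worse than it. A minimal sketch of the computation (equivalent to sklearn.metrics.r2_score for the single-output case):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, where SS_tot uses the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot
```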
1. Linear Regression
Linear Regression serves as a baseline model.
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Linear Regression R² Score: {r2:.2f}")
```
R² Score: 0.097
Interpretation: The model explains approximately 9.7% of the variance in car prices.
2. Polynomial Linear Regression
To capture non-linear relationships, we introduce polynomial features.
```python
from sklearn.preprocessing import PolynomialFeatures

# Expand the features with all degree-2 terms
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Train the model on the expanded features
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_poly)
r2 = r2_score(y_test, y_pred)
print(f"Polynomial Linear Regression R² Score: {r2:.2f}")
```
R² Score: -0.45
Interpretation: A negative R² means the predictions are worse than simply predicting the mean price for every car, so the polynomial model underperforms even that trivial baseline.
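One likely culprit is the explosion in feature count: a degree-2 expansion of the 199 encoded columns produces roughly 20,000 terms against only 164 training rows, which invites severe overfitting. A small standalone illustration of what PolynomialFeatures generates for two inputs [a, b]:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion of [a, b] -> [1, a, b, a^2, a*b, b^2]
demo = np.array([[2.0, 3.0]])
print(PolynomialFeatures(degree=2).fit_transform(demo))
# [[1. 2. 3. 4. 6. 9.]]
```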
3. Decision Tree Regression
Decision Trees can model complex relationships by partitioning the data.
```python
from sklearn.tree import DecisionTreeRegressor

# Initialize and train the model (depth limited to 4 to curb overfitting)
model = DecisionTreeRegressor(max_depth=4)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Decision Tree Regression R² Score: {r2:.2f}")
```
R² Score: 0.88
Interpretation: A significant improvement, explaining 88% of the variance.
4. Random Forest Regression
Random Forest aggregates multiple Decision Trees to enhance performance and mitigate overfitting.
```python
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the model (an ensemble of 25 trees)
model = RandomForestRegressor(n_estimators=25, random_state=10)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Random Forest Regression R² Score: {r2:.2f}")
```
R² Score: 0.91
Interpretation: Excellent performance, explaining 91% of the variance.
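Beyond the headline score, a fitted forest also exposes feature importances, which can hint at which encoded attributes drive the price predictions. A minimal sketch (assuming the columnTransformer and the Random Forest model fitted above, and scikit-learn ≥ 1.0 for get_feature_names_out):

```python
import numpy as np

# Map importances back to the encoded feature names and print the top 10
feature_names = columnTransformer.get_feature_names_out()
importances = model.feature_importances_
for idx in np.argsort(importances)[::-1][:10]:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```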
5. AdaBoost Regression
AdaBoost combines weak learners to form a strong predictor by focusing on mistakes.
```python
from sklearn.ensemble import AdaBoostRegressor

# Initialize and train the model
model = AdaBoostRegressor(random_state=0, n_estimators=100)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"AdaBoost Regression R² Score: {r2:.2f}")
```
R² Score: 0.88
Interpretation: Comparable to Decision Tree, explaining 88% of the variance.
6. XGBoost Regression
XGBoost is a powerful gradient boosting framework known for its efficiency and performance.
```python
import xgboost as xgb

# Initialize and train the model
model = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3,
    learning_rate=0.05
)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"XGBoost Regression R² Score: {r2:.2f}")
```
R² Score: 0.89
Interpretation: Robust performance, explaining 89% of the variance.
7. Support Vector Regression (SVR)
SVR is effective in high-dimensional spaces but may underperform with larger datasets.
```python
from sklearn.svm import SVR

# Initialize and train the model with default hyperparameters
model = SVR()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Support Vector Regression (SVR) R² Score: {r2:.2f}")
```
R² Score: -0.03
Interpretation: The slightly negative R² means the default SVR does marginally worse than always predicting the mean price, most likely because its default hyperparameters (C=1.0, epsilon=0.1) are far too small for target values in the tens of thousands.
Model Performance Comparison
| Model | R² Score |
|---|---|
| Linear Regression | 0.10 |
| Polynomial Linear Regression | -0.45 |
| Decision Tree Regression | 0.88 |
| Random Forest Regression | 0.91 |
| AdaBoost Regression | 0.88 |
| XGBoost Regression | 0.89 |
| Support Vector Regression (SVR) | -0.03 |
Insights:
- Random Forest Regression outperforms all other models with an R² score of 0.91, indicating it explains 91% of the variance in car prices.
- Polynomial Linear Regression performed the worst, even worse than the baseline model, suggesting overfitting or inappropriate feature transformation.
- Support Vector Regression (SVR) struggled with this dataset, possibly due to the high dimensionality post-encoding.
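For reference, the whole comparison can be reproduced in one loop over the preprocessed data from the earlier sections (a sketch assuming the X_train, X_test, y_train, y_test arrays built above; the polynomial model is omitted because it needs its own feature expansion, and exact scores may vary slightly with library versions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score
import xgboost as xgb

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regression": DecisionTreeRegressor(max_depth=4),
    "Random Forest Regression": RandomForestRegressor(n_estimators=25, random_state=10),
    "AdaBoost Regression": AdaBoostRegressor(random_state=0, n_estimators=100),
    "XGBoost Regression": xgb.XGBRegressor(n_estimators=100, reg_lambda=1, gamma=0,
                                           max_depth=3, learning_rate=0.05),
    "Support Vector Regression (SVR)": SVR(),
}

# Fit each model on the same split and report its R² score
for name, estimator in models.items():
    estimator.fit(X_train, y_train)
    score = r2_score(y_test, estimator.predict(X_test))
    print(f"{name}: {score:.2f}")
```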
Conclusion
This car price prediction exercise underscores how much both algorithm choice and thorough data preprocessing matter. In our exploration:
- Decision Tree and Random Forest models demonstrated exceptional performance, with Random Forest slightly edging out others.
- Ensemble methods like AdaBoost and XGBoost also showcased strong results, highlighting their efficacy in handling complex datasets.
- Linear models fared poorly here, and extending them with polynomial features degraded performance rather than improving it.
- Support Vector Regression (SVR) may not be the best fit for datasets with high dimensionality or where non-linear patterns are less pronounced.
Key Takeaways:
- Data Preprocessing: Handling missing values and encoding categorical variables are crucial steps that significantly influence model performance.
- Feature Scaling: Puts all features on a comparable scale, which helps gradient-based and distance-based algorithms train effectively.
- Model Selection: Ensemble methods like Random Forests and XGBoost often provide superior performance in regression tasks.
- Model Evaluation: R² score is a valuable metric for assessing how well predictions approximate the actual outcomes.
Embarking on car price prediction using advanced regression models not only enhances predictive accuracy but also equips stakeholders with actionable insights into market dynamics. As the field of machine learning continues to evolve, staying abreast of the latest algorithms and techniques remains essential for data enthusiasts and professionals alike.