Mastering Car Price Prediction with Advanced Regression Models: A Comprehensive Guide
Table of Contents
- Introduction
- Dataset Overview
- Data Import and Initial Exploration
- Data Cleaning and Preprocessing
- Feature Selection and Encoding
- Train-Test Split
- Feature Scaling
- Building and Evaluating Regression Models
- Model Performance Comparison
- Conclusion
Introduction
Predictive analytics empowers businesses to anticipate future trends, optimize operations, and enhance decision-making processes. Car price prediction is a quintessential example where machine learning models can forecast vehicle prices based on attributes like brand, engine specifications, fuel type, and more. This guide walks you through building a comprehensive regression model pipeline, from data preprocessing to evaluating multiple regression algorithms.
Dataset Overview
The Car Price Prediction dataset on Kaggle is a rich resource containing 205 entries with 26 features each. These features encompass various aspects of cars, such as the number of doors, engine size, horsepower, fuel type, and more, all of which influence the car’s market price.
Key Features:
- CarName: Name of the car (brand and model)
- FuelType: Type of fuel used (e.g., gas, diesel)
- Aspiration: Engine aspiration type
- Doornumber: Number of doors (two or four)
- Enginesize: Size of the engine
- Horsepower: Engine power
- Price: Market price of the car (target variable)
Data Import and Initial Exploration
First, we import the dataset using pandas and take a preliminary look at the data structure.
```python
import pandas as pd

# Load the dataset
data = pd.read_csv('CarPrice.csv')

# Display the first five rows
print(data.head())
```
Sample Output:
```
   car_ID  symboling                   CarName fueltype aspiration doornumber  \
0       1          3        alfa-romero giulia      gas        std        two
1       2          3       alfa-romero stelvio      gas        std        two
2       3          1  alfa-romero Quadrifoglio      gas        std        two
3       4          2               audi 100 ls      gas        std       four
4       5          2                audi 100ls      gas        std       four

       carbody drivewheel enginelocation  wheelbase  ...  horsepower  peakrpm  citympg  \
0  convertible        rwd          front       88.6  ...       111.0     5000       21
1  convertible        rwd          front       88.6  ...       111.0     5000       21
2    hatchback        rwd          front       94.5  ...       154.0     5000       19
3        sedan        fwd          front       99.8  ...       102.0     5500       24
4        sedan        4wd          front       99.4  ...       115.0     5500       18

   highwaympg    price
0          27  13495.0
1          27  16500.0
2          26  16500.0
3          30  13950.0
4          22  17450.0
```
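Beyond head(), it is worth checking the overall structure and which columns contain missing values before any preprocessing. A minimal sketch of this initial exploration (output omitted; the exact missing-value counts depend on the copy of the dataset you download):

```python
# Inspect dimensions, column dtypes, and missing values
print(data.shape)            # expected: (205, 26)
print(data.dtypes)           # which columns are numerical vs. categorical
print(data.isnull().sum())   # count of missing values per column
```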
Data Cleaning and Preprocessing
Handling Missing Numerical Data
Missing values can significantly skew the performance of machine learning models. We first address missing numerical data by imputing with the mean value.
```python
import numpy as np
from sklearn.impute import SimpleImputer

# Separate the features (X) from the target variable (Y)
X = data.drop('price', axis=1)
Y = data['price']

# Identify numerical columns by position
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Initialize the imputer with a mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X.iloc[:, numerical_cols])

# Impute missing numerical data
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
```
Handling Missing Categorical Data
For categorical variables, missing values are imputed using the most frequent strategy.
```python
# Identify categorical (object-dtype) columns by position
string_cols = list(np.where(X.dtypes == object)[0])

# Initialize imputer for categorical data with a most-frequent strategy
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_freq.fit(X.iloc[:, string_cols])

# Impute missing categorical data
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
```
Feature Selection and Encoding
Dropping Irrelevant Features
The car_ID column is a unique identifier and does not contribute to the predictive power of the model, so it is removed.
```python
# Drop the 'car_ID' column
X.drop('car_ID', axis=1, inplace=True)
```
One-Hot Encoding Categorical Variables
Machine learning algorithms require numerical input. Therefore, categorical variables are transformed using One-Hot Encoding.
```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Re-identify categorical columns after dropping 'car_ID'
string_cols = list(np.where(X.dtypes == object)[0])

# One-hot encode the categorical columns, passing the remaining columns through unchanged
columnTransformer = ColumnTransformer(
    [('encoder', OneHotEncoder(), string_cols)],
    remainder='passthrough'
)
X = columnTransformer.fit_transform(X)
```
Before Encoding:
- Shape: (205, 24)

After Encoding:
- Shape: (205, 199)

Most of this growth comes from the CarName column, which has well over a hundred distinct values and therefore contributes one indicator column per unique name.
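To make the transformation concrete, here is a small standalone illustration (not part of the pipeline above) of what the encoder does to a single categorical column such as fueltype:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy example: encode one categorical column with two categories
fuel = pd.DataFrame({'fueltype': ['gas', 'diesel', 'gas']})
encoded = OneHotEncoder().fit_transform(fuel).toarray()
print(encoded)
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]]
# Columns follow the sorted categories: ['diesel', 'gas']
```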
Train-Test Split
Splitting the dataset into training and testing sets is crucial for evaluating model performance.
```python
from sklearn.model_selection import train_test_split

# Perform an 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, random_state=1
)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
```
Output:
```
Training set shape: (164, 199)
Testing set shape: (41, 199)
```
Feature Scaling
Feature scaling puts all features on a comparable scale so that no single feature dominates the model. Here we use standardization; with_mean=False is required because the one-hot encoded feature matrix is sparse and cannot be mean-centered without densifying it.
```python
from sklearn import preprocessing

# Initialize StandardScaler (with_mean=False keeps the sparse matrix sparse)
sc = preprocessing.StandardScaler(with_mean=False)
sc.fit(X_train)

# Transform the training and test data
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
```
Building and Evaluating Regression Models
We will explore several regression models, evaluating each based on the R² score.
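As a refresher, R² compares the model's squared error to that of a constant predictor that always outputs the mean of the test targets: 1 is a perfect fit, 0 matches the mean predictor, and negative values are worse than it. A minimal sketch of the computation (equivalent to sklearn.metrics.r2_score for the single-output case):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, where SS_tot uses the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot
```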
1. Linear Regression
Linear Regression serves as a baseline model.
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Linear Regression R² Score: {r2:.2f}")
```
R² Score: 0.097
Interpretation: The model explains approximately 9.7% of the variance in car prices.
2. Polynomial Linear Regression
To capture non-linear relationships, we introduce polynomial features.
```python
from sklearn.preprocessing import PolynomialFeatures

# Expand the features with all degree-2 terms
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Train the model on the expanded features
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_poly)
r2 = r2_score(y_test, y_pred)
print(f"Polynomial Linear Regression R² Score: {r2:.2f}")
```
R² Score: -0.45
Interpretation: A negative R² means the predictions are worse than simply predicting the mean price for every car, so the polynomial model underperforms even that trivial baseline.
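One likely culprit is the explosion in feature count: a degree-2 expansion of the 199 encoded columns produces roughly 20,000 terms against only 164 training rows, which invites severe overfitting. A small standalone illustration of what PolynomialFeatures generates for two inputs [a, b]:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion of [a, b] -> [1, a, b, a^2, a*b, b^2]
demo = np.array([[2.0, 3.0]])
print(PolynomialFeatures(degree=2).fit_transform(demo))
# [[1. 2. 3. 4. 6. 9.]]
```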
3. Decision Tree Regression
Decision Trees can model complex relationships by partitioning the data.
```python
from sklearn.tree import DecisionTreeRegressor

# Initialize and train the model (depth limited to 4 to curb overfitting)
model = DecisionTreeRegressor(max_depth=4)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Decision Tree Regression R² Score: {r2:.2f}")
```
R² Score: 0.88
Interpretation: A significant improvement, explaining 88% of the variance.
4. Random Forest Regression
Random Forest aggregates multiple Decision Trees to enhance performance and mitigate overfitting.
```python
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the model (an ensemble of 25 trees)
model = RandomForestRegressor(n_estimators=25, random_state=10)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Random Forest Regression R² Score: {r2:.2f}")
```
R² Score: 0.91
Interpretation: Excellent performance, explaining 91% of the variance.
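Beyond the headline score, a fitted forest also exposes feature importances, which can hint at which encoded attributes drive the price predictions. A minimal sketch (assuming the columnTransformer and the Random Forest model fitted above, and scikit-learn ≥ 1.0 for get_feature_names_out):

```python
import numpy as np

# Map importances back to the encoded feature names and print the top 10
feature_names = columnTransformer.get_feature_names_out()
importances = model.feature_importances_
for idx in np.argsort(importances)[::-1][:10]:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```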
5. AdaBoost Regression
AdaBoost combines weak learners to form a strong predictor by focusing on mistakes.
```python
from sklearn.ensemble import AdaBoostRegressor

# Initialize and train the model
model = AdaBoostRegressor(random_state=0, n_estimators=100)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"AdaBoost Regression R² Score: {r2:.2f}")
```
R² Score: 0.88
Interpretation: Comparable to Decision Tree, explaining 88% of the variance.
6. XGBoost Regression
XGBoost is a powerful gradient boosting framework known for its efficiency and performance.
```python
import xgboost as xgb

# Initialize and train the model
model = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3,
    learning_rate=0.05
)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"XGBoost Regression R² Score: {r2:.2f}")
```
R² Score: 0.89
Interpretation: Robust performance, explaining 89% of the variance.
7. Support Vector Regression (SVR)
SVR is effective in high-dimensional spaces but may underperform with larger datasets.
```python
from sklearn.svm import SVR

# Initialize and train the model with default hyperparameters
model = SVR()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Support Vector Regression (SVR) R² Score: {r2:.2f}")
```
R² Score: -0.03
Interpretation: The slightly negative R² means the default SVR does marginally worse than always predicting the mean price, most likely because its default hyperparameters (C=1.0, epsilon=0.1) are far too small for target values in the tens of thousands.
Model Performance Comparison
| Model | R² Score |
|---|---|
| Linear Regression | 0.10 |
| Polynomial Linear Regression | -0.45 |
| Decision Tree Regression | 0.88 |
| Random Forest Regression | 0.91 |
| AdaBoost Regression | 0.88 |
| XGBoost Regression | 0.89 |
| Support Vector Regression (SVR) | -0.03 |
Insights:
- Random Forest Regression outperforms all other models with an R² score of 0.91, indicating it explains 91% of the variance in car prices.
- Polynomial Linear Regression performed the worst, even worse than the baseline model, suggesting overfitting or inappropriate feature transformation.
- Support Vector Regression (SVR) struggled with this dataset, possibly due to the high dimensionality post-encoding.
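For reference, the whole comparison can be reproduced in one loop over the preprocessed data from the earlier sections (a sketch assuming the X_train, X_test, y_train, y_test arrays built above; the polynomial model is omitted because it needs its own feature expansion, and exact scores may vary slightly with library versions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score
import xgboost as xgb

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regression": DecisionTreeRegressor(max_depth=4),
    "Random Forest Regression": RandomForestRegressor(n_estimators=25, random_state=10),
    "AdaBoost Regression": AdaBoostRegressor(random_state=0, n_estimators=100),
    "XGBoost Regression": xgb.XGBRegressor(n_estimators=100, reg_lambda=1, gamma=0,
                                           max_depth=3, learning_rate=0.05),
    "Support Vector Regression (SVR)": SVR(),
}

# Fit each model on the same split and report its R² score
for name, estimator in models.items():
    estimator.fit(X_train, y_train)
    score = r2_score(y_test, estimator.predict(X_test))
    print(f"{name}: {score:.2f}")
```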
Conclusion
This car price prediction exercise underscores how much both algorithm choice and thorough data preprocessing matter. In our exploration:
- Decision Tree and Random Forest models demonstrated exceptional performance, with Random Forest slightly edging out others.
- Ensemble methods like AdaBoost and XGBoost also showcased strong results, highlighting their efficacy in handling complex datasets.
- Linear models fared poorly here, and extending them with polynomial features degraded performance rather than improving it.
- Support Vector Regression (SVR) may not be the best fit for datasets with high dimensionality or where non-linear patterns are less pronounced.
Key Takeaways:
- Data Preprocessing: Handling missing values and encoding categorical variables are crucial steps that significantly influence model performance.
- Feature Scaling: Puts all features on a comparable scale, which helps gradient-based and distance-based algorithms train effectively.
- Model Selection: Ensemble methods like Random Forests and XGBoost often provide superior performance in regression tasks.
- Model Evaluation: R² score is a valuable metric for assessing how well predictions approximate the actual outcomes.
Embarking on car price prediction using advanced regression models not only enhances predictive accuracy but also equips stakeholders with actionable insights into market dynamics. As the field of machine learning continues to evolve, staying abreast of the latest algorithms and techniques remains essential for data enthusiasts and professionals alike.