Mastering Regression: A Comprehensive Template for Car Price Prediction
Unlock the full potential of regression analysis with our expert-crafted template designed for car price prediction. Whether you're experimenting with different models or tackling various regression problems, this guide provides a step-by-step approach to streamline your machine learning workflow.
Table of Contents
- Introduction to Regression in Machine Learning
- Understanding the CarPrice Dataset
- Setting Up Your Environment
- Data Preprocessing
  - Handling Missing Data
  - Feature Selection
  - Encoding Categorical Variables
  - Feature Scaling
  - Splitting the Dataset
- Building and Evaluating Models
  - Linear Regression
  - Polynomial Regression
  - Decision Tree Regressor
  - Random Forest Regressor
  - AdaBoost Regressor
  - XGBoost Regressor
  - Support Vector Regression (SVR)
- Conclusion
- Accessing the Regression Template
Introduction to Regression in Machine Learning
Regression analysis is a fundamental component of machine learning, enabling us to predict continuous outcomes based on input features. From real estate pricing to stock market forecasting, regression models play a pivotal role in decision-making processes across various industries. In this article, we'll delve into creating a robust regression template using Python, specifically tailored for predicting car prices.
Understanding the CarPrice Dataset
Our journey begins with the CarPrice dataset, sourced from Kaggle. This dataset comprises 26 columns and 205 records, making it manageable yet sufficiently complex for demonstrating regression techniques.
Dataset Structure
Here’s a snapshot of the dataset:
| car_ID | symboling | CarName | fueltype | aspiration | doornumber | carbody | drivewheel | enginelocation | wheelbase | ... | price |
|--------|-----------|---------|----------|------------|------------|---------|------------|----------------|-----------|-----|-------|
| 1 | 3 | alfa-romero giulia | gas | std | two | convertible | rwd | front | 88.6 | ... | 13495.0 |
| 2 | 3 | alfa-romero stelvio | gas | std | two | convertible | rwd | front | 88.6 | ... | 16500.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The target variable is `price`, representing the car's price in dollars.
Setting Up Your Environment
Before diving into the data, ensure you have the necessary Python libraries installed. We'll be using pandas for data manipulation, numpy for numerical operations, and scikit-learn along with XGBoost for building and evaluating models.
```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
import xgboost as xgb
```
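The preprocessing snippets below operate on a feature matrix X and target vector Y. A minimal loading sketch, assuming the CarPrice.csv file linked at the end of this article sits in your working directory:

```python
# Load the dataset and separate features (X) from the target (Y)
dataset = pd.read_csv('CarPrice.csv')  # filename assumed; adjust to your path
X = dataset.drop('price', axis=1)
Y = dataset['price']
```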
Data Preprocessing
Handling Missing Data
Data cleanliness is paramount. We'll address missing values separately for numerical and categorical data.
Numeric Data
For numerical columns, we'll use the SimpleImputer to fill missing values with the mean of each column.
```python
# Identify numerical columns by position
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Initialize imputer for numerical data
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform the data
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
```
Categorical Data
For categorical columns, we'll fill missing values with the most frequent category using SimpleImputer.
```python
# Identify categorical columns by position
string_cols = list(np.where(X.dtypes == object)[0])

# Initialize imputer for categorical data
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit and transform the data
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
```
Feature Selection
Not all features contribute meaningfully to the model. For instance, the car_ID column is merely an identifier and doesn't provide predictive value. We'll drop such irrelevant columns.
```python
# Drop the car_ID column
X.drop('car_ID', axis=1, inplace=True)

# Recompute the categorical column positions: dropping a column shifts
# every later column's index down by one
string_cols = list(np.where(X.dtypes == object)[0])
```
Encoding Categorical Variables
Machine learning models require numeric input. We'll convert categorical variables into numerical format using One-Hot Encoding.
```python
# Initialize OneHotEncoder within a ColumnTransformer
columnTransformer = ColumnTransformer(
    [('encoder', OneHotEncoder(), string_cols)],
    remainder='passthrough'
)

# Transform the data
X = columnTransformer.fit_transform(X)
```
After encoding, the dataset shape changes from (205, 24) to (205, 199), indicating the successful transformation of categorical variables.
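If you want to verify the transformation yourself, the shape and (on scikit-learn 1.0+) the generated column names are easy to inspect:

```python
# Inspect the encoded matrix
print(X.shape)  # (205, 199)

# Requires scikit-learn >= 1.0
print(columnTransformer.get_feature_names_out()[:5])
```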
Feature Scaling
Scaling ensures that all features contribute comparably to the result, which matters especially for distance-based algorithms like SVR. Two details are worth noting: the scaler is fit on the training split only (created in the next step) so that no information leaks from the test set, and with_mean=False is required because one-hot encoding produced a sparse matrix, which cannot be mean-centered without densifying it.
```python
# Initialize StandardScaler (with_mean=False for sparse input)
sc = StandardScaler(with_mean=False)

# Fit on the training data only, then apply the same scaling to both sets
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
```
Splitting the Dataset
We'll divide the dataset into training and testing sets to evaluate our model's performance.
```python
# Split the data: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, random_state=1
)
```
- Training Set: 164 samples
- Testing Set: 41 samples
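A quick sanity check of the split (shapes assume the 199-column encoded matrix):

```python
print(X_train.shape, X_test.shape)  # (164, 199) (41, 199)
print(y_train.shape, y_test.shape)  # (164,) (41,)
```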
Building and Evaluating Models
We'll explore various regression models, evaluating each with the R² score: the proportion of variance in the target explained by the model, where 1.0 is a perfect fit, 0 means the model does no better than always predicting the mean price, and negative values mean it does worse.
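For intuition, R² can also be computed directly from its definition; a minimal sketch, equivalent to sklearn's r2_score in the simple single-output case:

```python
def r2_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot
```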
1. Linear Regression
A straightforward baseline for predicting continuous values.
```python
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))  # Output: 0.0974
```
The R² score indicates that the linear model explains only about 9.74% of the variance in test-set prices, a weak result that is likely down to one-hot encoding leaving us with more features (199) than training samples (164).
2. Polynomial Regression
Captures non-linear relationships by introducing polynomial features.
```python
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

model = LinearRegression()
model.fit(X_train_poly, y_train)
y_pred = model.predict(X_test_poly)
print(r2_score(y_test, y_pred))  # Output: -0.4531
```
The negative R² score means the model performs worse than simply predicting the mean price, which points to overfitting or an inappropriate degree choice.
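To see why degree 2 struggles here, check how many columns the expansion creates: with 199 input features, PolynomialFeatures produces 1 + 199 + 199·200/2 = 20,100 output columns, dwarfing the 164 training samples.

```python
# Expected output with the 199-column encoded matrix: (164, 20100)
print(X_train_poly.shape)
```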
3. Decision Tree Regressor
A non-linear model that recursively splits the data into homogeneous subsets.
```python
model = DecisionTreeRegressor(max_depth=4)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))  # Output: 0.8840
```
Significantly higher R² score, indicating better performance.
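The max_depth=4 setting is presumably the result of experimentation; a simple sweep like the following (illustrative, not part of the original template) shows how depth affects the test score:

```python
# Try several depths; None lets the tree grow until leaves are pure
for depth in [2, 4, 6, 8, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, r2_score(y_test, tree.predict(X_test)))
```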
4. Random Forest Regressor
An ensemble method that builds multiple decision trees and averages their predictions.
```python
model = RandomForestRegressor(n_estimators=25, random_state=10)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))  # Output: 0.9108
```
An impressive R² score of 91.08%, showcasing robust performance.
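A side benefit of tree ensembles is that the fitted model exposes feature_importances_, so you can see which encoded columns drive the predictions. An illustrative snippet (indices refer to the one-hot-encoded matrix):

```python
# Ten most important encoded features by impurity-based importance
top = np.argsort(model.feature_importances_)[::-1][:10]
for i in top:
    print(i, model.feature_importances_[i])
```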
5. AdaBoost Regressor
A boosting technique that combines weak learners to form a strong predictor.
```python
model = AdaBoostRegressor(random_state=0, n_estimators=100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))  # Output: 0.8807
```
Achieves an R² score of 88.07%.
6. XGBoost Regressor
A scalable and efficient implementation of gradient boosting.
```python
model = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3,
    learning_rate=0.05
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))  # Output: 0.8947
```
Delivers an R² score of 89.47%.
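With only 41 test samples, a single split's R² is fairly noisy. Cross-validation (not part of the original template, but a straightforward extension since XGBRegressor follows the scikit-learn API) gives a steadier estimate:

```python
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated R² on the full encoded (unscaled) matrix
scores = cross_val_score(model, X, Y, cv=5, scoring='r2')
print(scores.mean(), scores.std())
```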
7. Support Vector Regression (SVR)
Effective in high-dimensional spaces, SVR uses the kernel trick to model non-linear data.
```python
model = SVR()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))  # Output: -0.0271
```
The negative R² score indicates poor out-of-the-box performance: SVR is highly sensitive to its hyperparameters and to the scale of the target, so it typically needs tuning before it can compete.
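By default, scikit-learn's SVR uses an RBF kernel with C=1.0 and epsilon=0.1; with prices in the tens of thousands of dollars, those defaults are far too conservative. A sketch of how one might tune it with GridSearchCV (the parameter values are illustrative, not recommendations):

```python
from sklearn.model_selection import GridSearchCV

# Illustrative search space; widen or narrow as results suggest
param_grid = {
    'C': [1, 10, 100, 1000],
    'epsilon': [0.1, 1, 10, 100],
    'kernel': ['rbf', 'linear'],
}
search = GridSearchCV(SVR(), param_grid, scoring='r2', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```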
Conclusion
This comprehensive regression template offers a systematic approach to handling regression problems, from data preprocessing to model evaluation. While simple models like Linear Regression may fall short, ensemble methods like Random Forest and XGBoost demonstrate superior performance in predicting car prices. Tailoring this template to your specific dataset can enhance predictive accuracy and streamline your machine learning projects.
Accessing the Regression Template
Ready to implement this regression workflow? Access the complete Jupyter Notebook and CarPrice.csv dataset here. Utilize these resources to kickstart your machine learning projects and achieve accurate predictive models with ease.
Enhance your regression analysis skills today and unlock new opportunities in data-driven decision-making!