Implementing Polynomial Regression and Decision Tree Regressor on Insurance Data: A Comprehensive Guide
In the realm of machine learning, regression models play a pivotal role in predicting continuous outcomes. This article delves into the application of Polynomial Regression and Decision Tree Regressor on an insurance dataset, offering a step-by-step guide to data preprocessing, model building, evaluation, and optimization. Whether you’re a seasoned data scientist or a budding enthusiast, this comprehensive guide will equip you with the knowledge to implement and compare these regression techniques effectively.
Table of Contents
- Introduction
- Dataset Overview
- Data Preprocessing
- Splitting Data into Training and Testing Sets
- Building and Evaluating a Polynomial Regression Model
- Implementing Decision Tree Regressor
- Hyperparameter Tuning and Its Impact
- Cross-Validation and Model Stability
- Comparison of Models
- Conclusion and Best Practices
Introduction
Machine learning offers a spectrum of regression techniques suitable for various predictive tasks. This guide focuses on two such methods:
- Polynomial Regression: Extends linear regression by considering polynomial relationships between the independent and dependent variables.
- Decision Tree Regressor: Utilizes tree-like models of decisions to predict continuous values.
Applying these models to an insurance dataset allows us to predict insurance charges based on factors like age, BMI, smoking habits, and more.
Dataset Overview
We utilize the Insurance Dataset from Kaggle, which contains the following features:
- Age: Age of the primary beneficiary.
- Sex: Gender of the beneficiary.
- BMI: Body Mass Index.
- Children: Number of children covered by insurance.
- Smoker: Smoking status.
- Region: Residential area of the beneficiary.
- Charges: Individual medical costs billed by health insurance.
The goal is to predict Charges based on the other features.
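Before any preprocessing, it helps to load the file and get a feel for the columns, data types, and value ranges. Here is a minimal inspection sketch, assuming the same CSV filename used in the preprocessing code later in this guide:

```python
import pandas as pd

# Load the insurance dataset (same file used in the preprocessing step below)
data = pd.read_csv('S07_datasets_13720_18513_insurance.csv')

# Quick look at shape, column types, and summary statistics
print(data.shape)
print(data.dtypes)
print(data.describe(include='all'))
```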
Data Preprocessing
Effective data preprocessing is crucial for building accurate machine learning models. This section covers Label Encoding and One-Hot Encoding to handle categorical variables.
Label Encoding
Label Encoding transforms categorical text data into numerical form, which is essential for machine learning algorithms.
```python
from sklearn import preprocessing
import pandas as pd

# Load dataset
data = pd.read_csv('S07_datasets_13720_18513_insurance.csv')
X = data.iloc[:, :-1]
Y = data.iloc[:, -1]

# Initialize LabelEncoder
le = preprocessing.LabelEncoder()

# Encode 'sex' and 'smoker' columns
X['sex'] = le.fit_transform(X['sex'])
X['smoker'] = le.fit_transform(X['smoker'])

print(X.head())
```
Output:
```
   age  sex     bmi  children  smoker     region
0   19    0  27.900         0       1  southwest
1   18    1  33.770         1       0  southeast
...
```
One-Hot Encoding
One-Hot Encoding converts a categorical variable into a set of binary indicator columns, one per category, so that algorithms do not mistakenly treat category labels as ordered numeric values.
```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Initialize ColumnTransformer with OneHotEncoder for 'region' (column index 5)
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [5])],
                                      remainder='passthrough')

# Apply transformation
X = columnTransformer.fit_transform(X)
print(X)
```
Output:
```
[[0. 0. 0. ... 27.9  0. 1.]
 [0. 0. 1. ... 33.77 1. 0.]
 ...]
```
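As a side note, a similar encoding can be produced with pandas' get_dummies if you prefer to keep a labeled DataFrame rather than a NumPy array. This sketch is an alternative to the ColumnTransformer call above and would be applied to X while it is still a DataFrame (i.e., before that call):

```python
import pandas as pd

# Alternative: one-hot encode 'region' directly on the DataFrame,
# keeping column names for the remaining features
X_dummies = pd.get_dummies(X, columns=['region'])
print(X_dummies.head())
```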
Splitting Data into Training and Testing Sets
Splitting the dataset ensures that the model’s performance is evaluated on unseen data, providing a better estimate of its real-world performance.
```python
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)
```
Building and Evaluating a Polynomial Regression Model
Polynomial Regression allows the model to fit a non-linear relationship between the independent and dependent variables.
```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Initialize PolynomialFeatures with degree 2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train)

# Initialize and fit Linear Regression model
model = LinearRegression()
model.fit(X_poly, y_train)

# Predict on test set
y_pred = model.predict(poly.transform(X_test))

# Evaluate model
r2 = r2_score(y_test, y_pred)
print(f'Polynomial Regression R2 Score: {r2:.2f}')
```
Output:
```
Polynomial Regression R2 Score: 0.86
```
An R² score of 0.86 indicates that approximately 86% of the variance in the insurance charges is explained by the model.
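For intuition, R² compares the model's squared prediction errors against those of a baseline that always predicts the mean of the target. The following sketch reproduces the same computation that r2_score performs:

```python
import numpy as np

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_manual = 1 - ss_res / ss_tot
print(f'Manually computed R2: {r2_manual:.2f}')
```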
Implementing Decision Tree Regressor
Decision Trees partition the data into subsets based on feature values, allowing for complex modeling of relationships.
```python
from sklearn.tree import DecisionTreeRegressor

# Initialize Decision Tree Regressor with max_depth=4
dt_model = DecisionTreeRegressor(max_depth=4)
dt_model.fit(X_train, y_train)

# Predict on test set
y_pred_dt = dt_model.predict(X_test)

# Evaluate model
r2_dt = r2_score(y_test, y_pred_dt)
print(f'Decision Tree Regressor R2 Score: {r2_dt:.2f}')
```
Output:
```
Decision Tree Regressor R2 Score: 0.87
```
Interestingly, the Decision Tree Regressor achieved a slightly higher R² score than the Polynomial Regression model in this instance.
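A practical advantage of tree models is the built-in feature_importances_ attribute, which indicates how much each column contributed to the splits. A quick inspection sketch follows; the column order reflects the ColumnTransformer output (the four one-hot 'region' columns first, then the passthrough columns), and the readable labels are added here purely for illustration:

```python
# Inspect which features the fitted tree relied on most.
# Column order: four one-hot 'region' columns (alphabetical categories),
# then the passthrough columns age, sex, bmi, children, smoker.
feature_names = ['region_northeast', 'region_northwest', 'region_southeast',
                 'region_southwest', 'age', 'sex', 'bmi', 'children', 'smoker']
for name, importance in zip(feature_names, dt_model.feature_importances_):
    print(f'{name}: {importance:.3f}')
```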
Hyperparameter Tuning and Its Impact
Hyperparameters like max_depth significantly impact the model's performance by controlling the complexity of the Decision Tree.
```python
# Experimenting with different max_depth values
for depth in [2, 3, 4, 10]:
    dt_model = DecisionTreeRegressor(max_depth=depth, random_state=1)
    dt_model.fit(X_train, y_train)
    y_pred_dt = dt_model.predict(X_test)
    r2_dt = r2_score(y_test, y_pred_dt)
    print(f'max_depth={depth} => R2 Score: {r2_dt:.2f}')
```
Output:
```
max_depth=2 => R2 Score: 0.75
max_depth=3 => R2 Score: 0.86
max_depth=4 => R2 Score: 0.87
max_depth=10 => R2 Score: 0.75
```
- max_depth=2: The tree underfits, producing a lower R² score.
- max_depth=3 and 4: Near-optimal performance with the highest R² scores.
- max_depth=10: The tree overfits the training data, so performance on the test set drops.
Conclusion: Selecting an appropriate max_depth is crucial to balance bias and variance, ensuring the model generalizes well to unseen data.
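Instead of looping over depths by hand, scikit-learn's GridSearchCV can search max_depth (and other complexity parameters) with cross-validation built in. A minimal sketch with an illustrative parameter grid:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Illustrative grid over tree-complexity parameters
param_grid = {
    'max_depth': [2, 3, 4, 5, 10],
    'min_samples_leaf': [1, 5, 10],
}

grid = GridSearchCV(DecisionTreeRegressor(random_state=1),
                    param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print(f'Best cross-validated R2 Score: {grid.best_score_:.2f}')
```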
Cross-Validation and Model Stability
Cross-validation, specifically K-Fold Cross-Validation, provides a more robust estimation of the model’s performance by partitioning the data into k subsets and iteratively training and testing the model.
```python
from sklearn.model_selection import cross_val_score

# Initialize Decision Tree Regressor
dt_model = DecisionTreeRegressor(max_depth=4, random_state=1)

# Perform 5-Fold Cross-Validation
cv_scores = cross_val_score(dt_model, X, Y, cv=5, scoring='r2')

print(f'Cross-Validation R2 Scores: {cv_scores}')
print(f'Average R2 Score: {cv_scores.mean():.2f}')
```
Output:
```
Cross-Validation R2 Scores: [0.85 0.86 0.87 0.88 0.86]
Average R2 Score: 0.86
```
Benefit: Cross-validation mitigates the risk of model evaluation based on a single train-test split, providing a more generalized performance metric.
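If you also want error metrics alongside R², cross_validate accepts several scorers at once over the same folds; a minimal sketch:

```python
from sklearn.model_selection import cross_validate

# Evaluate R² and (negated) mean absolute error across the same 5 folds
results = cross_validate(dt_model, X, Y, cv=5,
                         scoring=['r2', 'neg_mean_absolute_error'])

print(f"Average R2 Score: {results['test_r2'].mean():.2f}")
print(f"Average MAE: {-results['test_neg_mean_absolute_error'].mean():.0f}")
```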
Comparison of Models
| Model | R² Score |
|---|---|
| Polynomial Regression | 0.86 |
| Decision Tree Regressor | 0.87 |
Insights:
- Decision Tree Regressor slightly outperforms Polynomial Regression in this case.
- Proper Hyperparameter Tuning significantly enhances the Decision Tree’s performance.
- Both models have their merits; the choice depends on the specific use case and data characteristics.
Conclusion and Best Practices
In this guide, we explored the implementation of Polynomial Regression and Decision Tree Regressor on an insurance dataset. Key takeaways include:
- Data Preprocessing: Proper encoding of categorical variables is essential for model accuracy.
- Model Evaluation: R² Score serves as a reliable metric to assess model performance.
- Hyperparameter Tuning: Adjusting parameters like max_depth can prevent overfitting and underfitting.
- Cross-Validation: Enhances the reliability of performance metrics.
Best Practices:
- Understand Your Data: Before modeling, explore and understand the dataset to make informed preprocessing and modeling decisions.
- Feature Engineering: Consider creating new features or transforming existing ones to capture underlying patterns.
- Model Selection: Experiment with multiple algorithms to identify the best performer for your specific task.
- Regularization Techniques: Utilize techniques like pruning in Decision Trees to prevent overfitting (see the sketch after this list).
- Continuous Learning: Stay updated with the latest machine learning techniques and best practices.
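On the pruning point above: besides limiting max_depth, DecisionTreeRegressor supports cost-complexity pruning through the ccp_alpha parameter, where larger values prune more aggressively. A minimal sketch with an illustrative alpha value (not tuned for this dataset):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Cost-complexity pruning: ccp_alpha here is an illustrative value and should
# be tuned (e.g., via cost_complexity_pruning_path or GridSearchCV)
pruned_model = DecisionTreeRegressor(random_state=1, ccp_alpha=1000.0)
pruned_model.fit(X_train, y_train)

r2_pruned = r2_score(y_test, pruned_model.predict(X_test))
print(f'Pruned Decision Tree R2 Score: {r2_pruned:.2f}')
```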
By adhering to these practices, you can build robust and accurate predictive models tailored to your dataset and objectives.
Empower your data science journey by experimenting with these models on various datasets and exploring advanced techniques to further enhance model performance.