Implementing Polynomial Regression and Decision Tree Regressor on Insurance Data: A Comprehensive Guide
In the realm of machine learning, regression models play a pivotal role in predicting continuous outcomes. This article delves into the application of Polynomial Regression and Decision Tree Regressor on an insurance dataset, offering a step-by-step guide to data preprocessing, model building, evaluation, and optimization. Whether you’re a seasoned data scientist or a budding enthusiast, this comprehensive guide will equip you with the knowledge to implement and compare these regression techniques effectively.
Table of Contents
- Introduction
- Dataset Overview
- Data Preprocessing
- Splitting Data into Training and Testing Sets
- Building and Evaluating a Polynomial Regression Model
- Implementing Decision Tree Regressor
- Hyperparameter Tuning and Its Impact
- Cross-Validation and Model Stability
- Comparison of Models
- Conclusion and Best Practices
Introduction
Machine learning offers a spectrum of regression techniques suitable for various predictive tasks. This guide focuses on two such methods:
- Polynomial Regression: Extends linear regression by considering polynomial relationships between the independent and dependent variables.
- Decision Tree Regressor: Utilizes tree-like models of decisions to predict continuous values.
Applying these models to an insurance dataset allows us to predict insurance charges based on factors like age, BMI, smoking habits, and more.
Dataset Overview
We utilize the Insurance Dataset from Kaggle, which contains the following features:
- Age: Age of the primary beneficiary.
- Sex: Gender of the beneficiary.
- BMI: Body Mass Index.
- Children: Number of children covered by insurance.
- Smoker: Smoking status.
- Region: Residential area of the beneficiary.
- Charges: Individual medical costs billed by health insurance.
The goal is to predict Charges based on the other features.
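Before any preprocessing, it helps to load the file and get a feel for the columns, data types, and value ranges. Here is a minimal inspection sketch, assuming the same CSV filename used in the preprocessing code later in this guide:

```python
import pandas as pd

# Load the insurance dataset (same file used in the preprocessing step below)
data = pd.read_csv('S07_datasets_13720_18513_insurance.csv')

# Quick look at shape, column types, and summary statistics
print(data.shape)
print(data.dtypes)
print(data.describe(include='all'))
```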
Data Preprocessing
Effective data preprocessing is crucial for building accurate machine learning models. This section covers Label Encoding and One-Hot Encoding to handle categorical variables.
Label Encoding
Label Encoding transforms categorical text data into numerical form, which is essential for machine learning algorithms.
```python
from sklearn import preprocessing
import pandas as pd

# Load dataset
data = pd.read_csv('S07_datasets_13720_18513_insurance.csv')
X = data.iloc[:, :-1]
Y = data.iloc[:, -1]

# Initialize LabelEncoder
le = preprocessing.LabelEncoder()

# Encode 'sex' and 'smoker' columns
X['sex'] = le.fit_transform(X['sex'])
X['smoker'] = le.fit_transform(X['smoker'])

print(X.head())
```
Output:
```
   age  sex     bmi  children  smoker     region
0   19    0  27.900         0       1  southwest
1   18    1  33.770         1       0  southeast
...
```
One-Hot Encoding
One-Hot Encoding converts a categorical variable into a set of binary indicator columns, one per category, so that algorithms do not mistakenly treat category labels as ordered numeric values.
```python
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Initialize ColumnTransformer with OneHotEncoder for 'region' (column index 5)
columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), [5])],
                                      remainder='passthrough')

# Apply transformation
X = columnTransformer.fit_transform(X)
print(X)
```
Output:
```
[[0. 0. 0. ... 27.9  0. 1.]
 [0. 0. 1. ... 33.77 1. 0.]
 ...]
```
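As a side note, a similar encoding can be produced with pandas' get_dummies if you prefer to keep a labeled DataFrame rather than a NumPy array. This sketch is an alternative to the ColumnTransformer call above and would be applied to X while it is still a DataFrame (i.e., before that call):

```python
import pandas as pd

# Alternative: one-hot encode 'region' directly on the DataFrame,
# keeping column names for the remaining features
X_dummies = pd.get_dummies(X, columns=['region'])
print(X_dummies.head())
```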
Splitting Data into Training and Testing Sets
Splitting the dataset ensures that the model’s performance is evaluated on unseen data, providing a better estimate of its real-world performance.
```python
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)
```
Building and Evaluating a Polynomial Regression Model
Polynomial Regression allows the model to fit a non-linear relationship between the independent and dependent variables.
```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Initialize PolynomialFeatures with degree 2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_train)

# Initialize and fit Linear Regression model
model = LinearRegression()
model.fit(X_poly, y_train)

# Predict on test set
y_pred = model.predict(poly.transform(X_test))

# Evaluate model
r2 = r2_score(y_test, y_pred)
print(f'Polynomial Regression R2 Score: {r2:.2f}')
```
Output:
```
Polynomial Regression R2 Score: 0.86
```
An R² score of 0.86 indicates that approximately 86% of the variance in the insurance charges is explained by the model.
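For intuition, R² compares the model's squared prediction errors against those of a baseline that always predicts the mean of the target. The following sketch reproduces the same computation that r2_score performs:

```python
import numpy as np

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_manual = 1 - ss_res / ss_tot
print(f'Manually computed R2: {r2_manual:.2f}')
```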
Implementing Decision Tree Regressor
Decision Trees partition the data into subsets based on feature values, allowing for complex modeling of relationships.
```python
from sklearn.tree import DecisionTreeRegressor

# Initialize Decision Tree Regressor with max_depth=4
dt_model = DecisionTreeRegressor(max_depth=4)
dt_model.fit(X_train, y_train)

# Predict on test set
y_pred_dt = dt_model.predict(X_test)

# Evaluate model
r2_dt = r2_score(y_test, y_pred_dt)
print(f'Decision Tree Regressor R2 Score: {r2_dt:.2f}')
```
Output:
```
Decision Tree Regressor R2 Score: 0.87
```
Interestingly, the Decision Tree Regressor achieved a slightly higher R² score than the Polynomial Regression model in this instance.
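A practical advantage of tree models is the built-in feature_importances_ attribute, which indicates how much each column contributed to the splits. A quick inspection sketch follows; the column order reflects the ColumnTransformer output (the four one-hot 'region' columns first, then the passthrough columns), and the readable labels are added here purely for illustration:

```python
# Inspect which features the fitted tree relied on most.
# Column order: four one-hot 'region' columns (alphabetical categories),
# then the passthrough columns age, sex, bmi, children, smoker.
feature_names = ['region_northeast', 'region_northwest', 'region_southeast',
                 'region_southwest', 'age', 'sex', 'bmi', 'children', 'smoker']
for name, importance in zip(feature_names, dt_model.feature_importances_):
    print(f'{name}: {importance:.3f}')
```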
Hyperparameter Tuning and Its Impact
Hyperparameters like max_depth significantly impact the model's performance by controlling the complexity of the Decision Tree.
```python
# Experimenting with different max_depth values
for depth in [2, 3, 4, 10]:
    dt_model = DecisionTreeRegressor(max_depth=depth, random_state=1)
    dt_model.fit(X_train, y_train)
    y_pred_dt = dt_model.predict(X_test)
    r2_dt = r2_score(y_test, y_pred_dt)
    print(f'max_depth={depth} => R2 Score: {r2_dt:.2f}')
```
Output:
```
max_depth=2 => R2 Score: 0.75
max_depth=3 => R2 Score: 0.86
max_depth=4 => R2 Score: 0.87
max_depth=10 => R2 Score: 0.75
```
- max_depth=2: The tree underfits, producing a lower R² score.
- max_depth=3 and 4: Near-optimal performance with the highest R² scores.
- max_depth=10: The tree overfits the training data, so performance on the test set drops.
Conclusion: Selecting an appropriate max_depth is crucial to balance bias and variance, ensuring the model generalizes well to unseen data.
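Instead of looping over depths by hand, scikit-learn's GridSearchCV can search max_depth (and other complexity parameters) with cross-validation built in. A minimal sketch with an illustrative parameter grid:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Illustrative grid over tree-complexity parameters
param_grid = {
    'max_depth': [2, 3, 4, 5, 10],
    'min_samples_leaf': [1, 5, 10],
}

grid = GridSearchCV(DecisionTreeRegressor(random_state=1),
                    param_grid, cv=5, scoring='r2')
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print(f'Best cross-validated R2 Score: {grid.best_score_:.2f}')
```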
Cross-Validation and Model Stability
Cross-validation, specifically K-Fold Cross-Validation, provides a more robust estimation of the model’s performance by partitioning the data into k subsets and iteratively training and testing the model.
```python
from sklearn.model_selection import cross_val_score

# Initialize Decision Tree Regressor
dt_model = DecisionTreeRegressor(max_depth=4, random_state=1)

# Perform 5-Fold Cross-Validation
cv_scores = cross_val_score(dt_model, X, Y, cv=5, scoring='r2')

print(f'Cross-Validation R2 Scores: {cv_scores}')
print(f'Average R2 Score: {cv_scores.mean():.2f}')
```
Output:
```
Cross-Validation R2 Scores: [0.85 0.86 0.87 0.88 0.86]
Average R2 Score: 0.86
```
Benefit: Cross-validation mitigates the risk of model evaluation based on a single train-test split, providing a more generalized performance metric.
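If you also want error metrics alongside R², cross_validate accepts several scorers at once over the same folds; a minimal sketch:

```python
from sklearn.model_selection import cross_validate

# Evaluate R² and (negated) mean absolute error across the same 5 folds
results = cross_validate(dt_model, X, Y, cv=5,
                         scoring=['r2', 'neg_mean_absolute_error'])

print(f"Average R2 Score: {results['test_r2'].mean():.2f}")
print(f"Average MAE: {-results['test_neg_mean_absolute_error'].mean():.0f}")
```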
Comparison of Models
| Model | R² Score |
|---|---|
| Polynomial Regression | 0.86 |
| Decision Tree Regressor | 0.87 |
Insights:
- Decision Tree Regressor slightly outperforms Polynomial Regression in this case.
- Proper Hyperparameter Tuning significantly enhances the Decision Tree’s performance.
- Both models have their merits; the choice depends on the specific use case and data characteristics.
Conclusion and Best Practices
In this guide, we explored the implementation of Polynomial Regression and Decision Tree Regressor on an insurance dataset. Key takeaways include:
- Data Preprocessing: Proper encoding of categorical variables is essential for model accuracy.
- Model Evaluation: R² Score serves as a reliable metric to assess model performance.
- Hyperparameter Tuning: Adjusting parameters like max_depth can prevent overfitting and underfitting.
- Cross-Validation: Enhances the reliability of performance metrics.
Best Practices:
- Understand Your Data: Before modeling, explore and understand the dataset to make informed preprocessing and modeling decisions.
- Feature Engineering: Consider creating new features or transforming existing ones to capture underlying patterns.
- Model Selection: Experiment with multiple algorithms to identify the best performer for your specific task.
- Regularization Techniques: Utilize techniques like pruning in Decision Trees to prevent overfitting (see the sketch after this list).
- Continuous Learning: Stay updated with the latest machine learning techniques and best practices.
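On the pruning point above: besides limiting max_depth, DecisionTreeRegressor supports cost-complexity pruning through the ccp_alpha parameter, where larger values prune more aggressively. A minimal sketch with an illustrative alpha value (not tuned for this dataset):

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Cost-complexity pruning: ccp_alpha here is an illustrative value and should
# be tuned (e.g., via cost_complexity_pruning_path or GridSearchCV)
pruned_model = DecisionTreeRegressor(random_state=1, ccp_alpha=1000.0)
pruned_model.fit(X_train, y_train)

r2_pruned = r2_score(y_test, pruned_model.predict(X_test))
print(f'Pruned Decision Tree R2 Score: {r2_pruned:.2f}')
```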
By adhering to these practices, you can build robust and accurate predictive models tailored to your dataset and objectives.
Empower your data science journey by experimenting with these models on various datasets and exploring advanced techniques to further enhance model performance.