Visualizing Decision Tree Regression in Python: A Comprehensive Guide
Unlock the power of Decision Tree Regression with Python! In this comprehensive guide, we’ll walk you through visualizing a Decision Tree Regression model using Python’s powerful libraries. Whether you’re a budding data scientist or an experienced professional, understanding how to visualize and interpret your models is crucial for making informed decisions. We’ll delve into concepts like underfitting and overfitting, model evaluation, and practical implementation using real-world datasets.
Table of Contents
- Introduction to Decision Tree Regression
- Understanding the Dataset
- Setting Up Your Environment
- Data Exploration and Visualization
- Preparing the Data
- Building the Decision Tree Model
- Making Predictions
- Comparing Actual vs. Predicted Values
- Model Evaluation
- Visualizing the Model
- Understanding Underfitting and Overfitting
- Conclusion
1. Introduction to Decision Tree Regression
Decision Tree Regression is a versatile and powerful machine learning algorithm used for predicting continuous outcomes. Unlike linear regression models, decision trees can capture complex relationships and interactions between features without requiring extensive data preprocessing. Visualizing these trees helps in understanding the decision-making process of the model, making it easier to interpret and communicate results.
Why Visualization Matters:
- Interpretability: Easily understand how the model makes predictions.
- Debugging: Identify and rectify model flaws like overfitting or underfitting.
- Communication: Present clear insights to stakeholders.
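To see the core idea in isolation before working with the income data, here is a minimal sketch on synthetic data (not the dataset used in the rest of this guide): a shallow tree approximates a curved relationship with piecewise-constant steps, something a single straight line cannot do.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, clearly non-linear data: y roughly follows x squared
rng = np.random.RandomState(0)
X_toy = np.linspace(0, 10, 50).reshape(-1, 1)
y_toy = X_toy.ravel() ** 2 + rng.normal(scale=5, size=50)

# A shallow tree splits the x-axis into intervals and predicts the mean y in each
toy_tree = DecisionTreeRegressor(max_depth=3)
toy_tree.fit(X_toy, y_toy)

print(toy_tree.predict([[2.0], [5.0], [9.0]]))  # step-like, increasing predictions
```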
2. Understanding the Dataset
For our demonstration, we’ll use the Canada Per Capita Income dataset from Kaggle. This dataset contains the annual per capita income in Canada from 1970 to 2016, measured in US dollars.
Sample Data:
| Year | Per Capita Income (US$) |
|---|---|
| 1970 | 3399.30 |
| 1971 | 3768.30 |
| 1972 | 4251.18 |
| 1973 | 4804.46 |
| 1974 | 5576.51 |
3. Setting Up Your Environment
Before diving into the implementation, ensure you have the necessary libraries installed. We’ll be using `numpy`, `pandas`, `matplotlib`, `seaborn`, and `scikit-learn`.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn import tree
```
Why These Libraries?
- NumPy & Pandas: Efficient data manipulation and analysis.
- Matplotlib & Seaborn: High-quality data visualization.
- Scikit-learn: Robust machine learning tools and algorithms.
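To confirm everything is in place, a quick optional check prints the installed versions (the exact versions on your machine will differ):

```python
# Optional: confirm the libraries import correctly and see which versions you have
import matplotlib
import numpy
import pandas
import seaborn
import sklearn

for lib in (numpy, pandas, matplotlib, seaborn, sklearn):
    print(f"{lib.__name__:12s} {lib.__version__}")
```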
4. Data Exploration and Visualization
Understanding your data is the first crucial step. Let’s visualize the per capita income over the years to identify trends.
```python
# Load the dataset
data = pd.read_csv('canada_per_capita_income.csv')

# Display the first few rows
print(data.head())

# Scatter plot of income over the years
sns.scatterplot(data=data, x='year', y='per capita income (US$)')
plt.title('Canada Per Capita Income Over Years')
plt.xlabel('Year')
plt.ylabel('Per Capita Income (US$)')
plt.show()
```
Output:

Insights:
- There’s a clear upward trend in per capita income from 1970 to the early 2000s.
- Some fluctuations indicate economic events impacting income levels.
5. Preparing the Data
Before modeling, we need to split the data into features (`X`) and target (`Y`), followed by a train-test split to evaluate model performance.
```python
# Features and target
X = data[['year']]                       # Predictor
Y = data['per capita income (US$)']      # Target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, random_state=1
)
```
Why Train-Test Split?
- Training Set: To train the model.
- Testing Set: To evaluate the model’s performance on unseen data.
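As a quick sanity check on what the split produced (the exact counts depend on the size of your dataset), you can print how many rows landed in each set:

```python
# With test_size=0.20, roughly 80% of the rows end up in the training set
print("Training rows:", len(X_train))
print("Testing rows: ", len(X_test))
```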
6. Building the Decision Tree Model
With the data ready, let’s build and train a Decision Tree Regressor.
```python
# Initialize the model with a max depth of 10
model = DecisionTreeRegressor(max_depth=10)

# Train the model
model.fit(X_train, y_train)
```
Parameters Explained:
- max_depth: Controls the maximum depth of the tree. Deeper trees can capture more complex patterns but may overfit.
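As a quick illustration (reusing the `model` fitted above; the exact numbers depend on your data and split), you can inspect how deep the tree actually grew and how many leaf regions it produced, and compare that against a shallower cap:

```python
# Inspect the fitted tree: actual depth reached and number of leaf regions
print("Depth reached:   ", model.get_depth())
print("Number of leaves:", model.get_n_leaves())

# A smaller max_depth forces fewer, wider leaf regions (coarser predictions)
shallow = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
print("Shallow tree leaves:", shallow.get_n_leaves())
```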
7. Making Predictions
After training, use the model to make predictions on the test dataset.
```python
# Make predictions
y_pred = model.predict(X_test)

# Display predictions
print(y_pred)
```
Sample Output:
```
[15875.58673  17266.09769  37446.48609  25719.14715   3768.297935
  5576.514583 16622.67187  18601.39724  41039.8936   16369.31725]
```
8. Comparing Actual vs. Predicted Values
It’s essential to compare the actual values against the model’s predictions to gauge performance visually.
```python
# Create a comparison DataFrame
comparison = pd.DataFrame({
    'Actual': y_test,
    'Predicted': y_pred
})
print(comparison)
```
Sample Output:
| | Actual | Predicted |
|---|---|---|
| 24 | 15755.82 | 15875.59 |
| 22 | 16412.08 | 17266.10 |
| 39 | 32755.18 | 37446.49 |
| 35 | 29198.06 | 25719.15 |
| 2 | 4251.17 | 3768.30 |
| 3 | 4804.46 | 5576.51 |
| 29 | 17581.02 | 16622.67 |
| 32 | 19232.18 | 18601.40 |
| 45 | 35175.19 | 41039.89 |
| 26 | 16699.83 | 16369.32 |
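For a quick numeric read alongside the table, one option is to add error columns to the same `comparison` DataFrame (a small optional extension of the snippet above):

```python
# Absolute and percentage errors for each test-set year
comparison['Abs Error'] = (comparison['Actual'] - comparison['Predicted']).abs()
comparison['% Error'] = 100 * comparison['Abs Error'] / comparison['Actual']
print(comparison.sort_values('Abs Error', ascending=False))
```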
Visualization:
1 2 3 4 5 6 7 |
plt.scatter(X_test, y_test, color='blue', label='Actual') plt.scatter(X_test, y_pred, color='red', label='Predicted') plt.title('Actual vs Predicted Per Capita Income') plt.xlabel('Year') plt.ylabel('Per Capita Income (US$)') plt.legend() plt.show() |

9. Model Evaluation
To quantitatively assess the model’s performance, we’ll use the R² score, which indicates how well the model explains the variability of the target data.
```python
# Calculate the R² score
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.2f}")
```
Output:
```
R² Score: 0.93
```
Interpretation:
- An R² score of 0.93 implies that 93% of the variability in per capita income is explained by the model.
- This indicates a strong predictive performance.
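To make the interpretation concrete, R² can also be computed directly from its definition; the result should match the value from `r2_score` (a short sketch reusing `y_test` and `y_pred` from above):

```python
# R² = 1 - (sum of squared residuals) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
print("Manual R²:", 1 - ss_res / ss_tot)
```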
10. Visualizing the Model
Visualization helps in understanding the decision-making process of the model. We’ll plot the regression tree and the model’s predictions.
Plotting Predictions Over a Range of Years
```python
# Define a range of years for prediction
vals = np.arange(1950, 2050, 2).reshape(-1, 1)

# Predict using the model
predictions = model.predict(vals)

# Plot
plt.scatter(X, Y, color='blue', label='Data')
plt.plot(vals, predictions, color='red', label='Decision Tree Prediction')
plt.title('Decision Tree Regression Model')
plt.xlabel('Year')
plt.ylabel('Per Capita Income (US$)')
plt.legend()
plt.show()
```

Visualizing the Decision Tree Structure
Understanding the tree structure is vital for interpreting how decisions are made.
```python
# Plot the decision tree
plt.figure(figsize=(25, 15))
tree.plot_tree(model, fontsize=10, feature_names=['Year'], filled=True)
plt.title('Decision Tree Structure')
plt.show()
```

11. Understanding Underfitting and Overfitting
Balancing model complexity is crucial. Let’s explore how tweaking the `max_depth` parameter affects model performance.
Underfitting:
- Definition: Model is too simple, capturing neither the trend nor the noise.
- Indicator: Low R² score, poor performance on both training and testing data.
```python
# Initialize a model with a low max_depth
underfit_model = DecisionTreeRegressor(max_depth=2)
underfit_model.fit(X_train, y_train)

under_pred = underfit_model.predict(X_test)
under_r2 = r2_score(y_test, under_pred)
print(f"Underfitted Model R² Score: {under_r2:.2f}")
```
Output:
```
Underfitted Model R² Score: 0.65
```
Visualization:

Explanation:
- The model fails to capture the underlying trend, leading to inaccurate predictions.
Overfitting:
- Definition: Model is too complex, capturing noise along with the trend.
- Indicator: High R² on training data but poor generalization to testing data.
```python
# Initialize a model with a high max_depth
overfit_model = DecisionTreeRegressor(max_depth=10)
overfit_model.fit(X_train, y_train)

over_pred = overfit_model.predict(X_test)
over_r2 = r2_score(y_test, over_pred)
print(f"Overfitted Model R² Score: {over_r2:.2f}")
```
Output:
```
Overfitted Model R² Score: 0.92
```
Visualization:

Explanation:
- The model fits the training data exceptionally well but might struggle with unseen data due to its complexity.
Optimal Depth:
Finding a balance ensures the model generalizes well without being too simplistic or overly complex.
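One practical way to find that balance (a rough sketch reusing the train/test split from earlier; exact scores will vary with the split) is to sweep `max_depth` and watch where the test score stops improving while the training score keeps climbing:

```python
# Compare training vs. testing R² across a range of tree depths
for depth in [1, 2, 3, 4, 5, 7, 10]:
    m = DecisionTreeRegressor(max_depth=depth, random_state=1)
    m.fit(X_train, y_train)
    train_r2 = r2_score(y_train, m.predict(X_train))
    test_r2 = r2_score(y_test, m.predict(X_test))
    print(f"max_depth={depth:2d}  train R²={train_r2:.2f}  test R²={test_r2:.2f}")
```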
12. Conclusion
Visualizing Decision Tree Regression models offers invaluable insights into their decision-making processes, performance, and potential pitfalls like underfitting and overfitting. By adjusting parameters like `max_depth`, you can tailor the model’s complexity to suit your data’s intricacies, ensuring robust and reliable predictions.
Key Takeaways:
- Model Visualization: Essential for interpretability and debugging.
- Underfitting vs. Overfitting: Balancing complexity is crucial for optimal performance.
- Evaluation Metrics: Use R² score to quantify model performance.
Embrace these visualization techniques to enhance your data science projects, making your models not only accurate but also transparent and trustworthy.
Enhance your data science journey by mastering Decision Tree Regression and its visualization. Stay tuned for more tutorials and insights to elevate your analytical skills!