Implementing and Evaluating Linear Regression in Python: A Comprehensive Guide

Introduction to Linear Regression

Linear regression is one of the most fundamental and widely used algorithms in machine learning and data analysis. It models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. In this guide, we’ll walk you through implementing linear regression in Python using the scikit-learn library, comparing actual versus predicted results, and evaluating the model’s performance using the R-squared metric.
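
For the single-feature case used in this guide, the "linear equation" can be written out explicitly. The symbols below (the intercept, slope, and error term) are notation introduced here for illustration and do not come from the dataset.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad
(\hat{\beta}_0, \hat{\beta}_1)
  = \arg\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2
```

Here x_i is the year, y_i is the per capita income, and the second expression is the ordinary least squares criterion that scikit-learn's LinearRegression minimizes when it fits the line.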

Whether you’re a data science enthusiast or a seasoned professional, this comprehensive tutorial will provide you with the knowledge and practical skills to build and assess a linear regression model effectively.

Table of Contents

Dataset Overview
Setting Up Your Environment
Importing the Necessary Libraries
Loading and Exploring the Data
Visualizing the Data
Preparing the Data for Training
Building the Linear Regression Model
Making Predictions
Comparing Actual vs. Predicted Results
Evaluating the Model with R-squared
Conclusion
Further Reading

Dataset Overview

For this tutorial, we’ll be using the Canada Per Capita Income dataset from Kaggle. This dataset provides information on annual per capita income in Canada over several years, allowing us to analyze income trends and build a predictive model.

Setting Up Your Environment

Before diving into the code, ensure that you have Python installed on your system. It’s recommended to use a virtual environment to manage your project dependencies. You can set up a virtual environment using venv or tools like conda.

# Using venv
python -m venv linear_regression_env
source linear_regression_env/bin/activate  # On Windows: linear_regression_env\Scripts\activate

# Install necessary packages
pip install numpy pandas matplotlib seaborn scikit-learn
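
One quick way to confirm the installation succeeded is to query the installed versions from Python. This is a minimal sketch that relies only on the standard library's importlib.metadata; the helper name installed_versions is hypothetical and not part of any package.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as passed to pip in the command above
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn", "scikit-learn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be added with pip inside the activated environment.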

Importing the Necessary Libraries

We’ll start by importing the essential libraries required for data manipulation, visualization, and building our regression model.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Set seaborn style for better aesthetics
sns.set()

Loading and Exploring the Data

Next, we’ll load the dataset into a Pandas DataFrame and take a preliminary look at its structure.

# Import Data
data = pd.read_csv('canada_per_capita_income.csv')

# Preview the first few rows of the dataset
print(data.head())

Output:

   year  per capita income (US$)
0  1970              3399.299037
1  1971              3768.297935
2  1972              4251.175484
3  1973              4804.463248
4  1974              5576.514583

From the output, we can observe that the dataset contains two columns: year and per capita income (US$).
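
Beyond head(), a few more calls give a fuller picture of the structure. The sketch below rebuilds only the five preview rows by hand so that it runs without the CSV file; with the real file, the same calls would apply to the DataFrame returned by pd.read_csv.

```python
import pandas as pd

# The five preview rows shown above, typed in by hand (not the full dataset)
data = pd.DataFrame({
    "year": [1970, 1971, 1972, 1973, 1974],
    "per capita income (US$)": [
        3399.299037, 3768.297935, 4251.175484, 4804.463248, 5576.514583,
    ],
})

print(data.shape)         # (number of rows, number of columns)
print(data.dtypes)        # data type of each column
print(data.isna().sum())  # count of missing values per column
print(data.describe())    # summary statistics for numeric columns
```

Checking for missing values before training matters because LinearRegression does not accept NaN entries in its inputs.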

Visualizing the Data

Visualization helps in understanding the underlying patterns and relationships within the data. We’ll create a scatter plot to visualize the relationship between the year and per capita income.

# Plot the feature (year) on the x-axis and the target (income) on the y-axis
sns.scatterplot(data=data, x='year', y='per capita income (US$)')
plt.title('Per Capita Income Over Years in Canada')
plt.xlabel('Year')
plt.ylabel('Per Capita Income (US$)')
plt.show()

The scatter plot shows per capita income rising over the years. The relationship is not perfectly linear, however, so a straight-line fit will leave some of the variation unexplained.
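
A visual impression of the trend can be backed by a number. As a rough sketch, the Pearson correlation between the two columns measures how close the points lie to a straight line. The frame below contains only the five preview rows from the earlier output, so the result describes those rows and not the full dataset.

```python
import pandas as pd

# Five preview rows only; replace with the full DataFrame when the CSV is available
preview = pd.DataFrame({
    "year": [1970, 1971, 1972, 1973, 1974],
    "per capita income (US$)": [
        3399.299037, 3768.297935, 4251.175484, 4804.463248, 5576.514583,
    ],
})

# Series.corr computes the Pearson correlation coefficient by default
r = preview["year"].corr(preview["per capita income (US$)"])
print(f"Pearson r between year and income: {r:.3f}")
```

A value near 1 indicates a strong increasing linear association, while a lower value would suggest that a straight line describes the data less well.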

Preparing the Data for Training

Before training our model, we need to prepare the data by separating the features (independent variables) from the target variable (dependent variable).

# Define features and target variable
X = data.iloc[:, :-1]  # All columns except the last one
Y = data.iloc[:, -1]   # The last column

In this case, X contains the year column, and Y contains the per capita income (US$).
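
The shapes of X and Y matter here: scikit-learn estimators expect a two-dimensional feature matrix, even when there is only one feature. The sketch below, which again uses just the five preview rows, shows that the iloc slices above produce a one-column DataFrame and a Series, and that selecting by column name is an equivalent alternative.

```python
import pandas as pd

data = pd.DataFrame({
    "year": [1970, 1971, 1972, 1973, 1974],
    "per capita income (US$)": [
        3399.299037, 3768.297935, 4251.175484, 4804.463248, 5576.514583,
    ],
})

X = data.iloc[:, :-1]  # slice of columns -> stays a 2-D DataFrame
Y = data.iloc[:, -1]   # single column    -> becomes a 1-D Series

print(type(X).__name__, X.shape)  # DataFrame (5, 1)
print(type(Y).__name__, Y.shape)  # Series (5,)

# Equivalent selection by name; the double brackets keep X two-dimensional
X_by_name = data[["year"]]
Y_by_name = data["per capita income (US$)"]
```

Selecting by name is more robust if the column order of the file ever changes.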

Building the Linear Regression Model

We’ll now split the data into training and testing sets to evaluate the performance of our model on unseen data.

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)

Explanation:

X_train & y_train: Used to train the model.
X_test & y_test: Used to test the model’s performance.
test_size=0.20: 20% of the data is reserved for testing.
random_state=1: Ensures reproducibility of the split.
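
The effect of test_size and random_state can be checked directly. The following sketch uses a hypothetical 50-row frame with made-up values (not the Kaggle data) and splits it twice with the same settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 50 rows with made-up income values
X = pd.DataFrame({"year": np.arange(1970, 2020)})
Y = pd.Series(np.linspace(3000.0, 40000.0, num=50), name="income")

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, random_state=1
)
# Repeat the split with the same random_state
X_train_2, X_test_2, _, _ = train_test_split(
    X, Y, test_size=0.20, random_state=1
)

print(len(X_train), len(X_test))             # 40 10
print(X_test.index.equals(X_test_2.index))   # True: same rows both times
```

Without a fixed random_state, each run would select different test rows, and metrics such as R-squared would vary from run to run.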

Now, let’s instantiate and train the Linear Regression model.

# Initialize the Linear Regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Output the model details
print(model)

Output:

LinearRegression()
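
Printing the model shows only its configuration. The fitted line itself is available through the coef_ and intercept_ attributes. The sketch below fits hypothetical, exactly linear data (made-up values, not the Kaggle dataset) so that the learned parameters can be compared with the known slope and intercept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data that follows income = 500 * (year - 1970) + 3000 exactly
X = pd.DataFrame({"year": np.arange(1970, 1980)})
y = 500.0 * (X["year"] - 1970) + 3000.0

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]        # change in income per additional year
intercept = model.intercept_  # predicted income at year 0
print(f"slope: {slope:.2f} per year")
print(f"intercept: {intercept:.2f}")
```

On the real data, the slope would estimate the average yearly change in per capita income over the training period.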

Making Predictions

With the trained model, we can now make predictions on the test set.

# Make predictions on the test data
y_pred = model.predict(X_test)

# Display the predicted values
print(y_pred)

Output:

[20349.94572643 18613.49135581 33373.35350612 29900.44476487
  1248.94764955  2117.17483487 24691.081653   27295.76320894
  38582.716618   22086.40009706]

These values represent the model’s predictions of per capita income based on the input years in the test set.
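
The same predict call also works for years that are not in the dataset. As a sketch under stated assumptions, the example below trains on hypothetical, exactly linear data and predicts two later years. The input is a DataFrame with the same column name used in training, which keeps the feature names consistent.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data: income = 500 * (year - 1970) + 3000
X = pd.DataFrame({"year": np.arange(1970, 1980)})
y = 500.0 * (X["year"] - 1970) + 3000.0
model = LinearRegression().fit(X, y)

# New inputs must be 2-D and use the training column name
new_years = pd.DataFrame({"year": [1985, 1990]})
predictions = model.predict(new_years)

for year, value in zip(new_years["year"], predictions):
    print(f"{year}: {value:.2f}")
```

Predictions for years outside the training range are extrapolations: they assume the fitted trend continues unchanged.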

Comparing Actual vs. Predicted Results

To assess how well our model is performing, we’ll compare the actual values (y_test) with the predicted values (y_pred).

# Create a DataFrame to compare actual and predicted values
comparison = pd.DataFrame()
comparison['Actual'] = y_test
comparison['Predicted'] = y_pred

# Display the comparison
print(comparison)

Output:

              Actual     Predicted
24  15755.820270  20349.945726
22  16412.083090  18613.491356
39  32755.176820  33373.353506
35  29198.055690  29900.444765
2    4251.175484   1248.947650
3    4804.463248   2117.174835
29  17581.024140  24691.081653
32  19232.175560  27295.763209
45  35175.188980  38582.716618
26  16699.826680  22086.400097

Analysis:

Good Predictions: Entries where Actual and Predicted values are close indicate the model is performing well.
Discrepancies: Significant differences highlight areas where the model may need improvement or where the relationship is not perfectly linear.

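For instance, the predictions for rows 39 and 35 fall within roughly 700 US$ of the actual values, while others miss by several thousand: row 32 has an actual value of 19,232.18 against a predicted value of 27,295.76. The model also underestimates the early years (an actual value of 4,251.18 versus a predicted 1,248.95 for row 2), which suggests curvature in the data that a straight line cannot capture.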
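
The size of each discrepancy is easier to judge with an explicit error column. The sketch below re-enters the Actual and Predicted values from the output table above by hand; the Error and AbsError column names are illustrative choices.

```python
import pandas as pd

# Values copied from the comparison output above, with the same row index
comparison = pd.DataFrame(
    {
        "Actual": [
            15755.820270, 16412.083090, 32755.176820, 29198.055690,
            4251.175484, 4804.463248, 17581.024140, 19232.175560,
            35175.188980, 16699.826680,
        ],
        "Predicted": [
            20349.945726, 18613.491356, 33373.353506, 29900.444765,
            1248.947650, 2117.174835, 24691.081653, 27295.763209,
            38582.716618, 22086.400097,
        ],
    },
    index=[24, 22, 39, 35, 2, 3, 29, 32, 45, 26],
)

# Positive error = overestimate, negative error = underestimate
comparison["Error"] = comparison["Predicted"] - comparison["Actual"]
comparison["AbsError"] = comparison["Error"].abs()

print(comparison.sort_values("AbsError", ascending=False))
print(f"Mean absolute error: {comparison['AbsError'].mean():.2f}")
```

Sorting by absolute error puts the worst predictions at the top, and the sign of Error shows whether the model over- or underestimates a given year.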

Evaluating the Model with R-squared

To quantitatively assess the model’s performance, we’ll use the R-squared (R²) metric. R-squared represents the proportion of the variance for the dependent variable that’s explained by the independent variable(s) in the model.

# Calculate R-squared score
r2 = r2_score(y_test, y_pred)

print(f"R-squared Score: {r2:.2f}")

Output:

R-squared Score: 0.80

Interpretation:

An R-squared value of 0.80 indicates that, on the test set, the year alone explains about 80% of the variance in per capita income.
While 80% is a strong indication of a good fit, it also implies that 20% of the variance remains unexplained, possibly due to other factors not considered in the model or inherent data variability.

Understanding R-squared Values:

1: Perfect fit. The model explains all the variability.
0: No explanatory power. The model does no better than always predicting the mean of the target variable.
Negative values: The model performs worse than a horizontal line at the mean of the target variable.
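
These cases follow from the definition R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations from the mean. The sketch below computes this by hand from the Actual and Predicted values printed earlier and compares it with r2_score. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import r2_score

# Actual and predicted values copied from the comparison output above
y_true = np.array([
    15755.820270, 16412.083090, 32755.176820, 29198.055690, 4251.175484,
    4804.463248, 17581.024140, 19232.175560, 35175.188980, 16699.826680,
])
y_hat = np.array([
    20349.945726, 18613.491356, 33373.353506, 29900.444765, 1248.947650,
    2117.174835, 24691.081653, 27295.763209, 38582.716618, 22086.400097,
])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_hat)
print(f"manual: {r2_manual:.2f}, r2_score: {r2_sklearn:.2f}")
```

If the predictions were worse than simply using the mean, SS_res would exceed SS_tot and the score would drop below zero.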

Conclusion

In this guide, we’ve successfully implemented a Linear Regression model in Python to predict Canada’s per capita income over the years. By following these steps, you can:

Load and Explore Data: Understand the dataset’s structure and initial trends.
Visualize Relationships: Use scatter plots to identify potential linear relationships.
Prepare Data: Split the data into training and testing sets for unbiased evaluation.
Build and Train the Model: Utilize scikit-learn’s LinearRegression to fit the model.
Make Predictions: Generate predictions using the trained model.
Compare Results: Analyze how well the predicted values align with actual data.
Evaluate Performance: Use R-squared to quantify the model’s explanatory power.

While the model demonstrates a commendable R-squared value, there’s room for improvement. Exploring additional features, transforming variables, or employing more complex algorithms can potentially enhance predictive performance.
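
As one possible illustration of transforming variables, the sketch below adds a squared term through scikit-learn's PolynomialFeatures and compares it with a plain linear fit. It is not part of the original walkthrough: the data is hypothetical and deliberately curved, and both scores are computed on the training data purely to show the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical curved data: income grows quadratically with years since 1970
X = pd.DataFrame({"year": np.arange(1970, 1990)})
t = X["year"] - 1970
y = 3000.0 + 200.0 * t + 40.0 * t ** 2

# Plain straight-line fit
linear = LinearRegression().fit(X, y)

# Same estimator, but fed year and year^2 as features
quadratic = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression()
).fit(X, y)

# In-sample scores only; a real comparison should use a held-out test set
r2_linear = r2_score(y, linear.predict(X))
r2_quadratic = r2_score(y, quadratic.predict(X))
print(f"linear R^2:    {r2_linear:.4f}")
print(f"quadratic R^2: {r2_quadratic:.4f}")
```

Whether a transformation like this helps on the real income data would need to be checked on the test set, since a more flexible model can also overfit.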
