Implementing and Evaluating Linear Regression in Python: A Comprehensive Guide
Introduction to Linear Regression
Linear regression is one of the most fundamental and widely used algorithms in machine learning and data analysis. It models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. In this guide, we’ll walk you through implementing linear regression in Python using the scikit-learn library, comparing actual versus predicted results, and evaluating the model’s performance using the R-squared metric.
Whether you’re a data science enthusiast or a seasoned professional, this comprehensive tutorial will provide you with the knowledge and practical skills to build and assess a linear regression model effectively.
Table of Contents
- Dataset Overview
- Setting Up Your Environment
- Importing the Necessary Libraries
- Loading and Exploring the Data
- Visualizing the Data
- Preparing the Data for Training
- Building the Linear Regression Model
- Making Predictions
- Comparing Actual vs. Predicted Results
- Evaluating the Model with R-squared
- Conclusion
- Further Reading
Dataset Overview
For this tutorial, we’ll be using the Canada Per Capita Income dataset from Kaggle. This dataset provides information on annual per capita income in Canada over several years, allowing us to analyze income trends and build a predictive model.
Setting Up Your Environment
Before diving into the code, ensure that you have Python installed on your system. It’s recommended to use a virtual environment to manage your project dependencies. You can set up a virtual environment using `venv` or tools like `conda`.
```bash
# Using venv
python -m venv linear_regression_env
source linear_regression_env/bin/activate  # On Windows: linear_regression_env\Scripts\activate

# Install necessary packages
pip install numpy pandas matplotlib seaborn scikit-learn
```
Importing the Necessary Libraries
We’ll start by importing the essential libraries required for data manipulation, visualization, and building our regression model.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Set seaborn style for better aesthetics
sns.set()
```
Loading and Exploring the Data
Next, we’ll load the dataset into a Pandas DataFrame and take a preliminary look at its structure.
```python
# Import Data
data = pd.read_csv('canada_per_capita_income.csv')

# Preview the first few rows of the dataset
print(data.head())
```
Output:
```
   year  per capita income (US$)
0  1970              3399.299037
1  1971              3768.297935
2  1972              4251.175484
3  1973              4804.463248
4  1974              5576.514583
```
From the output, we can observe that the dataset contains two columns: `year` and `per capita income (US$)`.
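Beyond `head()`, a few quick checks can confirm the column types and that no values are missing before modeling. The sketch below rebuilds only the five preview rows shown above so that it runs without the CSV file; with the real dataset you would call the same methods on `data`. The name `preview` is an illustrative choice, not part of the original code.

```python
import pandas as pd

# Only the five rows from the preview above, so the snippet runs without the CSV.
preview = pd.DataFrame({
    "year": [1970, 1971, 1972, 1973, 1974],
    "per capita income (US$)": [
        3399.299037, 3768.297935, 4251.175484, 4804.463248, 5576.514583,
    ],
})

print(preview.shape)         # (rows, columns)
print(preview.dtypes)        # data type of each column
print(preview.isna().sum())  # number of missing values per column
print(preview.describe())    # count, mean, std, min, quartiles, max
```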
Visualizing the Data
Visualization helps in understanding the underlying patterns and relationships within the data. We’ll create a scatter plot to visualize the relationship between the year and per capita income.
```python
# Year is the independent variable, so it goes on the x-axis
sns.scatterplot(data=data, x='year', y='per capita income (US$)')
plt.title('Per Capita Income Over Years in Canada')
plt.xlabel('Year')
plt.ylabel('Per Capita Income (US$)')
plt.show()
```
The scatter plot reveals the trend of increasing per capita income over the years. However, the relationship may not be perfectly linear, indicating potential variability in the data.
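One rough way to put a number on how linear the trend looks is the Pearson correlation coefficient, where values near 1 or -1 suggest a strong linear relationship. The following is a minimal sketch on the five rows previewed earlier, not on the full dataset, so the value it prints is only indicative.

```python
import numpy as np

# The five rows from the earlier preview; the full dataset may give a different value.
years = np.array([1970, 1971, 1972, 1973, 1974])
income = np.array([3399.299037, 3768.297935, 4251.175484, 4804.463248, 5576.514583])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(years, income)[0, 1]
print(f"Pearson r (preview rows only): {r:.3f}")
```

With the real data, the same call on `data['year']` and `data['per capita income (US$)']` would cover all years.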
Preparing the Data for Training
Before training our model, we need to prepare the data by separating the features (independent variables) from the target variable (dependent variable).
```python
# Define features and target variable
X = data.iloc[:, :-1]  # All columns except the last one
Y = data.iloc[:, -1]   # The last column
```
In this case, `X` contains the `year` column, and `Y` contains the `per capita income (US$)` column.
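The way the columns are sliced matters because scikit-learn estimators expect the features as a two-dimensional structure (rows by columns), while the target can be one-dimensional. A small sketch on the five preview rows shows what each `iloc` expression returns; `preview` is a stand-in for `data`.

```python
import pandas as pd

# Stand-in for `data`: the five rows shown in the preview.
preview = pd.DataFrame({
    "year": [1970, 1971, 1972, 1973, 1974],
    "per capita income (US$)": [
        3399.299037, 3768.297935, 4251.175484, 4804.463248, 5576.514583,
    ],
})

X = preview.iloc[:, :-1]  # column slice -> DataFrame, stays 2-D
Y = preview.iloc[:, -1]   # single column position -> Series, 1-D

print(type(X).__name__, X.shape)
print(type(Y).__name__, Y.shape)
print(list(X.columns), Y.name)
```

Writing `data.iloc[:, 0]` for the features instead would produce a one-dimensional Series, which `fit` would reject for a single-feature model unless it is reshaped first.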
Building the Linear Regression Model
We’ll now split the data into training and testing sets to evaluate the performance of our model on unseen data.
```python
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)
```
Explanation:
- X_train & y_train: Used to train the model.
- X_test & y_test: Used to test the model’s performance.
- test_size=0.20: 20% of the data is reserved for testing.
- random_state=1: Ensures reproducibility of the split.
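To see what `test_size` and `random_state` do in practice, here is a small sketch on made-up data (50 hypothetical samples, not the income dataset). It checks the split sizes and that repeating the call with the same seed returns the same test rows. All `*_toy` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 50 samples, one feature (not the income dataset).
X_toy = np.arange(50).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() + 1.0

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=1)
print(len(X_tr), len(X_te))  # 40 training rows, 10 test rows

# Same seed -> same split.
_, X_te_again, _, _ = train_test_split(X_toy, y_toy, test_size=0.20, random_state=1)
print(np.array_equal(X_te, X_te_again))

# A different seed will usually select different test rows.
_, X_te_other, _, _ = train_test_split(X_toy, y_toy, test_size=0.20, random_state=2)
print(np.array_equal(X_te, X_te_other))
```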
Now, let’s instantiate and train the Linear Regression model.
```python
# Initialize the Linear Regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Output the model details
print(model)
```
Output:
```
LinearRegression()
```
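Printing the model only shows the estimator; the fitted line itself is stored in `model.coef_` (the slope) and `model.intercept_`. For a single feature, ordinary least squares has a closed form: the slope is the sum of (x - mean(x)) * (y - mean(y)) divided by the sum of (x - mean(x))², and the intercept is mean(y) - slope * mean(x). The sketch below applies that formula with NumPy to the five preview rows only, so its numbers will differ from those of the model trained on the full training set.

```python
import numpy as np

# The five preview rows; a model fitted on the full training set will differ.
years = np.array([1970, 1971, 1972, 1973, 1974], dtype=float)
income = np.array([3399.299037, 3768.297935, 4251.175484, 4804.463248, 5576.514583])

# Deviations from the means
x_dev = years - years.mean()
y_dev = income - income.mean()

# Closed-form ordinary least squares for one feature
slope = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
intercept = income.mean() - slope * years.mean()

print(f"slope: {slope:.2f} US$ per year")
print(f"intercept: {intercept:.2f}")
```

The slope reads as the average yearly change in per capita income over the rows used; the intercept is the extrapolated value at year 0 and has no practical meaning here.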
Making Predictions
With the trained model, we can now make predictions on the test set.
```python
# Make predictions on the test data
y_pred = model.predict(X_test)

# Display the predicted values
print(y_pred)
```
Output:
```
[20349.94572643 18613.49135581 33373.35350612 29900.44476487
  1248.94764955  2117.17483487 24691.081653   27295.76320894
 38582.716618   22086.40009706]
```
These values represent the model’s predictions of per capita income based on the input years in the test set.
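Under the hood, a prediction from a fitted `LinearRegression` is simply `intercept_ + coef_ * year`. The sketch below fits a model on the five preview rows only (so its numbers will not match the output above) and checks that identity. The years 1975 and 1976 and the name `new_years` are illustrative assumptions; passing them as a DataFrame with the same column name used during fitting keeps the input two-dimensional.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in training data: the five preview rows, not the full dataset.
preview = pd.DataFrame({
    "year": [1970, 1971, 1972, 1973, 1974],
    "per capita income (US$)": [
        3399.299037, 3768.297935, 4251.175484, 4804.463248, 5576.514583,
    ],
})
X = preview[["year"]]                   # 2-D features
y = preview["per capita income (US$)"]  # 1-D target

model = LinearRegression().fit(X, y)

# Hypothetical new inputs, shaped like the training features.
new_years = pd.DataFrame({"year": [1975, 1976]})
predicted = model.predict(new_years)

# The same numbers, computed directly from the fitted line.
manual = model.intercept_ + model.coef_[0] * new_years["year"].to_numpy()
print(predicted)
print(np.allclose(predicted, manual))
```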
Comparing Actual vs. Predicted Results
To assess how well our model is performing, we’ll compare the actual values (`y_test`) with the predicted values (`y_pred`).
```python
# Create a DataFrame to compare actual and predicted values
comparison = pd.DataFrame()
comparison['Actual'] = y_test
comparison['Predicted'] = y_pred

# Display the comparison
print(comparison)
```
Output:
```
          Actual     Predicted
24  15755.820270  20349.945726
22  16412.083090  18613.491356
39  32755.176820  33373.353506
35  29198.055690  29900.444765
2    4251.175484   1248.947650
3    4804.463248   2117.174835
29  17581.024140  24691.081653
32  19232.175560  27295.763209
45  35175.188980  38582.716618
26  16699.826680  22086.400097
```
Analysis:
- Good Predictions: Entries where `Actual` and `Predicted` values are close indicate the model is performing well.
- Discrepancies: Significant differences highlight areas where the model may need improvement or where the relationship is not perfectly linear.
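For instance, some predictions land close to the actual values (32,755.18 actual versus 33,373.35 predicted), while others miss by a wide margin, such as an actual value of 4,251.18 versus a predicted value of 1,248.95, or 19,232.18 versus 27,295.76. Gaps of this size suggest that a single straight line cannot capture all of the variation in the data.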
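Reading differences off the table by eye is error-prone, so it can help to add explicit error columns. The sketch below re-enters the ten rows from the comparison output above (rounded to two decimals) so that it runs on its own; with the real objects you would add the same columns to `comparison`. The column names `Error` and `Abs % Error` are illustrative additions.

```python
import pandas as pd

# Values copied from the comparison output, rounded to two decimals.
comparison = pd.DataFrame(
    {
        "Actual": [15755.82, 16412.08, 32755.18, 29198.06, 4251.18,
                   4804.46, 17581.02, 19232.18, 35175.19, 16699.83],
        "Predicted": [20349.95, 18613.49, 33373.35, 29900.44, 1248.95,
                      2117.17, 24691.08, 27295.76, 38582.72, 22086.40],
    },
    index=[24, 22, 39, 35, 2, 3, 29, 32, 45, 26],
)

# Signed error (positive means the model over-predicted)
comparison["Error"] = comparison["Predicted"] - comparison["Actual"]
# Error relative to the actual value, in percent
comparison["Abs % Error"] = (comparison["Error"].abs() / comparison["Actual"] * 100).round(1)

print(comparison.sort_values("Abs % Error", ascending=False))
print("Mean absolute error:", round(comparison["Error"].abs().mean(), 2))
```

In this table the largest absolute miss is at index 32, while the largest relative miss is at index 2, the 4,251.18 versus 1,248.95 row: small incomes in the early years make the same dollar error look much worse in percentage terms.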
Evaluating the Model with R-squared
To quantitatively assess the model’s performance, we’ll use the R-squared (R²) metric. R-squared represents the proportion of the variance for the dependent variable that’s explained by the independent variable(s) in the model.
```python
# Calculate R-squared score
r2 = r2_score(y_test, y_pred)

print(f"R-squared Score: {r2:.2f}")
```
Output:
```
R-squared Score: 0.80
```
Interpretation:
- An R-squared value of 0.80 indicates that, on the test set, about 80% of the variance in per capita income is explained by the model’s linear trend over the year.
- While 80% is a strong indication of a good fit, it also implies that 20% of the variance remains unexplained, possibly due to other factors not considered in the model or inherent data variability.
Understanding R-squared Values:
- 1: Perfect fit. The model explains all the variability.
- 0: No explanatory power. The model does not explain any variability.
- Negative Values: Indicate that the model performs worse than always predicting the mean of the target variable (a horizontal line).
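The definition behind `r2_score` is R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below applies that formula to the ten actual/predicted pairs printed in the comparison table; the helper name `r_squared` is hypothetical and used here only for illustration.

```python
import numpy as np

# The ten actual/predicted pairs from the comparison table.
actual = np.array([15755.820270, 16412.083090, 32755.176820, 29198.055690, 4251.175484,
                   4804.463248, 17581.024140, 19232.175560, 35175.188980, 16699.826680])
predicted = np.array([20349.945726, 18613.491356, 33373.353506, 29900.444765, 1248.947650,
                      2117.174835, 24691.081653, 27295.763209, 38582.716618, 22086.400097])


def r_squared(y_true, y_hat):
    """Return R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_hat) ** 2)          # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


print(f"Model:         {r_squared(actual, predicted):.2f}")

# Baseline that always predicts the mean of the actual values.
baseline = np.full_like(actual, actual.mean())
print(f"Mean baseline: {r_squared(actual, baseline):.2f}")
```

The first line should agree with the 0.80 reported above. The mean baseline scores 0, which is the reference point for the scale: any model that predicts worse than the mean ends up with a negative value.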
Conclusion
In this guide, we’ve successfully implemented a Linear Regression model in Python to predict Canada’s per capita income over the years. By following these steps, you can:
- Load and Explore Data: Understand the dataset’s structure and initial trends.
- Visualize Relationships: Use scatter plots to identify potential linear relationships.
- Prepare Data: Split the data into training and testing sets for unbiased evaluation.
- Build and Train the Model: Utilize scikit-learn’s `LinearRegression` to fit the model.
- Make Predictions: Generate predictions using the trained model.
- Compare Results: Analyze how well the predicted values align with actual data.
- Evaluate Performance: Use R-squared to quantify the model’s explanatory power.
While the model demonstrates a commendable R-squared value, there’s room for improvement. Exploring additional features, transforming variables, or employing more complex algorithms can potentially enhance predictive performance.
Further Reading
- Scikit-learn Documentation
- Understanding R-squared
- Linear Regression Assumptions
- Improving Model Performance
Embarking on this linear regression journey equips you with foundational skills applicable across various data-driven domains. Continue practicing with diverse datasets to deepen your understanding and proficiency in machine learning.