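A Step-by-Step Guide to Building a Linear Regression Model in Python

Harness the power of data-driven decision making with this comprehensive guide to implementing linear regression in Python. Whether you are new to data science or looking to sharpen your machine learning skills, this tutorial walks you through the whole process, from understanding the dataset to making accurate predictions.

Introduction to Linear Regression

Linear regression is a fundamental algorithm in machine learning and statistics. It models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. The technique is widely used for predictive analytics, forecasting, and gauging the strength of predictors.

Key topics covered:

What is linear regression?
Applications of linear regression
Linear vs. non-linear regression
Cost function and optimization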
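
To make the idea of fitting a linear equation concrete, here is a minimal sketch of simple linear regression with a single feature. It assumes the model y = b0 + b1 * x and uses mean squared error (MSE) as the cost function, minimized with the closed-form least-squares solution. The toy arrays and the helper names fit_line and mse are illustrative and are not taken from the tutorial's dataset.

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares estimates for y = b0 + b1 * x."""
    x_mean, y_mean = x.mean(), y.mean()
    b1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    b0 = y_mean - b1 * x_mean
    return b0, b1

def mse(y_true, y_pred):
    """Mean squared error: the cost that least squares minimizes."""
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical toy data that lies exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

b0, b1 = fit_line(x, y)
cost = mse(y, b0 + b1 * x)
print(f"intercept={b0:.2f}, slope={b1:.2f}, MSE={cost:.2f}")
```

Because the toy points sit exactly on a line, the recovered intercept is 1, the slope is 2, and the cost is 0. Real data will leave a non-zero cost.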

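Understanding the Dataset

This tutorial uses the Canada per capita income dataset, which is available on Kaggle. It records Canada's annual per capita income in US dollars.

Dataset overview:

Columns:
- year: the year the income was recorded.
- per capita income (US$): per capita income in US dollars.

Sample data: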

year	per capita income (US$)
1970	3399.299037
1971	3768.297935
1972	4251.175484
1973	4804.463248
1974	5576.514583
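
If you want to experiment before downloading the full file, the five sample rows above can be loaded into a DataFrame directly. This is only a sketch: the variable names raw and sample are assumptions, and five rows are far too few to stand in for the full dataset.

```python
import io
import pandas as pd

# The five sample rows shown above, as tab-separated text
raw = (
    "year\tper capita income (US$)\n"
    "1970\t3399.299037\n"
    "1971\t3768.297935\n"
    "1972\t4251.175484\n"
    "1973\t4804.463248\n"
    "1974\t5576.514583\n"
)

sample = pd.read_csv(io.StringIO(raw), sep="\t")
print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # year is parsed as integer, income as float
```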

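Setting Up the Python Environment

Before diving into the code, make sure your Python environment has the required libraries installed. We will use the following:

NumPy: numerical operations.
Pandas: data manipulation and analysis.
Matplotlib & Seaborn: data visualization.
Scikit-Learn: building and evaluating the linear regression model.

Installation command: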

pip install numpy pandas matplotlib seaborn scikit-learn
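
After installing, a quick check like the one below can confirm that each library is importable. It is a small sketch using only the standard library. Note that scikit-learn is imported under the name sklearn, and the variable names required and status are illustrative.

```python
import importlib.util

# Import names can differ from pip package names (scikit-learn -> sklearn)
required = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn"]

# find_spec returns None when a top-level module cannot be found
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'MISSING'}")
```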

Importing and Exploring the Data

Start by importing the required libraries and loading the dataset into a Pandas DataFrame.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style for better aesthetics
sns.set()

# Load the dataset
data = pd.read_csv('canada_per_capita_income.csv')

# Display the first few rows
print(data.head())

Output:

   year  per capita income (US$)
0  1970              3399.299037
1  1971              3768.297935
2  1972              4251.175484
3  1973              4804.463248
4  1974              5576.514583

Visualizing the data:

Visualizing the data is important for understanding its underlying patterns and relationships.

# Scatter plot to visualize the relationship
sns.scatterplot(data=data, x='year', y='per capita income (US$)')
plt.title('Canada Per Capita Income Over Years')
plt.xlabel('Year')
plt.ylabel('Per Capita Income (US$)')
plt.show()

*The scatter plot shows a positive, roughly linear trend: per capita income has generally risen over the years.*
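
The visual impression of a linear trend can be backed with a number. As a rough sketch, the Pearson correlation coefficient between year and income can be computed with NumPy. Here it is applied only to the five sample rows listed earlier, so the value is illustrative and is not a result for the full dataset.

```python
import numpy as np

# The five sample rows from the dataset overview
years = np.array([1970, 1971, 1972, 1973, 1974])
income = np.array([3399.299037, 3768.297935, 4251.175484,
                   4804.463248, 5576.514583])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(years, income)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value close to 1 indicates a strong positive linear association. On the full dataset you would pass data['year'] and data['per capita income (US$)'] instead of the hand-typed arrays.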

Data Preprocessing

Data preprocessing ensures the dataset is clean and suitable for building an effective model.

1. Checking for Missing Values

# Check for null values
print(data.isnull().sum())

Output:

year                         0
per capita income (US$)      0
dtype: int64

*There are no missing values.*
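
This dataset needs no cleanup, but when isnull().sum() reports non-zero counts, two common options are dropping the affected rows or filling the gaps. The sketch below uses a small hypothetical DataFrame (df, with made-up values) purely to show both calls. Which option is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value (not from the tutorial's dataset)
df = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "income": [100.0, np.nan, 120.0, 130.0],
})

print(df.isnull().sum())

# Option 1: drop rows that contain any missing value
dropped = df.dropna()

# Option 2: fill gaps by linear interpolation between neighbouring rows
filled = df.interpolate()

print(len(dropped), "rows remain after dropna")
print(filled)
```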

2. Separating the Features and the Target Variable

# Features
X = data.iloc[:, :-1]  # All columns except the last

# Target variable
Y = data.iloc[:, -1]   # The last column
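
scikit-learn estimators expect a two-dimensional feature matrix and a one-dimensional target, which is why X is sliced with :-1 (keeping a DataFrame) and Y with -1 (yielding a Series). The sketch below checks the shapes on a small stand-in frame built from the sample rows and shows an equivalent selection by column name. The names demo, X_demo, and y_demo are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    "year": [1970, 1971, 1972],
    "per capita income (US$)": [3399.299037, 3768.297935, 4251.175484],
})

X_demo = demo.iloc[:, :-1]   # DataFrame, shape (3, 1)
y_demo = demo.iloc[:, -1]    # Series, shape (3,)
print(X_demo.shape, y_demo.shape)

# Equivalent selection by name, which keeps working if columns are reordered
X_named = demo[["year"]]
y_named = demo["per capita income (US$)"]
```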

3. Train-Test Split

Splitting the dataset into a training set and a test set lets you evaluate the model's performance on unseen data.

from sklearn.model_selection import train_test_split

# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=1)

*Setting random_state makes the split, and therefore the results, reproducible.*
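
The effect of random_state is easy to verify. The sketch below splits a hypothetical ten-row array twice with the same seed and confirms that both calls produce identical partitions. The data and variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, one feature
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10) * 2.0

split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)

# Same seed -> identical train/test partitions
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
print("train rows:", len(split_a[0]), "test rows:", len(split_a[1]))
```

Omitting random_state (or changing it) would generally give a different partition on each run, and with it slightly different evaluation scores.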

Building the Linear Regression Model

With the data prepared, we can now build the linear regression model.

from sklearn.linear_model import LinearRegression

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

Model summary:

print(model)

Output:

LinearRegression()

*This output shows the estimator object; since it was fitted above, the model is ready to make predictions.*
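
After fitting, the learned line is stored on the estimator as coef_ (one slope per feature) and intercept_. The minimal sketch below uses hypothetical data that lies exactly on y = 5 + 3x to show how to read them. On the tutorial's model you would inspect model.coef_ and model.intercept_ in the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data on the line y = 5 + 3x
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([8.0, 11.0, 14.0, 17.0])

demo_model = LinearRegression().fit(X_demo, y_demo)

slope = demo_model.coef_[0]         # change in y per unit change in x
intercept = demo_model.intercept_   # predicted y when x = 0
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```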

Making Predictions

The trained model can now predict per capita income for the test set.

# Make predictions on the test set
y_pred = model.predict(X_test)

# Display the predictions alongside actual values
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison)

*This comparison shows how closely the model's predictions match the actual data.*
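
predict also accepts inputs outside the test set, as long as they have the same column layout as the training features. The sketch below fits a throwaway model on the five sample rows shown earlier and asks for an estimate for 1975. The names sample, demo_model, and new_years are illustrative, and a prediction from five rows says nothing about the quality of the full model.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

sample = pd.DataFrame({
    "year": [1970, 1971, 1972, 1973, 1974],
    "per capita income (US$)": [3399.299037, 3768.297935, 4251.175484,
                                4804.463248, 5576.514583],
})

demo_model = LinearRegression().fit(
    sample[["year"]], sample["per capita income (US$)"]
)

# New inputs must be 2D and use the same feature name as in training
new_years = pd.DataFrame({"year": [1975]})
estimate = demo_model.predict(new_years)[0]
print(f"Estimated income for 1975: {estimate:.2f}")
```

Extrapolating far beyond the observed years assumes the linear trend continues, which may not hold.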

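Model Evaluation

Evaluating the model's performance is essential for understanding its accuracy and reliability.

1. Computing the R² Score

The R² score, also known as the coefficient of determination, indicates how well the regression model fits the data.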

from sklearn.metrics import r2_score

# Calculate R²
r2 = r2_score(y_test, y_pred)
print(f'R² Score: {r2:.2f}')

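Interpretation:

R² = 1: perfect fit.
R² = 0: the model explains none of the variability.
0 < R² < 1: the proportion of variance explained by the model.
R² < 0: possible on test data when the model predicts worse than simply using the mean of the actual values.

*In our case, the higher the R² value, the better the fit.*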
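
These cases follow from the definition R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of the actual values. The sketch below computes it by hand on hypothetical numbers and compares the result with r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_hat)

print(f"manual R2  = {r2_manual:.4f}")
print(f"sklearn R2 = {r2_sklearn:.4f}")
```

Here SS_res is 0.75 and SS_tot is 20, so both values come out to 0.9625.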

2. Visualizing Predicted vs. Actual Values

# Plotting Actual vs Predicted values
plt.figure(figsize=(10,6))
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.title('Actual vs Predicted Per Capita Income')
plt.xlabel('Year')
plt.ylabel('Per Capita Income (US$)')
plt.legend()
plt.show()

*This visualization helps assess how accurate the predictions are across different years.*
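
Alongside the plot, the residuals (actual minus predicted) give a per-row view of where the model over- or under-predicts. The following sketch assumes a comparison table like the one built in the prediction step, here filled with hypothetical numbers, and adds a residual column plus the mean absolute error.

```python
import pandas as pd

# Hypothetical values standing in for y_test and y_pred
comparison_demo = pd.DataFrame({
    "Actual": [4000.0, 5200.0, 6100.0],
    "Predicted": [4100.0, 5000.0, 6150.0],
})

# Positive residual: the model under-predicted; negative: it over-predicted
comparison_demo["Residual"] = comparison_demo["Actual"] - comparison_demo["Predicted"]
mae = comparison_demo["Residual"].abs().mean()

print(comparison_demo)
print(f"Mean absolute error: {mae:.2f}")
```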

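Conclusion

In this tutorial, we walked through building a linear regression model in Python using the Canada per capita income dataset. From understanding the dataset through preprocessing, model building, prediction, and evaluation, each step matters for developing an accurate and reliable predictive model.

Key takeaways:

Linear regression is a powerful tool for predicting continuous variables.
Proper data preprocessing improves model performance.
Visualization helps in understanding data trends and model accuracy.
Evaluation metrics such as R² are essential for assessing a model's effectiveness.

Next steps:

Explore more complex datasets with multiple features.
Learn about other regression techniques such as Ridge regression and Lasso regression.
Study classification algorithms for problems involving categorical data.

Additional Resources

Strengthen your data science journey by mastering linear regression in Python. Stay tuned for more tutorials and insights on machine learning and data analysis!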

S06L02 – Implementing Linear Regression in Python – Part 1
