S16L02 – Master Template Regression Model – Model and Evaluation

Mastering Car Price Prediction with Advanced Regression Models: A Comprehensive Guide

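Table of Contents

Introduction
Dataset Overview
Data Import and Initial Exploration
Data Cleaning and Preprocessing
    Handling Missing Numerical Data
    Handling Missing Categorical Data
Feature Selection and Encoding
    Dropping Irrelevant Features
    One-Hot Encoding Categorical Variables
Train-Test Split
Feature Scaling
Building and Evaluating Regression Models
    1. Linear Regression
    2. Polynomial Linear Regression
    3. Decision Tree Regression
    4. Random Forest Regression
    5. AdaBoost Regression
    6. XGBoost Regression
    7. Support Vector Regression (SVR)
Model Performance Comparison
Conclusion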




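Introduction

Predictive analytics enables businesses to forecast future trends, optimize operations, and improve decision-making. Car price prediction is a representative machine learning use case: a model estimates a vehicle's price from attributes such as brand, engine specifications, and fuel type. This guide walks through building a complete regression pipeline, from data preprocessing to evaluating several regression algorithms.

Dataset Overview

The Car Price Prediction dataset from Kaggle contains 205 entries with 26 columns each, including the target price. The features cover various aspects of a car, such as the number of doors, engine size, horsepower, and fuel type, all of which influence its market price.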

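Key features:

CarName: Name of the car (brand and model)
FuelType: Type of fuel used (e.g., gas, diesel)
Aspiration: Engine aspiration type
Doornumber: Number of doors (two or four)
Enginesize: Engine size
Horsepower: Engine power output
Price: Market price of the car (target variable)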


Data Import and Initial Exploration

First, load the dataset with pandas and take a preliminary look at its structure.

import pandas as pd

# Load the dataset
data = pd.read_csv('CarPrice.csv')

# Display the first five rows
print(data.head())

Sample output:

   car_ID  symboling                   CarName fueltype aspiration doornumber  \
0       1          3        alfa-romero giulia      gas        std        two   
1       2          3       alfa-romero stelvio      gas        std        two   
2       3          1  alfa-romero Quadrifoglio      gas        std        two   
3       4          2               audi 100 ls      gas        std       four   
4       5          2                audi 100ls      gas        std       four   

      carbody drivewheel enginelocation  wheelbase  ...  horsepower  peakrpm citympg  \
0  convertible        rwd          front       88.6  ...       111.0     5000      21   
1  convertible        rwd          front       88.6  ...       111.0     5000      21   
2    hatchback        rwd          front       94.5  ...       154.0     5000      19   
3        sedan        fwd          front       99.8  ...       102.0     5500      24   
4        sedan        4wd          front       99.4  ...       115.0     5500      18   

   highwaympg    price  
0          27  13495.0  
1          27  16500.0  
2          26  16500.0  
3          30  13950.0  
4          22  17450.0  
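Beyond the first five rows, a quick structural check usually covers the shape, the column types, and the count of missing values per column. The sketch below is only an illustration: it builds a small hypothetical stand-in frame from a few columns of the sample output (with one horsepower value blanked on purpose) instead of reading CarPrice.csv, so the numbers it prints do not describe the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data = pd.read_csv('CarPrice.csv'),
# using a few columns from the sample output above
data = pd.DataFrame({
    "car_ID": [1, 2, 3, 4, 5],
    "CarName": ["alfa-romero giulia", "alfa-romero stelvio",
                "alfa-romero Quadrifoglio", "audi 100 ls", "audi 100ls"],
    "fueltype": ["gas", "gas", "gas", "gas", "gas"],
    # Last value blanked on purpose to illustrate a missing entry
    "horsepower": [111.0, 111.0, 154.0, 102.0, np.nan],
    "price": [13495.0, 16500.0, 16500.0, 13950.0, 17450.0],
})

# Number of rows and columns
print(data.shape)

# Data type of every column (numeric vs. string/object)
print(data.dtypes)

# Count of missing values per column
missing_per_column = data.isna().sum()
print(missing_per_column)

# Summary statistics for the numeric columns
print(data.describe())
```

On the real file the same four calls apply unchanged once `data` comes from `pd.read_csv`.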
					
				
			
		



Data Cleaning and Preprocessing

Handling Missing Numerical Data

Missing values can significantly distort a machine learning model's performance. First, missing values in the numerical columns are replaced with the column mean.

import numpy as np
from sklearn.impute import SimpleImputer

# Assumption: the snippets in this guide use X and Y without defining them;
# here price is taken as the target and every other column as a feature
X = data.drop(columns=['price'])
Y = data['price']

# Identify numerical columns by position
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Fit a mean imputer on the numerical columns
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X.iloc[:, numerical_cols])

# Replace missing numerical values with the column means
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])

Handling Missing Categorical Data

For categorical variables, missing values are replaced with the most frequent value.

# Identify categorical (string) columns by position
# np.object was removed in NumPy 1.24, so the built-in object is used instead
string_cols = list(np.where(X.dtypes == object)[0])

# Fit a most-frequent imputer on the categorical columns
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_freq.fit(X.iloc[:, string_cols])

# Replace missing categorical values with the most frequent value of each column
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
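To make the two imputation strategies concrete, here is a minimal sketch on a hypothetical four-row frame (the column names mirror the dataset, but the values and the missing entries are made up for illustration). It shows the mean filling a numeric gap and the most frequent category filling a categorical gap.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; the NaN entries are placed deliberately
toy = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0, 102.0],
    "fueltype": pd.Series(["gas", "gas", np.nan, "diesel"], dtype=object),
})
num_cols = ["horsepower"]
cat_cols = ["fueltype"]

# Numeric gap -> mean of the observed values in that column
toy[num_cols] = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])

# Categorical gap -> most frequent category in that column
toy[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])

print(toy)
```

The missing horsepower becomes the average of 111, 154 and 102, and the missing fuel type becomes "gas", the value that occurs most often.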
					
				
			
		



Feature Selection and Encoding

Dropping Irrelevant Features

The car_ID column is a unique identifier and contributes nothing to the model's predictive power, so it is removed.

# Drop 'car_ID' column
X.drop('car_ID', axis=1, inplace=True)

One-Hot Encoding Categorical Variables

Machine learning algorithms require numerical input, so the categorical variables are converted with one-hot encoding.

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Re-identify categorical columns after dropping 'car_ID'
# (np.object was removed in NumPy 1.24, so the built-in object is used instead)
string_cols = list(np.where(X.dtypes == object)[0])

# Apply one-hot encoding to the categorical columns and pass the rest through unchanged
columnTransformer = ColumnTransformer(
    [('encoder', OneHotEncoder(), string_cols)],
    remainder='passthrough'
)
X = columnTransformer.fit_transform(X)

Before encoding:

Shape: (205, 24)

After encoding:

Shape: (205, 199)
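The jump in column count comes from each categorical column being replaced by one indicator column per distinct value. A minimal sketch on a hypothetical four-row frame (the values are illustrative, not taken from the full dataset) makes the arithmetic visible: two fuel types plus three body styles plus one untouched numeric column gives six columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", "gas"],
    "carbody": ["sedan", "hatchback", "convertible", "sedan"],
    "horsepower": [111.0, 102.0, 154.0, 115.0],
})
categorical = ["fueltype", "carbody"]

# One indicator column per distinct category; numeric columns pass through
encoder = ColumnTransformer(
    [("encoder", OneHotEncoder(), categorical)],
    remainder="passthrough",
)
encoded = encoder.fit_transform(toy)

print(toy.shape)      # shape before encoding
print(encoded.shape)  # shape after encoding
```

On the real data, a column with many distinct values (such as the car name) contributes one indicator column per value, which is how 24 columns can grow to 199. When the result is mostly zeros, `ColumnTransformer` may return a SciPy sparse matrix rather than a dense array.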


Train-Test Split

Splitting the dataset into training and test sets is essential for evaluating how a model performs on data it has not seen.

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, random_state=1
)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

Output:

Training set shape: (164, 199)
Testing set shape: (41, 199)

Feature Scaling

Feature scaling puts all features on a comparable scale so that no feature dominates simply because of its numeric range. Standardization is used here.

from sklearn import preprocessing

# Initialize StandardScaler; with_mean=False scales by the standard deviation without centering
sc = preprocessing.StandardScaler(with_mean=False)

# Fit on the training set only so that test data does not influence the scaling
sc.fit(X_train)

# Transform both sets with the statistics learned from the training set
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
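The `with_mean=False` argument matters when the encoded matrix is sparse, which is likely here because most one-hot columns are zero (an assumption about this pipeline, not something the guide states). Centering would turn the zeros into non-zero values, so scikit-learn refuses to center sparse input. The sketch below uses a tiny hypothetical sparse matrix to show both behaviours.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical sparse matrix: two indicator columns and one numeric column
X_sparse = sparse.csr_matrix(np.array([
    [1.0, 0.0, 10.0],
    [0.0, 1.0, 20.0],
    [1.0, 0.0, 30.0],
]))

# Default StandardScaler tries to subtract the mean, which is rejected for sparse input
try:
    StandardScaler().fit(X_sparse)
    centering_ok = True
except ValueError:
    centering_ok = False

# with_mean=False only divides each column by its standard deviation
scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)

print(centering_ok)
print(scaled.toarray())
```

After scaling, every column has unit standard deviation, but the columns are not centered at zero and the matrix stays sparse.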
					
				
			
		



Building and Evaluating Regression Models

Several regression models are explored below, and each is evaluated by its R² score on the test set.

1. Linear Regression

Linear regression serves as the baseline model.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Linear Regression R² Score: {r2:.2f}")

R² score: 0.097 (printed as 0.10 after rounding to two decimals)

Interpretation: The model explains roughly 9.7% of the variance in car prices.
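For reference, R² compares the model's squared error with the squared error of always predicting the mean: R² = 1 − SS_res / SS_tot. The sketch below computes it by hand and checks it against `r2_score`. The five prices come from the sample output above; the two sets of predictions are hypothetical and chosen only to show the two reference points, a score of zero for the mean predictor and a negative score for something worse.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Prices from the first five rows of the sample output
y_true = [13495.0, 16500.0, 16500.0, 13950.0, 17450.0]

# Hypothetical predictions: always the mean, and a constant far from the data
y_mean = [float(np.mean(y_true))] * len(y_true)
y_bad = [20000.0] * len(y_true)

print(r2_manual(y_true, y_mean))  # mean predictor: the zero reference point
print(r2_manual(y_true, y_bad))   # worse than the mean: negative
print(r2_score(y_true, y_bad))    # scikit-learn agrees with the manual formula
```

A negative R² therefore does not mean "negative variance explained"; it means the predictions have a larger squared error than the mean of the test targets would.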

2. Polynomial Linear Regression

Polynomial features are introduced to capture nonlinear relationships.

from sklearn.preprocessing import PolynomialFeatures

# Expand the features with all degree-2 terms (squares and pairwise products)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Train the model on the expanded features
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_poly)
r2 = r2_score(y_test, y_pred)
print(f"Polynomial Linear Regression R² Score: {r2:.2f}")

R² score: -0.45

Interpretation: The score is negative, so the model performs worse than the baseline; its predictions are less accurate than simply using the mean price for every car.
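One plausible reading of the negative score, consistent with the overfitting explanation given later in this guide, is the sheer number of columns the degree-2 expansion creates. With n input columns, `PolynomialFeatures(degree=2)` produces C(n + 2, 2) output columns (bias, linear terms, squares and pairwise products). The sketch below applies that count to the shapes reported above and sanity-checks the formula on a small hypothetical input.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 199  # columns after one-hot encoding, per the shapes reported above
n_train = 164     # rows in the training set, per the split output above

# Columns produced by a full degree-2 expansion (including the bias column)
n_poly = comb(n_features + 2, 2)
print(f"{n_poly} polynomial columns vs. {n_train} training rows")

# Sanity check of the formula on a small hypothetical input with 5 columns
small = np.zeros((1, 5))
n_small = PolynomialFeatures(degree=2).fit_transform(small).shape[1]
print(n_small, comb(5 + 2, 2))
```

With far more columns than training rows, an unregularized linear model can fit the training set exactly and still generalize poorly.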

3. Decision Tree Regression

Decision trees can model complex relationships by recursively partitioning the data.

from sklearn.tree import DecisionTreeRegressor

# Initialize and train the model; max_depth=4 limits how deep the tree can grow
model = DecisionTreeRegressor(max_depth=4)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Decision Tree Regression R² Score: {r2:.2f}")

R² score: 0.88

Interpretation: A substantial improvement; the model explains 88% of the variance.

4. Random Forest Regression

A random forest aggregates many decision trees to improve performance and mitigate overfitting.

from sklearn.ensemble import RandomForestRegressor

# Initialize and train the model with 25 trees
model = RandomForestRegressor(n_estimators=25, random_state=10)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Random Forest Regression R² Score: {r2:.2f}")

R² score: 0.91

Interpretation: Excellent performance; the model explains 91% of the variance.
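The aggregation step can be inspected directly: for regression, scikit-learn's random forest prediction is the average of the predictions of its individual trees, available through the fitted `estimators_` attribute. The sketch below uses hypothetical synthetic data from `make_regression` as a stand-in for the car dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data standing in for the car dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=10).fit(X_demo, y_demo)

# One row of predictions per tree: shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own prediction
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict(X_demo)))

# Average spread between trees for the same sample
print(per_tree.std(axis=0).mean())
```

Individual trees disagree with each other, and averaging smooths out that disagreement, which is the mechanism behind the reduced overfitting.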

5. AdaBoost Regression

AdaBoost combines weak learners sequentially, with each new learner concentrating on the samples its predecessors predicted poorly, to form a strong predictor.

from sklearn.ensemble import AdaBoostRegressor

# Initialize and train the model with 100 boosting rounds
model = AdaBoostRegressor(random_state=0, n_estimators=100)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"AdaBoost Regression R² Score: {r2:.2f}")

R² score: 0.88

Interpretation: Similar to the decision tree, the model explains 88% of the variance.

6. XGBoost Regression

XGBoost is a powerful gradient boosting framework known for its efficiency and performance.

import xgboost as xgb

# Initialize and train the model
model = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3,
    learning_rate=0.05
)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"XGBoost Regression R² Score: {r2:.2f}")

R² score: 0.89

Interpretation: Solid performance; the model explains 89% of the variance.

7. Support Vector Regression (SVR)

SVR can be effective in high-dimensional spaces, but its performance may degrade on larger datasets.

from sklearn.svm import SVR

# Initialize and train the model with default hyperparameters
model = SVR()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Support Vector Regression (SVR) R² Score: {r2:.2f}")

R² score: -0.03

Interpretation: Poor performance; the slightly negative score means the model does marginally worse than predicting the mean price for every car.
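Besides dimensionality, another factor that may contribute to a near-zero score is the scale of the target: the default SVR settings (C=1.0, epsilon=0.1) are small relative to prices in the tens of thousands. This is an assumption about a possible cause, not something the results above confirm. The sketch below uses hypothetical synthetic data with a price-like target and compares a default SVR with the same SVR wrapped in `TransformedTargetRegressor`, which standardizes the target for fitting and maps predictions back.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical synthetic data with a price-like target in the thousands
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (13000 + 4000 * X_demo[:, 0] + 2000 * X_demo[:, 1]
          + rng.normal(scale=500, size=200))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)

# Default SVR fitted directly on the raw target
r2_raw = r2_score(y_te, SVR().fit(X_tr, y_tr).predict(X_te))

# Same SVR, but the target is standardized during fitting and un-scaled on prediction
scaled_target_svr = TransformedTargetRegressor(regressor=SVR(), transformer=StandardScaler())
r2_scaled = r2_score(y_te, scaled_target_svr.fit(X_tr, y_tr).predict(X_te))

print(f"SVR on raw target:    {r2_raw:.2f}")
print(f"SVR on scaled target: {r2_scaled:.2f}")
```

Whether this carries over to the car data would need to be checked on the real pipeline; tuning C and epsilon is the other common route.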

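Model Performance Comparison

Model	R² Score
Linear Regression	0.10
Polynomial Linear Regression	-0.45
Decision Tree Regression	0.88
Random Forest Regression	0.91
AdaBoost Regression	0.88
XGBoost Regression	0.89
Support Vector Regression (SVR)	-0.03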



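Insights:

Random Forest Regression outperforms all other models with an R² score of 0.91, explaining 91% of the variance in car prices.
Polynomial Linear Regression performs worst, falling below even the baseline, which suggests overfitting or an unsuitable feature transformation.
Support Vector Regression (SVR) struggled on this dataset, likely because of the high dimensionality after encoding.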
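A comparison like the one above can be produced in a single loop rather than one block per model. The following is a minimal sketch on hypothetical synthetic data from `make_regression`, using the same hyperparameters as the sections above where they were given (`random_state=0` on the decision tree is an addition for reproducibility). XGBoost is left out to keep the example dependent on scikit-learn only, and the scores it prints will not match the table because the data differ.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Hypothetical synthetic data with the same number of rows as the car dataset
X_demo, y_demo = make_regression(n_samples=205, n_features=10, noise=15.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regression": DecisionTreeRegressor(max_depth=4, random_state=0),
    "Random Forest Regression": RandomForestRegressor(n_estimators=25, random_state=10),
    "AdaBoost Regression": AdaBoostRegressor(n_estimators=100, random_state=0),
    "Support Vector Regression (SVR)": SVR(),
}

# Fit each model on the training set and score it on the held-out test set
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))

# Print the models from best to worst test-set R²
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}\t{score:.2f}")
```

Keeping the split and the metric fixed inside one loop ensures every model is judged on exactly the same test rows.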


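Conclusion

Predictive modeling for car prices underscores the importance of choosing an appropriate algorithm and preprocessing the data thoroughly. In this exploration:

The Decision Tree and Random Forest models performed very well, with Random Forest slightly ahead.
The ensemble methods AdaBoost and XGBoost also delivered strong results, demonstrating their effectiveness on complex datasets.
Linear models do not always improve when extended with polynomial features; the extension can sometimes reduce a model's usefulness.
Support Vector Regression (SVR) may be a poor fit for datasets with high dimensionality or with less pronounced nonlinear patterns.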


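Key takeaways:

Data preprocessing: Handling missing values and encoding categorical variables are important steps that strongly affect model performance.
Feature scaling: Puts all features on a comparable footing and improves the efficiency of gradient-based algorithms.
Model selection: Ensemble methods such as Random Forest and XGBoost often deliver superior performance on regression tasks.
Model evaluation: The R² score is a useful metric for assessing how well predictions approximate the actual outcomes.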


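Predicting car prices with advanced regression models not only improves predictive accuracy but also gives stakeholders actionable insight into market trends. As machine learning continues to evolve, keeping up with the latest algorithms and techniques is essential for data enthusiasts and professionals alike.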