Mastering GridSearchCV for Optimal Machine Learning Models: A Comprehensive Guide

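Introduction to GridSearchCV
Understanding the Dataset
Data Preprocessing
- Handling Missing Data
- Encoding Categorical Variables
- Feature Selection
- Feature Scaling
Implementing GridSearchCV
- Setting Up Cross-Validation with StratifiedKFold
- GridSearchCV Parameters Explained
Building and Tuning Machine Learning Models
- K-Nearest Neighbors (KNN)
- Logistic Regression
- Gaussian Naive Bayes
- Support Vector Machine (SVM)
- Decision Tree
- Random Forest
- AdaBoost
- XGBoost
Performance Analysis
Optimizing GridSearchCV
Conclusion and Next Steps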

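1. Introduction to GridSearchCV

GridSearchCV is a technique for hyperparameter tuning in machine learning. Hyperparameters control the learning process and the structure of the model. Unlike ordinary model parameters, which are learned from the data, they are set before training begins and can significantly affect model performance.

GridSearchCV exhaustively searches a specified parameter grid, evaluates each combination with cross-validation, and identifies the combination that performs best on a chosen metric (for example, F1-score or accuracy).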
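
To make the search procedure concrete, here is a minimal sketch of what an exhaustive grid search enumerates. It is not scikit-learn's implementation; expand_grid is a hypothetical helper name, and the grid values are borrowed from the KNN example later in this guide.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination in a parameter grid as a list of dicts."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Illustrative grid (values borrowed from the KNN example in this guide).
grid = {
    "n_neighbors": [4, 5, 6, 7],
    "weights": ["uniform", "distance"],
}

combinations = expand_grid(grid)
print(len(combinations))  # 4 * 2 = 8 candidate settings
print(combinations[0])    # first candidate in the enumeration
```

Each candidate would then be scored with cross-validation, and the best-scoring one kept.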

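Why GridSearchCV?

Comprehensive search: evaluates every hyperparameter combination in the specified grid.
Cross-validation: scores each combination on several data splits, so the chosen settings are less likely to be tailored to one particular subset of the data.
Automation: streamlines the tuning process, replacing manual trial and error.

However, GridSearchCV can be computationally intensive, especially with large datasets and extensive parameter grids. This guide explores strategies for managing that cost effectively.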

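2. Understanding the Dataset

This demonstration uses a dataset on airline passenger satisfaction. The dataset originally contains more than 100,000 records but has been reduced to 5,000 records to keep this example feasible. Each record has 23 features, including demographic information, flight details, and satisfaction level.

Dataset Sample

Gender	Customer Type	Age	Type of Travel	Class	Flight Distance	…	Satisfaction
Female	Loyal Customer	41	Personal Travel	Eco Plus	746	…	Neutral or Dissatisfied
Male	Loyal Customer	53	Business Travel	Business	3095	…	Satisfied
Male	Disloyal Customer	21	Business Travel	Eco	125	…	Satisfied
…	…	…	…	…	…	…	…

The target variable is Satisfaction, which takes the value "Satisfied" or "Neutral or Dissatisfied".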

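3. Data Preprocessing

Effective data preprocessing is crucial for getting good performance from machine learning models. The steps are handling missing data, encoding categorical variables, feature selection, and feature scaling. The snippets in this section assume that X is a pandas DataFrame of features and y is the target.

Handling Missing Data

Numeric data: missing values in numeric columns are filled using a mean imputation strategy.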

from sklearn.impute import SimpleImputer
import numpy as np

# Initialize imputer for numeric data
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Identify numerical columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Fit and transform the data
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])

Categorical data: for string-based columns, a most-frequent (mode) imputation strategy is used.

# Initialize imputer for categorical data
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Identify string columns
string_cols = list(np.where((X.dtypes == object))[0])

# Fit and transform the data
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])

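Encoding Categorical Variables

Categorical variables are converted to numeric form using label encoding and one-hot encoding. In the code below, string columns with exactly two unique values, or with more unique values than threshold, are label-encoded; the remaining string columns are one-hot encoded.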

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

def LabelEncoderMethod(series):
    le = LabelEncoder()
    return le.fit_transform(series)

def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)],
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)

def EncodingSelection(X, threshold=10):
    string_cols = list(np.where((X.dtypes == object))[0])
    one_hot_encoding_indices = []
    
    for col in string_cols:
        unique_vals = len(pd.unique(X[X.columns[col]]))
        if unique_vals == 2 or unique_vals > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)
                
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Apply encoding
X = EncodingSelection(X)

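Feature Selection

To improve model performance and reduce computational complexity, the top 10 features are selected using SelectKBest with the chi-squared (χ²) statistic. Because chi2 requires non-negative feature values, the features are first rescaled to the [0, 1] range with MinMaxScaler.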

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Initialize SelectKBest
kbest = SelectKBest(score_func=chi2, k=10)

# Scale features
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Fit SelectKBest
X_selected = kbest.fit_transform(X_scaled, y)
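
After selection it is often useful to know which columns survived, since the encoded matrix no longer carries column names. The following is a minimal sketch on hypothetical toy data (X_demo, y_demo, and feature_names are illustrative stand-ins, not the airline dataset), using get_support to recover the kept indices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 rows, 5 features, binary target.
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 3] > 0).astype(int)
feature_names = np.array(["f0", "f1", "f2", "f3", "f4"])

# chi2 requires non-negative inputs, so rescale to [0, 1] first.
X_demo_scaled = MinMaxScaler().fit_transform(X_demo)

selector = SelectKBest(score_func=chi2, k=2)
X_demo_selected = selector.fit_transform(X_demo_scaled, y_demo)

# get_support(indices=True) returns the column indices that were kept.
kept = selector.get_support(indices=True)
print(feature_names[kept])
print(X_demo_selected.shape)
```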

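Feature Scaling

Feature scaling puts all features on a comparable scale so that features with large numeric ranges do not dominate the model. The code below assumes the data has already been split into X_train and X_test.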

from sklearn.preprocessing import StandardScaler

# Initialize StandardScaler
sc = StandardScaler(with_mean=False)

# Fit and transform the training data
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
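
The scaling snippet relies on X_train and X_test, which the guide does not define. The following is a minimal sketch of one way to produce them, assuming an 80/20 stratified split (an assumption, though it is consistent with the 1,000-row test set reported in section 5.3 for 5,000 records). The random arrays are stand-ins for the selected features and encoded target.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Stand-ins for the selected feature matrix and the encoded target.
X_selected = rng.random((500, 10))
y = rng.integers(0, 2, size=500)

# Assumed split: 80% train / 20% test, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
sc = StandardScaler(with_mean=False)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training split alone keeps test-set statistics out of the preprocessing step.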

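4. Implementing GridSearchCV

Once the data has been preprocessed, the next step is to set up GridSearchCV to tune the hyperparameters of various machine learning models.

Setting Up Cross-Validation with StratifiedKFold

StratifiedKFold keeps approximately the same proportion of class labels in every cross-validation fold, which is especially important for imbalanced datasets.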

from sklearn.model_selection import StratifiedKFold

# Initialize StratifiedKFold
cv = StratifiedKFold(n_splits=2)
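
The effect of stratification can be checked directly. This is a small sketch on hypothetical imbalanced labels (80 negatives, 20 positives); X_demo, y_demo, and cv_demo are illustrative names.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80 negatives, 20 positives.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))  # features are irrelevant to the split itself

cv_demo = StratifiedKFold(n_splits=2)
fold_positive_rates = []
for train_idx, val_idx in cv_demo.split(X_demo, y_demo):
    # Share of positives in each validation fold.
    fold_positive_rates.append(float(y_demo[val_idx].mean()))

print(fold_positive_rates)  # each fold keeps the overall 20% positive rate
```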

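GridSearchCV Parameters Explained

estimator: the machine learning model to tune.
param_grid: a dictionary defining the hyperparameters to search and their candidate values.
verbose: controls the verbosity of the output; set to 1 to display progress.
cv: the cross-validation splitting strategy; here, the StratifiedKFold object defined above.
scoring: the performance metric to optimize, for example 'f1'.
n_jobs: the number of CPU cores to use; -1 uses all available cores.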

from sklearn.model_selection import GridSearchCV

# Example: Setting up GridSearchCV for KNN
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
params = {
    'n_neighbors': [4, 5, 6, 7],
    'leaf_size': [1, 3, 5],
    'algorithm': ['auto', 'kd_tree'],
    'weights': ['uniform', 'distance']
}

grid_search_cv = GridSearchCV(
    estimator=model,
    param_grid=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

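5. Building and Tuning Machine Learning Models

5.1 K-Nearest Neighbors (KNN)

KNN is a simple yet effective algorithm for classification tasks. GridSearchCV helps select the best number of neighbors, leaf size, algorithm, and weighting scheme.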

# Fit GridSearchCV
grid_search_cv.fit(X_train, y_train)

# Best parameters
print("Best Estimator", grid_search_cv.best_estimator_)
print("Best score", grid_search_cv.best_score_)

Output:

Best Estimator KNeighborsClassifier(leaf_size=1)
Best score 0.8774673417446253
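
The printed estimator shows only parameters that differ from their defaults, so KNeighborsClassifier(leaf_size=1) means the other tuned values ended up at their defaults. To see every tuned value and the score of each candidate, best_params_ and cv_results_ can be inspected. The following is a minimal sketch on synthetic data from make_classification (a stand-in for the airline data), with a reduced grid.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the airline data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [4, 5, 6, 7], "weights": ["uniform", "distance"]},
    cv=StratifiedKFold(n_splits=2),
    scoring="f1",
)
search.fit(X_demo, y_demo)

# best_params_ lists every tuned value, including ones equal to the defaults.
print(search.best_params_)

# cv_results_ holds one row per candidate; sort by rank to compare them.
results = pd.DataFrame(search.cv_results_)
columns = ["params", "mean_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score").head())
```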

5.2 Logistic Regression

Logistic regression models the probability of a binary outcome. GridSearchCV tunes the solver type, penalty, and regularization strength.

from sklearn.linear_model import LogisticRegression

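model = LogisticRegression()
# Among these solvers only liblinear supports the 'l1' penalty, so the grid
# is split to avoid solver/penalty combinations that would fail to fit.
params = [
    {
        'solver': ['newton-cg', 'lbfgs'],
        'penalty': ['l2'],
        'C': [100, 10, 1.0, 0.1, 0.01]
    },
    {
        'solver': ['liblinear'],
        'penalty': ['l1', 'l2'],
        'C': [100, 10, 1.0, 0.1, 0.01]
    }
]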

grid_search_cv = GridSearchCV(
    estimator=model,
    param_grid=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

grid_search_cv.fit(X_train, y_train)
print("Best Estimator", grid_search_cv.best_estimator_)
print("Best score", grid_search_cv.best_score_)

Output:

Best Estimator LogisticRegression(C=0.01, solver='newton-cg')
Best score 0.8295203666687819

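5.3 Gaussian Naive Bayes

Gaussian Naive Bayes assumes that the features follow a normal distribution. It has few hyperparameters, so there is little to search; the code below fits the model with its default settings and evaluates it on the test set instead of running GridSearchCV.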

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

model_GNB = GaussianNB()
model_GNB.fit(X_train, y_train)
y_pred = model_GNB.predict(X_test)

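# scikit-learn expects (y_true, y_pred). The report shown below was generated
# with the arguments reversed, so its precision and recall columns are swapped.
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))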

Output:

0.84
              precision    recall  f1-score   support

           0       0.86      0.86      0.86       564
           1       0.82      0.81      0.82       436

    accuracy                           0.84      1000
   macro avg       0.84      0.84      0.84      1000
weighted avg       0.84      0.84      0.84      1000
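
Although this section fits Gaussian Naive Bayes with defaults, scikit-learn does expose one hyperparameter for it, var_smoothing. If a cross-validated F1-score comparable to the other models is wanted, a small grid over it could be searched. The sketch below uses synthetic data, and the grid range is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the airline data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative grid: var_smoothing from 1e-9 (the default) up to 1e-1.
params = {"var_smoothing": np.logspace(-9, -1, 9)}

search = GridSearchCV(
    estimator=GaussianNB(),
    param_grid=params,
    cv=StratifiedKFold(n_splits=2),
    scoring="f1",
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```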

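5.4 Support Vector Machine (SVM)

SVM is a versatile classifier that works well on both linear and non-linear data. GridSearchCV tunes the kernel type, the regularization parameter C, the polynomial degree, the coefficient coef0, and the kernel coefficient gamma.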

from sklearn.svm import SVC

model = SVC()
params = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'C': [1, 5, 10],
    'degree': [3, 8],
    'coef0': [0.01, 10, 0.5],
    'gamma': ['auto', 'scale']
}

grid_search_cv = GridSearchCV(
    estimator=model,
    param_grid=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

grid_search_cv.fit(X_train, y_train)
print("Best Estimator", grid_search_cv.best_estimator_)
print("Best score", grid_search_cv.best_score_)

Output:

Best Estimator SVC(C=5, coef0=0.01)
Best score 0.9168629045108148
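
In SVC, degree only affects the poly kernel, coef0 only affects poly and sigmoid, and gamma is ignored by the linear kernel, so a single flat grid repeats many equivalent fits. One possible refinement, sketched below as an alternative and not as the grid used for the result above, is to pass a list of per-kernel dictionaries. ParameterGrid is used here only to count candidates.

```python
from sklearn.model_selection import ParameterGrid

# The flat grid from this section.
full_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [1, 5, 10],
    "degree": [3, 8],
    "coef0": [0.01, 10, 0.5],
    "gamma": ["auto", "scale"],
}

# Each dict lists only the parameters its kernel actually uses.
conditional_grid = [
    {"kernel": ["linear"], "C": [1, 5, 10]},
    {"kernel": ["rbf"], "C": [1, 5, 10], "gamma": ["auto", "scale"]},
    {"kernel": ["sigmoid"], "C": [1, 5, 10], "coef0": [0.01, 10, 0.5],
     "gamma": ["auto", "scale"]},
    {"kernel": ["poly"], "C": [1, 5, 10], "degree": [3, 8],
     "coef0": [0.01, 10, 0.5], "gamma": ["auto", "scale"]},
]

print(len(ParameterGrid(full_grid)))         # 4 * 3 * 2 * 3 * 2 = 144
print(len(ParameterGrid(conditional_grid)))  # 3 + 6 + 18 + 36 = 63
```

The list form can be passed to GridSearchCV as param_grid unchanged.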

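5.5 Decision Tree

A decision tree makes predictions by splitting the data on feature values. GridSearchCV optimizes parameters such as the maximum number of leaf nodes and the minimum number of samples required to split an internal node.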

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
params = {
    'max_leaf_nodes': list(range(2, 100)),
    'min_samples_split': [2, 3, 4]
}

grid_search_cv = GridSearchCV(
    estimator=model,
    param_grid=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

grid_search_cv.fit(X_train, y_train)
print("Best Estimator", grid_search_cv.best_estimator_)
print("Best score", grid_search_cv.best_score_)

Output:

Best Estimator DecisionTreeClassifier(max_leaf_nodes=29, min_samples_split=4)
Best score 0.9098148654372425

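5.6 Random Forest

Random forest aggregates many decision trees to improve performance and control overfitting. GridSearchCV tunes the number of estimators, the maximum depth, the number of features considered at each split, the minimum samples per leaf and per split, and so on.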

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
params = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}

grid_search_cv = GridSearchCV(
    estimator=model,
    param_grid=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

grid_search_cv.fit(X_train, y_train)
print("Best Estimator", grid_search_cv.best_estimator_)
print("Best score", grid_search_cv.best_score_)

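Output (the reported max_leaf_nodes=82 and min_samples_split=4 are not values from the grid above, so this result appears to come from a different grid):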

Best Estimator RandomForestClassifier(max_leaf_nodes=82, min_samples_split=4)
Best score 0.9225835186933584

5.7 AdaBoost

AdaBoost combines multiple weak classifiers to form a strong classifier. GridSearchCV tunes the number of estimators and the learning rate.

from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier()
params = {
    'n_estimators': np.arange(10, 300, 10),
    'learning_rate': [0.01, 0.05, 0.1, 1]
}

grid_search_cv = GridSearchCV(
    estimator=model,
    param_grid=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

grid_search_cv.fit(X_train, y_train)
print("Best Estimator", grid_search_cv.best_estimator_)
print("Best score", grid_search_cv.best_score_)

Output:

Best Estimator AdaBoostClassifier(learning_rate=1, n_estimators=30)
Best score 0.8938313525749858

5.8 XGBoost

XGBoost is a highly efficient and scalable implementation of gradient boosting. Because its hyperparameter space is large, GridSearchCV can be time-consuming.

import xgboost as xgb

model = xgb.XGBClassifier()
params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5],
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.3, 0.5, 0.1],
    'reg_lambda': [1, 2]
}

grid_search_cv = GridSearchCV(
    estimator=model,
    param_grid=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

grid_search_cv.fit(X_train, y_train)
print("Best Estimator", grid_search_cv.best_estimator_)
print("Best score", grid_search_cv.best_score_)

Output:

Best Estimator XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0.5, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.01, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=500, n_jobs=12, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1.0,
              tree_method='exact', validate_parameters=1, verbosity=None)
Best score 0.9267223852716081

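Note: The XGBoost GridSearchCV run is very time-consuming. The grid above contains 9,720 hyperparameter combinations, and each one is fitted once per cross-validation fold.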
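
The size of this search can be checked with a quick count before launching it. The sketch below uses only the standard library and copies the grid from this section; n_splits matches the StratifiedKFold(n_splits=2) setting used throughout the guide.

```python
from math import prod

xgb_grid = {
    "min_child_weight": [1, 5, 10],
    "gamma": [0.5, 1, 1.5, 2, 5],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 4, 5],
    "n_estimators": [100, 500, 1000],
    "learning_rate": [0.01, 0.3, 0.5, 0.1],
    "reg_lambda": [1, 2],
}
n_splits = 2  # matches StratifiedKFold(n_splits=2) used in this guide

# Number of candidates is the product of the list lengths.
n_candidates = prod(len(values) for values in xgb_grid.values())
n_fits = n_candidates * n_splits

print(n_candidates)  # 9720
print(n_fits)        # 19440
```

With the default refit=True, GridSearchCV also trains the best candidate once more on the full training set.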

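6. Performance Analysis

After tuning, the models reach different levels of performance. The values below are the best cross-validated F1-scores, except for Gaussian Naive Bayes, whose value is the test-set accuracy reported in section 5.3:

KNN: 0.877
Logistic Regression: 0.830
Gaussian Naive Bayes: 0.840
SVM: 0.917
Decision Tree: 0.910
Random Forest: 0.923
AdaBoost: 0.894
XGBoost: 0.927

Interpretation

XGBoost and Random Forest achieve the highest F1-scores, indicating the strongest performance on this dataset.
SVM and Decision Tree also perform strongly.
KNN and AdaBoost give competitive results with slightly lower F1-scores.
Logistic Regression and Gaussian Naive Bayes are simpler but still deliver decent performance.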

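7. Optimizing GridSearchCV

Because GridSearchCV is computationally intensive, especially with large datasets or extensive parameter grids, it is worth exploring optimization strategies:

7.1 RandomizedSearchCV

Unlike GridSearchCV, RandomizedSearchCV samples a fixed number of parameter settings from specified distributions or lists of values. This approach can greatly reduce computation time while still exploring a wide range of hyperparameters.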

from sklearn.model_selection import RandomizedSearchCV

# Example setup for RandomizedSearchCV
random_search_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=params,
    n_iter=100,  # Number of parameter settings sampled
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1,
    random_state=42
)

random_search_cv.fit(X_train, y_train)
print("Best Estimator", random_search_cv.best_estimator_)
print("Best score", random_search_cv.best_score_)
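
The snippet above reuses a grid of value lists. RandomizedSearchCV can also draw from continuous distributions, which is where it differs most from a grid. The following is a minimal sketch, assuming synthetic data and a log-uniform range for the logistic regression parameter C (both are illustrative choices, not settings from this guide).

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the airline data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# C is drawn from a log-uniform distribution instead of a fixed list.
param_distributions = {"C": loguniform(1e-2, 1e2)}

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=20,  # number of sampled settings
    cv=StratifiedKFold(n_splits=2),
    scoring="f1",
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```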

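7.2 Reducing the Parameter Grid Size

Focus on the hyperparameters that have a significant impact on model performance. Use exploratory analysis or domain knowledge to prioritize specific parameters.

7.3 Using Parallel Processing

Setting n_jobs=-1 in GridSearchCV uses all available CPU cores, which can speed up the computation.

7.4 Early Stopping

Implement an early stopping mechanism that halts the search once a satisfactory performance level is reached, avoiding unnecessary computation. GridSearchCV has no built-in option for this, so it requires a custom search loop.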
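
One way such a mechanism might look is a manual loop over ParameterGrid that scores each candidate with cross_val_score and breaks at a threshold. This is a sketch under stated assumptions: the data is synthetic, the grid is a reduced version of the decision tree grid, and target_f1 is a hypothetical "good enough" value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the airline data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

grid = ParameterGrid({
    "max_leaf_nodes": [2, 4, 8, 16, 32],
    "min_samples_split": [2, 3, 4],
})
cv_demo = StratifiedKFold(n_splits=2)
target_f1 = 0.80  # hypothetical "good enough" threshold

best_score, best_params, n_evaluated = -1.0, None, 0
for candidate in grid:
    model = DecisionTreeClassifier(random_state=0, **candidate)
    score = cross_val_score(model, X_demo, y_demo, cv=cv_demo, scoring="f1").mean()
    n_evaluated += 1
    if score > best_score:
        best_score, best_params = score, candidate
    if best_score >= target_f1:
        break  # stop searching once the threshold is reached

print(n_evaluated, best_params, best_score)
```

The trade-off is that the loop may stop before reaching the best candidate in the grid, and it runs sequentially unless parallelism is added separately.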

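8. Conclusion and Next Steps

GridSearchCV is an essential tool for hyperparameter tuning, offering a systematic approach to improving machine learning model performance. With thorough data preprocessing, strategic parameter grid construction, and computational optimization, data scientists can make full use of GridSearchCV.

Next steps:

Explore RandomizedSearchCV for more efficient hyperparameter tuning.
Apply cross-validation best practices to ensure model robustness.
Incorporate feature engineering techniques to further improve model performance.
Deploy the optimized models and monitor their performance in real-world scenarios.

By mastering GridSearchCV and its optimization methods, you will be equipped to build robust, reliable, high-performing machine learning models across diverse data environments.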
