S28L02 – Random Search CV

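Optimizing Machine Learning Model Tuning: Adopting RandomizedSearchCV over GridSearchCV

Model tuning plays a central role in getting the best performance out of a machine learning model. GridSearchCV has traditionally been the standard method for hyperparameter optimization. As datasets grow in size and complexity, however, GridSearchCV can become a resource-intensive bottleneck. RandomizedSearchCV is a more efficient alternative that delivers comparable results at a much lower computational cost. This article walks through both methods in detail and highlights the advantages of adopting RandomizedSearchCV in large-scale data projects.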

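Table of Contents

    Understanding GridSearchCV and Its Limitations
    Introducing RandomizedSearchCV
    Comparative Analysis: GridSearchCV vs. RandomizedSearchCV
    Data Preparation and Preprocessing
        Loading the Dataset
        Handling Missing Data
        Encoding Categorical Variables
        Feature Selection
        Train-Test Split
        Feature Scaling
    Model Building and Hyperparameter Tuning
        k-Nearest Neighbors (KNN)
        Logistic Regression
        Gaussian Naive Bayes (GaussianNB)
        Support Vector Machine (SVM)
        Decision Tree
        Random Forest
        AdaBoost
        XGBoost
    Results and Performance Evaluation
    Conclusion: When to Choose RandomizedSearchCV
    Resources and Further Reading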




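Understanding GridSearchCV and Its Limitations

GridSearchCV is scikit-learn's tool for exhaustive hyperparameter tuning. It evaluates every combination in a predefined hyperparameter grid and identifies the one that gives the best model performance on a specified metric.

Key features:

    Exhaustive search: evaluates every possible combination in the parameter grid.
    Built-in cross-validation: scores each combination with cross-validation, so the result depends less on any single data split.
    Best estimator selection: returns the best model according to the chosen performance metric.


Limitations:

    Computationally intensive: the number of combinations is the product of the number of values per hyperparameter, so it grows exponentially as hyperparameters are added, and computation time grows with it.
    Memory consumption: evaluating a large number of parameter combinations on a large dataset can strain system resources.
    Diminishing returns: not every parameter combination contributes meaningfully to model performance, so an exhaustive search can be inefficient.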
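
To make the combinatorial growth concrete, here is a minimal sketch that counts the model fits an exhaustive search would need. The grid values mirror the SVM search space used later in this article; the 2-fold setting and the budget of 10 candidates (scikit-learn's default `n_iter` for RandomizedSearchCV) are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import math

# Example grid (same values as the SVM search space later in this article).
param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [1, 5, 10],
    "degree": [3, 8],
    "coef0": [0.01, 10, 0.5],
    "gamma": ["auto", "scale"],
}


def count_grid_fits(grid, n_folds):
    """Model fits needed by an exhaustive grid search (final refit excluded)."""
    n_combinations = math.prod(len(values) for values in grid.values())
    return n_combinations * n_folds


def count_random_fits(n_iter, n_folds):
    """Model fits needed by a random search with a fixed candidate budget."""
    return n_iter * n_folds


grid_fits = count_grid_fits(param_grid, n_folds=2)
random_fits = count_random_fits(n_iter=10, n_folds=2)
print(grid_fits, random_fits)  # 288 20
```

Adding one more hyperparameter with three values would triple the first number and leave the second unchanged.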


Example: even on powerful hardware, tuning with GridSearchCV on a dataset of more than 129,000 records took about 12 hours, which makes it impractical for large-scale applications.



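Introducing RandomizedSearchCV

RandomizedSearchCV offers a practical alternative to GridSearchCV: instead of evaluating every possible combination, it samples a fixed number of hyperparameter combinations from specified distributions.

Advantages:

    Efficiency: caps the number of evaluations, which greatly reduces computation time.
    Flexibility: a distribution can be specified for each hyperparameter, allowing more varied sampling.
    Scalability: better suited to large datasets and complex models.


How it works:
RandomizedSearchCV randomly selects a subset of hyperparameter combinations, evaluates each with cross-validation, and then identifies the best-performing combination according to the chosen metric.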
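
The sampling step can be looked at in isolation with scikit-learn's `ParameterSampler`, which draws candidate settings from a dictionary of lists and distributions. This is a minimal sketch: the search space below is hypothetical and is not the one used in this article.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

# Hypothetical search space: lists are sampled uniformly,
# scipy.stats distributions are sampled through their rvs() method.
param_distributions = {
    "C": loguniform(1e-2, 1e2),      # continuous, log-scaled
    "max_iter": randint(100, 500),   # integers in [100, 500)
    "penalty": ["l1", "l2"],         # categorical choice
}

# Draw a fixed budget of 5 candidate settings; random_state makes it repeatable.
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))

for params in candidates:
    print(params)
```

RandomizedSearchCV then cross-validates each drawn candidate and keeps the best one, so the cost is set by `n_iter` rather than by the size of the search space.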



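Comparative Analysis: GridSearchCV vs. RandomizedSearchCV

Aspect	GridSearchCV	RandomizedSearchCV
Search method	Exhaustive	Random sampling
Computation time	High	Low to moderate
Resource usage	High	Low to moderate
Performance	Potentially the best	Comparable, with less effort
Flexibility	Fixed combinations	Probability-based sampling

In practice, RandomizedSearchCV can cut model tuning time from hours to just minutes with little or no loss in performance.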



Data Preparation and Preprocessing

Effective data preprocessing lays the foundation for successful model training. The step-by-step walkthrough below is based on the provided Jupyter Notebook.

Loading the Dataset

The dataset used is Airline Passenger Satisfaction. The sample loaded here contains roughly 5,000 records (4,999 rows) and 23 columns describing passenger experience and satisfaction levels.

    import pandas as pd
    import seaborn as sns

    # Load the small sample of the dataset
    data = pd.read_csv('Airline2_tiny.csv')
    print(data.shape)  # Output: (4999, 23)
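
The snippets that follow use a feature matrix `X` and a target `y` that this article never defines. A plausible reading, and it is only an assumption, is that the target is the last column of `data` and everything else is a feature. The sketch below shows that split on a tiny stand-in frame with hypothetical column names.

```python
import pandas as pd

# Tiny stand-in for the loaded DataFrame; the column names are hypothetical.
data = pd.DataFrame({
    "Age": [25, 41, 33],
    "Class": ["Eco", "Business", "Eco"],
    "satisfaction": ["satisfied", "dissatisfied", "satisfied"],
})

# Assumption: the target is the last column, everything else is a feature.
X = data.iloc[:, :-1].copy()
y = data.iloc[:, -1]

print(X.shape, y.shape)  # (3, 2) (3,)
```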
					
				
			
		



Handling Missing Data

Numerical Data

Missing values in numerical columns are imputed with the column mean.

    import numpy as np
    from sklearn.impute import SimpleImputer

    # Replace missing numeric values with the column mean
    imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    # Positional indices of the int64 and float64 columns
    numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])
    imp_mean.fit(X.iloc[:, numerical_cols])
    X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
					
				
			
		



Categorical Data

Missing values in categorical columns are imputed with the most frequent value (the mode).

    # Replace missing categorical values with the most frequent category
    imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    # Positional indices of the string (object) columns
    string_cols = list(np.where((X.dtypes == 'object'))[0])
    imp_mode.fit(X.iloc[:, string_cols])
    X.iloc[:, string_cols] = imp_mode.transform(X.iloc[:, string_cols])
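
To see both imputation strategies at work, here is a small self-contained sketch on a toy frame. The column names and values are hypothetical, and selecting columns with `select_dtypes` is a swapped-in alternative to the positional indexing used above, not the notebook's approach.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (hypothetical column names).
X = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Class": ["Eco", "Eco", np.nan],
})

# Select columns by dtype instead of by integer position.
num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

# Mean for numeric columns, most frequent value for categorical columns.
X[num_cols] = SimpleImputer(strategy="mean").fit_transform(X[num_cols])
X[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(X[cat_cols])

print(X)
```

The missing age becomes the mean of the two observed ages, and the missing class becomes the most common class.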
					
				
			
		



Encoding Categorical Variables

Categorical features are encoded with a mix of one-hot encoding and label encoding, chosen by the number of unique categories: columns with exactly two categories, or with more than `threshold` categories, are label-encoded, and the rest are one-hot encoded.

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, LabelEncoder

    def OneHotEncoderMethod(indices, data):
        # One-hot encode the given columns and pass the others through unchanged
        columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), indices)], remainder='passthrough')
        return columnTransformer.fit_transform(data)

    def LabelEncoderMethod(series):
        # Map each category to an integer code
        le = LabelEncoder()
        le.fit(series)
        return le.transform(series)

    def EncodingSelection(X, threshold=10):
        string_cols = list(np.where((X.dtypes == 'object'))[0])
        one_hot_encoding_indices = []

        for col in string_cols:
            length = len(pd.unique(X[X.columns[col]]))
            if length == 2 or length > threshold:
                # Binary or high-cardinality column: label encoding
                X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
            else:
                # Low-cardinality column: one-hot encoding
                one_hot_encoding_indices.append(col)

        X = OneHotEncoderMethod(one_hot_encoding_indices, X)
        return X

    # Apply encoding
    X = EncodingSelection(X)
    print(X.shape)  # Output: (4999, 24)
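
The selection rule inside `EncodingSelection` is easy to check on its own. The sketch below restates it as a small pure-Python function; the function name, the column names, and the category counts are hypothetical.

```python
def choose_encoding(n_unique, threshold=10):
    """Mirror the selection rule above: label-encode binary or
    high-cardinality columns, one-hot encode everything else."""
    if n_unique == 2 or n_unique > threshold:
        return "label"
    return "one-hot"


# Hypothetical columns and their number of distinct categories.
cardinalities = {"Gender": 2, "Class": 3, "Airport": 250}
plan = {name: choose_encoding(n) for name, n in cardinalities.items()}
print(plan)  # {'Gender': 'label', 'Class': 'one-hot', 'Airport': 'label'}
```

The rule keeps the encoded matrix narrow: a 250-category column would otherwise add 250 one-hot columns.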
					
				
			
		



Feature Selection

Keeping only the most relevant features can improve model performance and reduces complexity. The features are ranked by their chi-squared score against the target; because chi2 requires non-negative inputs, the scores are computed on a MinMax-scaled copy of the data.

    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.preprocessing import MinMaxScaler

    kbest = SelectKBest(score_func=chi2, k='all')
    MMS = MinMaxScaler()
    K_features = 10

    # Score every feature on a [0, 1]-scaled copy of X
    x_scaled = MMS.fit_transform(X)
    selector = kbest.fit(x_scaled, y)
    # Indices of the K highest-scoring features, and of the rest
    best_features = np.argsort(selector.scores_)[-K_features:]
    features_to_delete = np.argsort(selector.scores_)[:-K_features]
    # Drop the low-scoring columns from the unscaled matrix
    X = np.delete(X, features_to_delete, axis=1)
    print(X.shape)  # Output: (4999, 10)
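
A more compact way to get the same kind of result is to let `SelectKBest` pick the top `k` columns directly and read the kept columns from `get_support()`. This is a sketch on synthetic data standing in for the encoded feature matrix, not the notebook's code.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the encoded feature matrix.
X, y = make_classification(n_samples=200, n_features=24, n_informative=6, random_state=0)

# chi2 requires non-negative inputs, so score on a [0, 1]-scaled copy.
X_scaled = MinMaxScaler().fit_transform(X)
selector = SelectKBest(score_func=chi2, k=10).fit(X_scaled, y)

# Boolean mask of the kept columns, applied to the unscaled matrix.
mask = selector.get_support()
X_selected = X[:, mask]

print(X_selected.shape)  # (200, 10)
```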
					
				
			
		



Train-Test Split

Splitting the dataset means the model is evaluated on data it has not seen during training, which gives a less biased estimate of its performance. Here 20% of the rows are held out for testing.

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the rows; random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
    print(X_train.shape)  # Output: (3999, 10)
    print(X_test.shape)   # Output: (1000, 10)
					
				
			
		



Feature Scaling

Scaling puts the features on comparable scales, so that no feature dominates a model simply because of its units. The scaler is fit on the training set only and then applied to both sets; with `with_mean=False` the features are divided by their standard deviation without being centered.

    from sklearn.preprocessing import StandardScaler

    # Scale to unit variance without centering
    sc = StandardScaler(with_mean=False)
    # Learn the scaling from the training data only
    sc.fit(X_train)
    X_train = sc.transform(X_train)
    X_test = sc.transform(X_test)
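
The effect of `with_mean=False` is easy to verify on random data. The sketch below uses synthetic arrays as stand-ins for the real training and test sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test matrices.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# with_mean=False: divide by the training standard deviation, do not center.
sc = StandardScaler(with_mean=False).fit(X_train)
X_train_s = sc.transform(X_train)
X_test_s = sc.transform(X_test)   # reuses the training statistics

print(X_train_s.std(axis=0))   # unit standard deviation on the training set
print(X_train_s.mean(axis=0))  # means are rescaled but not shifted to zero
```

Skipping the centering step is also what allows the scaler to work on sparse matrices, which one-hot encoding can produce.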
					
				
			
		





Model Building and Hyperparameter Tuning

With preprocessing complete, it is time to build several machine learning models and tune them with RandomizedSearchCV.

k-Nearest Neighbors (KNN)

KNN is a simple instance-based learning algorithm.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

    model = KNeighborsClassifier()

    # Candidate values for each hyperparameter
    params = {
        'n_neighbors': [4, 5, 6, 7],
        'leaf_size': [1, 3, 5],
        'algorithm': ['auto', 'kd_tree'],
        'weights': ['uniform', 'distance']
    }

    # 2-fold stratified cross-validation, reused by every search below
    cv = StratifiedKFold(n_splits=2)
    # n_iter is left at its default of 10 sampled combinations
    random_search_cv = RandomizedSearchCV(
        estimator=model,
        param_distributions=params,
        verbose=1,
        cv=cv,
        scoring='f1',
        n_jobs=-1
    )

    random_search_cv.fit(X_train, y_train)
    print("Best Estimator", random_search_cv.best_estimator_)  # Output: KNeighborsClassifier(leaf_size=1)
    print("Best score", random_search_cv.best_score_)          # Output: 0.8774673417446253
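
The KNN search above leaves `n_iter` at its default, so only 10 of the 4 * 3 * 2 * 2 = 48 possible combinations are tried, and which ones are drawn changes from run to run. The sketch below sets both `n_iter` and `random_state` explicitly. It runs on synthetic data as a stand-in for the article's training set, and the budget of 12 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data.
X_train, y_train = make_classification(n_samples=300, n_features=10, random_state=0)

params = {
    "n_neighbors": [4, 5, 6, 7],
    "leaf_size": [1, 3, 5],
    "algorithm": ["auto", "kd_tree"],
    "weights": ["uniform", "distance"],
}  # 4 * 3 * 2 * 2 = 48 possible combinations

search = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions=params,
    n_iter=12,          # evaluate 12 of the 48 combinations
    cv=StratifiedKFold(n_splits=2),
    scoring="f1",
    random_state=0,     # fixes which combinations are drawn
)
search.fit(X_train, y_train)

print(len(search.cv_results_["params"]))  # number of evaluated candidates
print(search.best_params_)
```

Raising `n_iter` trades more computation for better coverage of the search space; `cv_results_` holds the score of every sampled candidate for later inspection.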
					
				
			
		



Logistic Regression

A probabilistic model used for binary classification tasks.

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()

    # Note: 'newton-cg' and 'lbfgs' support only the 'l2' penalty, so sampled
    # combinations that pair them with 'l1' fail to fit and receive no score.
    params = {
        'solver': ['newton-cg', 'lbfgs', 'liblinear'],
        'penalty': ['l1', 'l2'],
        'C': [100, 10, 1.0, 0.1, 0.01]
    }

    random_search_cv = RandomizedSearchCV(
        estimator=model,
        param_distributions=params,
        verbose=1,
        cv=cv,
        scoring='f1',
        n_jobs=-1
    )

    random_search_cv.fit(X_train, y_train)
    print("Best Estimator", random_search_cv.best_estimator_)  # Output: LogisticRegression(C=0.01)
    print("Best score", random_search_cv.best_score_)          # Output: 0.8295203666687819
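
In the logistic regression grid above, some solver and penalty pairs are incompatible: `newton-cg` and `lbfgs` do not support the `l1` penalty. One way to keep the sampled candidates valid is to describe the search space as a list of dictionaries, one per compatible group. The sketch below checks the idea with `ParameterSampler`; the grouping is an illustration rather than the notebook's setup, and the same list can be passed as `param_distributions` to RandomizedSearchCV.

```python
from sklearn.model_selection import ParameterSampler

# One dict per group of solvers, each listing only the penalties it supports.
param_distributions = [
    {"solver": ["newton-cg", "lbfgs"], "penalty": ["l2"],
     "C": [100, 10, 1.0, 0.1, 0.01]},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"],
     "C": [100, 10, 1.0, 0.1, 0.01]},
]

# Sample 8 of the 20 valid combinations.
candidates = list(ParameterSampler(param_distributions, n_iter=8, random_state=0))
for params in candidates:
    print(params)
```

No part of the search budget is then spent on combinations that cannot be fit.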
					
				
			
		



Gaussian Naive Bayes (GaussianNB)

A simple yet effective probabilistic classifier based on Bayes' theorem. It is fit directly here, without a hyperparameter search.

    from sklearn.naive_bayes import GaussianNB
    from sklearn.metrics import accuracy_score, classification_report

    model_GNB = GaussianNB()
    model_GNB.fit(X_train, y_train)
    y_pred = model_GNB.predict(X_test)
    # scikit-learn metrics expect (y_true, y_pred) in that order
    print(accuracy_score(y_test, y_pred))  # Output: 0.84
    print(classification_report(y_test, y_pred))

Output:

The report below is reproduced from the source notebook, which passed the arguments as (y_pred, y_test). Its precision and recall columns are therefore swapped relative to the corrected call above, and its support column counts predicted labels rather than true labels.

                  precision    recall  f1-score   support

               0       0.86      0.86      0.86       564
               1       0.82      0.81      0.82       436

        accuracy                           0.84      1000
       macro avg       0.84      0.84      0.84      1000
    weighted avg       0.84      0.84      0.84      1000
					
				
			
		



Support Vector Machine (SVM)

A robust classifier that is effective in high-dimensional spaces.

    from sklearn.svm import SVC

    model = SVC()

    # 'degree' only affects the 'poly' kernel; 'coef0' only 'poly' and 'sigmoid'
    params = {
        'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'C': [1, 5, 10],
        'degree': [3, 8],
        'coef0': [0.01, 10, 0.5],
        'gamma': ['auto', 'scale']
    }

    random_search_cv = RandomizedSearchCV(
        estimator=model,
        param_distributions=params,
        verbose=1,
        cv=cv,
        scoring='f1',
        n_jobs=-1
    )

    random_search_cv.fit(X_train, y_train)
    print("Best Estimator", random_search_cv.best_estimator_)  # Output: SVC(C=10, coef0=0.5, degree=8)
    print("Best score", random_search_cv.best_score_)          # Output: 0.9165979221213969
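
Not every entry in the SVM grid above yields a different model, because some hyperparameters are ignored by some kernels. The sketch below counts how many of the 144 grid points are distinct once ignored settings are dropped. The kernel-to-parameter mapping follows the SVC parameter descriptions; the counting code itself is an illustration.

```python
from itertools import product

grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [1, 5, 10],
    "degree": [3, 8],
    "coef0": [0.01, 10, 0.5],
    "gamma": ["auto", "scale"],
}

# Which hyperparameters each kernel actually uses.
used_by_kernel = {
    "linear": {"C"},
    "rbf": {"C", "gamma"},
    "poly": {"C", "degree", "coef0", "gamma"},
    "sigmoid": {"C", "coef0", "gamma"},
}

names = list(grid)
all_combos = list(product(*grid.values()))
distinct = set()
for values in all_combos:
    combo = dict(zip(names, values))
    relevant = used_by_kernel[combo["kernel"]] | {"kernel"}
    # Keep only the settings that change the fitted model.
    distinct.add(tuple(sorted((k, str(v)) for k, v in combo.items() if k in relevant)))

total = len(all_combos)
print(total, len(distinct))  # 144 63
```

An exhaustive grid search would refit many equivalent models here, which is one reason a sampled search loses little on grids like this.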
					
				
			
		



Decision Tree

A hierarchical model that makes decisions through a sequence of feature splits.

    from sklearn.tree import DecisionTreeClassifier

    model = DecisionTreeClassifier()

    # 98 * 3 = 294 possible combinations, of which 10 are sampled by default
    params = {
        'max_leaf_nodes': list(range(2, 100)),
        'min_samples_split': [2, 3, 4]
    }

    random_search_cv = RandomizedSearchCV(
        estimator=model,
        param_distributions=params,
        verbose=1,
        cv=cv,
        scoring='f1',
        n_jobs=-1
    )

    random_search_cv.fit(X_train, y_train)
    print("Best Estimator", random_search_cv.best_estimator_)  # Output: DecisionTreeClassifier(max_leaf_nodes=30, min_samples_split=4)
    print("Best score", random_search_cv.best_score_)          # Output: 0.9069240944070234
					
				
			
		



Random Forest

An ensemble method that combines many decision trees to improve predictive performance.

    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier()

    params = {
        'bootstrap': [True],
        'max_depth': [80, 90, 100, 110],
        'max_features': [2, 3],
        'min_samples_leaf': [3, 4, 5],
        'min_samples_split': [8, 10, 12],
        'n_estimators': [100, 200, 300, 1000]
    }

    random_search_cv = RandomizedSearchCV(
        estimator=model,
        param_distributions=params,
        verbose=1,
        cv=cv,
        scoring='f1',
        n_jobs=-1
    )

    random_search_cv.fit(X_train, y_train)
    # The estimator printed in the source lists values that are not in the grid
    # above (max_leaf_nodes=96, min_samples_split=3), so that line was probably
    # captured from a different run.
    print("Best Estimator", random_search_cv.best_estimator_)  # Output: RandomForestClassifier(max_leaf_nodes=96, min_samples_split=3)
    print("Best score", random_search_cv.best_score_)          # Output: 0.9227615146702333
					
				
			
		



AdaBoost

A boosting ensemble method that combines several weak learners into a strong one.

    from sklearn.ensemble import AdaBoostClassifier

    model = AdaBoostClassifier()

    # n_estimators: 10, 20, ..., 290
    params = {
        'n_estimators': np.arange(10, 300, 10),
        'learning_rate': [0.01, 0.05, 0.1, 1]
    }

    random_search_cv = RandomizedSearchCV(
        estimator=model,
        param_distributions=params,
        verbose=1,
        cv=cv,
        scoring='f1',
        n_jobs=-1
    )

    random_search_cv.fit(X_train, y_train)
    print("Best Estimator", random_search_cv.best_estimator_)  # Output: AdaBoostClassifier(learning_rate=0.1, n_estimators=200)
    print("Best score", random_search_cv.best_score_)          # Output: 0.8906331862757826
					
				
			
		



XGBoost

An optimized gradient boosting framework known for its performance and speed.

    import xgboost as xgb
    from sklearn.metrics import accuracy_score, classification_report

    model = xgb.XGBClassifier()

    params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5],
        'n_estimators': [100, 500, 1000],
        'learning_rate': [0.01, 0.3, 0.5, 0.1],
        'reg_lambda': [1, 2]
    }

    random_search_cv = RandomizedSearchCV(
        estimator=model,
        param_distributions=params,
        verbose=1,
        cv=cv,
        scoring='f1',
        n_jobs=-1
    )

    random_search_cv.fit(X_train, y_train)
    print("Best Estimator", random_search_cv.best_estimator_)  # Output: XGBClassifier with best parameters
    print("Best score", random_search_cv.best_score_)          # Output: 0.922052180776655

    # Model evaluation on the held-out test set.
    # best_estimator_ is already refit on the full training set (refit=True by
    # default), so the extra fit call in the source notebook is not needed.
    model_best = random_search_cv.best_estimator_
    y_pred = model_best.predict(X_test)
    # scikit-learn metrics expect (y_true, y_pred) in that order
    print(accuracy_score(y_test, y_pred))  # Output: 0.937
    print(classification_report(y_test, y_pred))

Output:

The report below is reproduced from the source notebook, which passed the arguments as (y_pred, y_test). Its precision and recall columns are therefore swapped relative to the corrected call above, and its support column counts predicted labels rather than true labels.

                  precision    recall  f1-score   support

               0       0.96      0.93      0.95       583
               1       0.91      0.94      0.93       417

        accuracy                           0.94      1000
       macro avg       0.93      0.94      0.94      1000
    weighted avg       0.94      0.94      0.94      1000
					
				
			
		





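Results and Performance Evaluation

The effectiveness of RandomizedSearchCV shows in the model results. The F1 scores are the best cross-validated scores from each search; the accuracy figures are measured on the test set.

    KNN reached an F1 score of about 0.877.
    Logistic Regression reached an F1 score of about 0.830.
    GaussianNB, which was not tuned, reached 84% accuracy.
    SVM stood out with an F1 score of about 0.917.
    Decision Tree reached an F1 score of about 0.907.
    Random Forest led with an F1 score of about 0.923.
    AdaBoost reached an F1 score of about 0.891.
    XGBoost performed strongly, with an F1 score of about 0.922 and 93.7% accuracy.


Key observations:

    RandomForestClassifier and XGBoost delivered the best performance.
    RandomizedSearchCV cut tuning time from more than 12 hours (GridSearchCV) to just a few minutes, while model accuracy was maintained.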
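
For a quick side-by-side view, the reported cross-validated F1 scores can be ranked with a few lines of Python. The numbers are the ones listed above, rounded to three decimals; GaussianNB is left out because it was not tuned with RandomizedSearchCV and has no cross-validated F1 score.

```python
# Cross-validated F1 scores as reported in this article (rounded).
reported_f1 = {
    "KNN": 0.877,
    "Logistic Regression": 0.830,
    "SVM": 0.917,
    "Decision Tree": 0.907,
    "Random Forest": 0.923,
    "AdaBoost": 0.891,
    "XGBoost": 0.922,
}

# Sort from best to worst score.
ranking = sorted(reported_f1.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<20} {score:.3f}")

best_name, best_score = ranking[0]
```

The gap between the top two models is about 0.001, which is small enough that a different random draw of candidates could reverse their order.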




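Conclusion: When to Choose RandomizedSearchCV

GridSearchCV provides exhaustive hyperparameter tuning, but its computational demands can be prohibitive on large datasets. RandomizedSearchCV is a practical solution that balances efficiency and performance. It is especially useful in the following cases:

    Limited time: when models need to be tuned quickly.
    Limited computational resources: when the load on system resources must be kept down.
    High-dimensional hyperparameter spaces: when the full grid is too large to search exhaustively.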
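
A rough way to pick a search budget is to ask how likely a random search is to land on a near-best setting. Assuming independent, uniform draws from the search space (a simplification), the chance that at least one of `n_iter` draws falls in the best 5% of settings is 1 - 0.95^n_iter. The function name below is hypothetical.

```python
def prob_hit_top_fraction(n_iter, top_fraction=0.05):
    """Probability that at least one of n_iter independent uniform draws
    lands in the best `top_fraction` of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_iter


for n in (10, 30, 60):
    print(n, round(prob_hit_top_fraction(n), 3))
```

Under this assumption, the default of 10 candidates finds a top-5% setting about 40% of the time, while 60 candidates push that above 95% regardless of how large the grid is.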


Adopting RandomizedSearchCV streamlines the machine learning workflow, letting practitioners spend less time on lengthy tuning runs and more on model interpretation and deployment.



Resources and Further Reading


    Scikit-learn documentation: RandomizedSearchCV
    Kaggle: Airline Passenger Satisfaction Dataset
    XGBoost official documentation
    A comprehensive guide to hyperparameter tuning




By using RandomizedSearchCV, machine learning practitioners can tune models efficiently and effectively, which supports scalable, high-performing solutions in data-driven applications.