Optimizing Machine Learning Model Tuning: Embracing RandomizedSearchCV Over GridSearchCV

In machine learning, hyperparameter tuning often decides whether a model reaches its potential. GridSearchCV has traditionally been the go-to method for hyperparameter optimization, but as datasets grow in size and complexity it can become a resource-intensive bottleneck. RandomizedSearchCV is a more efficient alternative that often delivers comparable results at a fraction of the computational cost. This article examines both methods and highlights the advantages of adopting RandomizedSearchCV for large-scale data projects.

Table of Contents

Understanding GridSearchCV and Its Limitations
Introducing RandomizedSearchCV
Comparative Analysis: GridSearchCV vs. RandomizedSearchCV
Data Preparation and Preprocessing
Model Building and Hyperparameter Tuning
Results and Performance Evaluation
Conclusion: When to Choose RandomizedSearchCV
Resources and Further Reading

Understanding GridSearchCV and Its Limitations

GridSearchCV is a powerful tool in scikit-learn used for hyperparameter tuning. It exhaustively searches through a predefined set of hyperparameters to identify the combination that yields the best model performance based on a specified metric.

Key Characteristics:

Exhaustive Search: Evaluates all possible combinations in the parameter grid.
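Cross-Validation Integration: Scores every combination with cross-validation, so the comparison does not hinge on a single train/validation split.
Best Estimator Selection: Refits the best-scoring combination on the full training data and exposes it as best_estimator_.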

Limitations:

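Computationally Intensive: The number of combinations is the product of the number of values tried for each hyperparameter, so it grows exponentially with the number of hyperparameters, and every combination is fitted once per cross-validation fold.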
Memory Consumption: Handling large datasets with numerous parameter combinations can strain system resources.
Diminishing Returns: Not all parameter combinations contribute significantly to model performance, making exhaustive search inefficient.
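
To make the growth concrete, the following sketch counts the model fits an exhaustive search would need for the XGBoost grid used later in this article, and compares that with a randomized search at scikit-learn's default of 10 sampled candidates. It uses only the standard library; the variable names are illustrative, and the final refit that both searches perform is left out of the count.

```python
from math import prod

# Hyperparameter grid copied from the XGBoost section of this article.
xgb_grid = {
    "min_child_weight": [1, 5, 10],
    "gamma": [0.5, 1, 1.5, 2, 5],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 4, 5],
    "n_estimators": [100, 500, 1000],
    "learning_rate": [0.01, 0.3, 0.5, 0.1],
    "reg_lambda": [1, 2],
}

n_folds = 2  # the article uses StratifiedKFold(n_splits=2)

# An exhaustive search fits every combination once per fold.
n_combinations = prod(len(values) for values in xgb_grid.values())
grid_fits = n_combinations * n_folds

# A randomized search with the default n_iter=10 fits 10 candidates per fold.
random_fits = 10 * n_folds

print(n_combinations)  # 9720
print(grid_fits)       # 19440
print(random_fits)     # 20
```

Adding one more hyperparameter with three values would triple the first two numbers and leave the third unchanged.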

Case in Point: Processing a dataset with over 129,000 records using GridSearchCV took approximately 12 hours, even with robust hardware. This showcases its impracticality for large-scale applications.

Introducing RandomizedSearchCV

RandomizedSearchCV offers a pragmatic alternative to GridSearchCV by sampling a fixed number of hyperparameter combinations from the specified distributions, rather than evaluating all possible combinations.

Advantages:

Efficiency: Significantly reduces computation time by limiting the number of evaluations.
Flexibility: Allows specifying distributions for each hyperparameter, enabling more diverse sampling.
Scalability: Better suited for large datasets and complex models.
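
As a rough illustration of the flexibility point, the sketch below mixes continuous, integer, and categorical hyperparameters in one search space and draws a few candidates with scikit-learn's ParameterSampler, the same sampling mechanism RandomizedSearchCV relies on. The search space itself is an assumption made for the example and is not taken from the article's notebook.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

# Illustrative search space (an assumption, not the article's).
param_distributions = {
    "C": loguniform(1e-2, 1e2),     # continuous, log-uniform between 0.01 and 100
    "max_iter": randint(100, 501),  # integers from 100 to 500 inclusive
    "penalty": ["l1", "l2"],        # lists are sampled uniformly
}

# Draw five candidate settings; random_state makes the draw repeatable.
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))

for candidate in candidates:
    print(candidate)
```

A grid can only list a handful of values for C, whereas a distribution lets every draw try a different one.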

How It Works:

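RandomizedSearchCV draws n_iter hyperparameter combinations at random (10 by default), evaluates each with cross-validation, and returns the best-performing combination according to the chosen metric. When every hyperparameter is given as a list, combinations are sampled without replacement; when at least one is given as a distribution, sampling is with replacement.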
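
The mechanics can be sketched in a few lines. This minimal example runs on synthetic data from make_classification rather than the airline dataset, and the decision-tree search space is an assumption chosen to keep the run fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the article's airline dataset is not used here.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_distributions={
        "max_depth": [2, 3, 4, 5, 6],
        "min_samples_split": [2, 4, 8],
    },
    n_iter=5,        # evaluate 5 of the 15 possible combinations
    cv=3,
    scoring="f1",
    random_state=0,  # fixes which combinations are sampled
)
search.fit(X_demo, y_demo)

print(len(search.cv_results_["params"]))  # 5 candidates were evaluated
print(search.best_params_)
```

Setting random_state is optional, but without it each run samples different candidates and may report a different best estimator.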

Comparative Analysis: GridSearchCV vs. RandomizedSearchCV

Aspect	GridSearchCV	RandomizedSearchCV
Search Method	Exhaustive	Random Sampling
Computation Time	High	Low to Medium
Resource Usage	High	Moderate to Low
Performance	Potentially Best	Comparable with Less Effort
Flexibility	Fixed Combinations	Probability-Based Sampling

In practice, RandomizedSearchCV can cut model tuning time from hours to minutes, usually without a significant drop in performance.
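
A small side-by-side run shows the trade-off. The sketch below applies both searches to the KNN grid from this article on synthetic data (an assumption made so the example is self-contained) and reports how many candidates each one evaluated along with the best score each found.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1)

# The KNN grid from this article: 4 * 3 * 2 * 2 = 48 combinations.
params = {
    "n_neighbors": [4, 5, 6, 7],
    "leaf_size": [1, 3, 5],
    "algorithm": ["auto", "kd_tree"],
    "weights": ["uniform", "distance"],
}

grid = GridSearchCV(KNeighborsClassifier(), params, cv=2, scoring="f1")
grid.fit(X_demo, y_demo)

rand = RandomizedSearchCV(
    KNeighborsClassifier(), params, n_iter=10, cv=2, scoring="f1", random_state=1
)
rand.fit(X_demo, y_demo)

n_grid = len(grid.cv_results_["params"])  # 48
n_rand = len(rand.cv_results_["params"])  # 10
print(n_grid, n_rand)
print(grid.best_score_, rand.best_score_)
```

The randomized search can never beat the grid on the same candidates and folds, so the question is only how close it gets for roughly a fifth of the work.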

Data Preparation and Preprocessing

Effective data preprocessing lays the foundation for successful model training. Here’s a step-by-step walkthrough based on the provided Jupyter Notebook.

Loading the Dataset

The dataset used is Airline Passenger Satisfaction from Kaggle. The reduced file loaded here, Airline2_tiny.csv, contains 4,999 records and 23 columns describing passenger experiences and satisfaction levels.

import pandas as pd 
import seaborn as sns

# Loading the small dataset
data = pd.read_csv('Airline2_tiny.csv')
print(data.shape)  # Output: (4999, 23)
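
The later snippets operate on a feature table X and a target y. One way to obtain them is sketched below on a toy frame; the column names, including the target name satisfaction and its label values, are assumptions and should be checked against the actual file.

```python
import pandas as pd

# Toy stand-in for the airline file; the column names are assumptions.
data_demo = pd.DataFrame({
    "Age": [25, 41, 33],
    "Class": ["Eco", "Business", "Eco"],
    "satisfaction": ["satisfied", "neutral or dissatisfied", "satisfied"],
})

target_column = "satisfaction"  # assumed name of the label column

X_demo = data_demo.drop(columns=[target_column])
# Encode the label as 0/1 so that scoring='f1' has a well-defined positive class.
y_demo = (data_demo[target_column] == "satisfied").astype(int)

print(X_demo.columns.tolist())
print(y_demo.tolist())
```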

Handling Missing Data

Numeric Data

Missing numeric values are imputed using the mean strategy. The code below assumes the features and target have already been separated from data into X and y.

import numpy as np
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])

Categorical Data

Missing categorical values are imputed using the most frequent strategy.

imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
string_cols = list(np.where((X.dtypes == 'object'))[0])
imp_mode.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_mode.transform(X.iloc[:, string_cols])
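
The effect of both strategies is easiest to see on a tiny table. This sketch uses a made-up four-row frame (the column names are illustrative) and selects columns by dtype with select_dtypes instead of positional indices.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one gap per column; column names are illustrative.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "travel_class": ["Eco", "Business", np.nan, "Eco"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Mean for numeric columns, most frequent value for the rest.
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

print(df)
```

The missing age becomes 30.0, the mean of the three observed values, and the missing class becomes Eco, the most frequent one.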

Encoding Categorical Variables

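Categorical features are encoded with either Label Encoding or One-Hot Encoding, depending on the number of unique categories: columns with exactly two categories or with more than threshold (10 by default) categories are label encoded, and the remaining columns are one-hot encoded.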

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), indices )], remainder='passthrough')
    return columnTransformer.fit_transform(data)

def LabelEncoderMethod(series):
    le = LabelEncoder()
    le.fit(series)
    return le.transform(series) 

def EncodingSelection(X, threshold=10):
    string_cols = list(np.where((X.dtypes == 'object'))[0])
    one_hot_encoding_indices = []
    
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)
            
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Apply encoding
X = EncodingSelection(X)
print(X.shape)  # Output: (4999, 24)
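
The same routing rule can also be expressed as a single ColumnTransformer. The sketch below is a variation, not the notebook's code: it swaps LabelEncoder for OrdinalEncoder, which scikit-learn provides for feature columns, and runs on a made-up frame with illustrative column names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],                          # 2 categories
    "travel_class": ["Eco", "Business", "Eco Plus", "Eco"],  # 3 categories
    "age": [25, 41, 33, 52],
})

threshold = 10
cat_cols = ["gender", "travel_class"]
n_unique = df[cat_cols].nunique()

# Same rule as EncodingSelection: binary or high-cardinality -> integer codes.
ordinal_cols = [c for c in cat_cols if n_unique[c] == 2 or n_unique[c] > threshold]
onehot_cols = [c for c in cat_cols if c not in ordinal_cols]

encoder = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), onehot_cols),
        ("ordinal", OrdinalEncoder(), ordinal_cols),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = encoder.fit_transform(df)
print(encoded.shape)  # (4, 5): 3 one-hot columns + 1 ordinal column + age
```

Because the fitted transformer is a single object, it can be reused on new data without recomputing the category mappings.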

Feature Selection

Selecting the most relevant features enhances model performance and reduces complexity.

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

K_features = 10

# chi2 needs non-negative inputs, so score a min-max scaled copy of X
kbest = SelectKBest(score_func=chi2, k='all')
x_scaled = MinMaxScaler().fit_transform(X)
kbest.fit(x_scaled, y)

# Keep the K_features highest-scoring columns and drop the rest from X
ranked = np.argsort(kbest.scores_)
best_features = ranked[-K_features:]
features_to_delete = ranked[:-K_features]
X = np.delete(X, features_to_delete, axis=1)
print(X.shape)  # Output: (4999, 10)
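
For comparison, SelectKBest can perform the selection itself when k is set to the number of features to keep. The sketch below shows this on synthetic non-negative data in which only the first column depends on the label; the data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_demo = np.array([0] * 50 + [1] * 50)

# Five non-negative features (chi2 requires non-negative inputs);
# only column 0 depends on the label.
X_demo = rng.uniform(0, 1, size=(100, 5))
X_demo[:, 0] += 5 * y_demo

selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)

kept = selector.get_support(indices=True)  # indices of the retained columns
print(kept)
print(X_selected.shape)  # (100, 2)
```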

Train-Test Split

Splitting the dataset ensures that the model is evaluated on unseen data, facilitating unbiased performance metrics.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
print(X_train.shape)  # Output: (3999, 10)
print(X_test.shape)   # Output: (1000, 10)
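
When the classes are imbalanced, an optional variation is to pass stratify so that both splits keep the same class ratio. The split above does not do this; the sketch below shows the effect on made-up labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 negatives, 20 positives.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())  # both 0.2
```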

Feature Scaling

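Scaling puts all features on a comparable range, so that features with large numeric values do not dominate distance-based models such as KNN and SVM. The scaler is fitted on the training set only and then applied to both sets; with_mean=False scales to unit variance without centering.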

from sklearn.preprocessing import StandardScaler

sc = StandardScaler(with_mean=False)
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
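
One refinement worth considering is to place the scaler inside a Pipeline, so that during cross-validation it is refitted on each training fold instead of seeing the validation fold. This is a sketch of that alternative on synthetic data, not the notebook's approach; the step names scale and knn are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Step parameters are addressed as <step name>__<parameter>.
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "knn__n_neighbors": [3, 5, 7, 9, 11],
        "knn__weights": ["uniform", "distance"],
    },
    n_iter=5,
    cv=3,
    scoring="f1",
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```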

Model Building and Hyperparameter Tuning

With the data preprocessed, it’s time to build and optimize various machine learning models using RandomizedSearchCV.

K-Nearest Neighbors (KNN)

KNN is a simple, instance-based learning algorithm.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

model = KNeighborsClassifier()

params = {
    'n_neighbors': [4, 5, 6, 7],
    'leaf_size': [1, 3, 5],
    'algorithm': ['auto', 'kd_tree'],
    'weights': ['uniform', 'distance']
}

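# Two stratified folds keep each cross-validation run cheap
cv = StratifiedKFold(n_splits=2)
random_search_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=params,  # n_iter defaults to 10, so 10 of the 48 combinations are tried
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1                    # use all available CPU cores
)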

random_search_cv.fit(X_train, y_train)
print("Best Estimator", random_search_cv.best_estimator_)  # Output: KNeighborsClassifier(leaf_size=1)
print("Best score", random_search_cv.best_score_)          # Output: 0.8774673417446253

Logistic Regression

A probabilistic model used for binary classification tasks.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

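# 'newton-cg' and 'lbfgs' do not support the 'l1' penalty; sampled combinations
# that pair them fail to fit and are scored as NaN by the search.
params = {
    'solver': ['newton-cg', 'lbfgs', 'liblinear'],
    'penalty': ['l1', 'l2'],
    'C': [100, 10, 1.0, 0.1, 0.01]
}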

random_search_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

random_search_cv.fit(X_train, y_train)
print("Best Estimator", random_search_cv.best_estimator_)  # Output: LogisticRegression(C=0.01)
print("Best score", random_search_cv.best_score_)          # Output: 0.8295203666687819
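
Because some solvers accept only certain penalties, one option is to pass param_distributions as a list of dictionaries, each describing a sub-space in which every combination is valid. The sketch below illustrates this on synthetic data; the grouping of solvers and the max_iter value are choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

C_values = [100, 10, 1.0, 0.1, 0.01]

# Each dictionary is its own sub-space, so only valid pairs are ever sampled.
param_distributions = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": C_values},
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "C": C_values},
]

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=6,
    cv=2,
    scoring="f1",
    random_state=0,
)
search.fit(X_demo, y_demo)

# No candidate failed to fit, so no score is NaN.
print(np.isnan(search.cv_results_["mean_test_score"]).any())
print(search.best_params_)
```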

Gaussian Naive Bayes (GaussianNB)

A simple yet effective probabilistic classifier based on Bayes’ theorem.

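from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# GaussianNB is fitted with its default settings; no search is run for it
model_GNB = GaussianNB()
model_GNB.fit(X_train, y_train)
y_pred = model_GNB.predict(X_test)
print(accuracy_score(y_pred, y_test))  # Output: 0.84
# Note: scikit-learn documents these functions as (y_true, y_pred). Accuracy is
# unaffected by the order, but in the report below precision and recall are
# swapped and "support" counts predicted rather than true labels.
print(classification_report(y_pred, y_test))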

Output:

              precision    recall  f1-score   support

           0       0.86      0.86      0.86       564
           1       0.82      0.81      0.82       436

    accuracy                           0.84      1000
   macro avg       0.84      0.84      0.84      1000
weighted avg       0.84      0.84      0.84      1000

Support Vector Machine (SVM)

A robust classifier effective in high-dimensional spaces.

from sklearn.svm import SVC

model = SVC()

params = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'C': [1, 5, 10],
    'degree': [3, 8],
    'coef0': [0.01, 10, 0.5],
    'gamma': ['auto', 'scale']
}

random_search_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

random_search_cv.fit(X_train, y_train)
print("Best Estimator", random_search_cv.best_estimator_)  # Output: SVC(C=10, coef0=0.5, degree=8)
print("Best score", random_search_cv.best_score_)          # Output: 0.9165979221213969

Decision Tree

A hierarchical model that makes decisions based on feature splits.

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

params = {
    'max_leaf_nodes': list(range(2, 100)),
    'min_samples_split': [2, 3, 4]
}

random_search_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

random_search_cv.fit(X_train, y_train)
print("Best Estimator", random_search_cv.best_estimator_)  # Output: DecisionTreeClassifier(max_leaf_nodes=30, min_samples_split=4)
print("Best score", random_search_cv.best_score_)          # Output: 0.9069240944070234

Random Forest

An ensemble method leveraging multiple decision trees to enhance predictive performance.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

params = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}

random_search_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

random_search_cv.fit(X_train, y_train)
print("Best Estimator", random_search_cv.best_estimator_)  # Output: RandomForestClassifier with the best parameters found
print("Best score", random_search_cv.best_score_)          # Output: 0.9227615146702333

AdaBoost

A boosting ensemble method that combines multiple weak learners to form a strong learner.

from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier()

params = {
    'n_estimators': np.arange(10, 300, 10),
    'learning_rate': [0.01, 0.05, 0.1, 1]
}

random_search_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

random_search_cv.fit(X_train, y_train)
print("Best Estimator", random_search_cv.best_estimator_)  # Output: AdaBoostClassifier(learning_rate=0.1, n_estimators=200)
print("Best score", random_search_cv.best_score_)          # Output: 0.8906331862757826

XGBoost

An optimized gradient boosting framework known for its performance and speed.

import xgboost as xgb
from sklearn.metrics import accuracy_score, classification_report

model = xgb.XGBClassifier()

params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5],
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.3, 0.5, 0.1],
    'reg_lambda': [1, 2]
}

random_search_cv = RandomizedSearchCV(
    estimator=model,
    param_distributions=params,
    verbose=1,
    cv=cv,
    scoring='f1',
    n_jobs=-1
)

random_search_cv.fit(X_train, y_train)
print("Best Estimator", random_search_cv.best_estimator_)  # Output: XGBClassifier with best parameters
print("Best score", random_search_cv.best_score_)          # Output: 0.922052180776655

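# Model Evaluation
# best_estimator_ has already been refit on X_train (refit=True is the default),
# so the explicit fit below repeats that step.
model_best = random_search_cv.best_estimator_
model_best.fit(X_train, y_train)
y_pred = model_best.predict(X_test)
print(accuracy_score(y_pred, y_test))  # Output: 0.937
# Note: the arguments are passed as (y_pred, y_test) rather than the documented
# (y_true, y_pred), so precision and recall are swapped in the report below.
print(classification_report(y_pred, y_test))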
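
The evaluation step can be written a little more compactly. The sketch below uses a decision tree on synthetic data as a stand-in for XGBoost (an assumption that keeps the example free of extra dependencies), predicts directly through the fitted search object, and passes the true labels first to the metric functions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_leaf_nodes": list(range(2, 30)), "min_samples_split": [2, 3, 4]},
    n_iter=5, cv=2, scoring="f1", random_state=0,
)
search.fit(X_tr, y_tr)  # refit=True (default) retrains the best candidate on all of X_tr

# The search object delegates predict() to the refitted best_estimator_.
y_pred = search.predict(X_te)
same = np.array_equal(y_pred, search.best_estimator_.predict(X_te))
print(same)

# scikit-learn's metric functions take the true labels first.
test_accuracy = accuracy_score(y_te, y_pred)
print(test_accuracy)
print(classification_report(y_te, y_pred))
```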

Output:

              precision    recall  f1-score   support

           0       0.96      0.93      0.95       583
           1       0.91      0.94      0.93       417

    accuracy                           0.94      1000
   macro avg       0.93      0.94      0.94      1000
weighted avg       0.94      0.94      0.94      1000

Results and Performance Evaluation

The scores below are the best cross-validated F1-scores found by RandomizedSearchCV on the training set, unless stated otherwise:

KNN achieved an F1-score of ~0.877.
Logistic Regression delivered an F1-score of ~0.830.
GaussianNB, which was fitted without tuning, reached a test-set accuracy of 84%.
SVM reached an F1-score of ~0.917.
Decision Tree reached an F1-score of ~0.907.
Random Forest led with an F1-score of ~0.923.
AdaBoost achieved an F1-score of ~0.891.
XGBoost followed closely with an F1-score of ~0.922 and a test-set accuracy of 93.7%.

Key Observations:

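Random Forest and XGBoost delivered the best scores, with cross-validated F1-scores of ~0.92.
With RandomizedSearchCV, tuning took minutes, compared with the approximately 12 hours that GridSearchCV needed on a dataset of over 129,000 records. Because the runs above use the reduced 4,999-record file, part of that speed-up comes from the smaller dataset.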
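
A comparison like the one above can be assembled programmatically by looping over the models and collecting each search's best score. The sketch below does this for three of the models on synthetic data; the search spaces are trimmed versions of the article's, shortened so the example runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=2)

# Model name -> (estimator, search space).
candidates = {
    "KNN": (KNeighborsClassifier(),
            {"n_neighbors": [4, 5, 6, 7], "weights": ["uniform", "distance"]}),
    "Decision Tree": (DecisionTreeClassifier(random_state=0),
                      {"max_leaf_nodes": list(range(2, 30))}),
    "Random Forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [25, 50], "max_features": [2, 3]}),
}

rows = []
for name, (estimator, space) in candidates.items():
    search = RandomizedSearchCV(estimator, space, n_iter=4, cv=cv,
                                scoring="f1", random_state=0)
    search.fit(X_demo, y_demo)
    rows.append({"model": name, "best_cv_f1": search.best_score_,
                 "best_params": search.best_params_})

# Rank the models by their best cross-validated F1-score.
summary = (pd.DataFrame(rows)
           .sort_values("best_cv_f1", ascending=False)
           .reset_index(drop=True))
print(summary[["model", "best_cv_f1"]])
```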

Conclusion: When to Choose RandomizedSearchCV

While GridSearchCV offers exhaustive hyperparameter tuning, its computational demands can be prohibitive for large datasets. RandomizedSearchCV emerges as a pragmatic solution, balancing efficiency and performance. It is particularly advantageous when:

Time is a Constraint: Rapid model tuning is essential.
Computational Resources are Limited: Reduces the burden on system resources.
High-Dimensional Hyperparameter Spaces: A fixed budget of sampled combinations remains feasible where a full grid would not.
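
A short calculation suggests why a modest number of random draws is often enough. Assuming each draw is independent and uniform over the search space, the chance that at least one of n draws lands among the best 5% of configurations is 1 - 0.95^n, regardless of how many hyperparameters there are. The function name below is illustrative.

```python
def hit_probability(n_draws: int, top_fraction: float = 0.05) -> float:
    """Chance that at least one of n_draws falls in the best top_fraction of configurations."""
    return 1.0 - (1.0 - top_fraction) ** n_draws


for n in (10, 30, 60):
    print(n, round(hit_probability(n), 3))
# 10 0.401
# 30 0.785
# 60 0.954
```

Under that assumption, 60 sampled candidates find a top-5% configuration about 95% of the time, whether the full grid holds 48 combinations or 9,720.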

Adopting RandomizedSearchCV can streamline the machine learning workflow, enabling practitioners to focus on model interpretation and deployment rather than lengthy tuning procedures.

Resources and Further Reading

By leveraging RandomizedSearchCV, machine learning practitioners can achieve efficient and effective model tuning, ensuring scalable and high-performing solutions in data-driven applications.
