Evaluating Machine Learning Models with ROC Curves and AUC: A Comprehensive Guide
In the realm of machine learning, selecting the right model for your dataset is crucial for achieving accurate and reliable predictions. One of the most effective ways to evaluate and compare models is through the Receiver Operating Characteristic (ROC) Curve and the Area Under the Curve (AUC). This guide delves deep into understanding ROC curves, calculating AUC, and leveraging these metrics to choose the best-performing model for your binary classification tasks. We’ll walk through a practical example using a Jupyter Notebook, demonstrating how to implement these concepts using various machine learning algorithms.
Table of Contents
- Introduction to ROC Curve and AUC
- Why AUC Over Accuracy?
- Dataset Overview
- Data Preprocessing
- Model Training and Evaluation
- Choosing the Best Model
- Conclusion
- Resources
Introduction to ROC Curve and AUC
What is a ROC Curve?
A Receiver Operating Characteristic (ROC) Curve is a graphical representation that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold varies. The ROC curve plots two parameters:
- True Positive Rate (TPR): Also known as sensitivity or recall, it measures the proportion of actual positives correctly identified: TPR = TP / (TP + FN).
- False Positive Rate (FPR): It measures the proportion of actual negatives incorrectly identified as positive: FPR = FP / (FP + TN).
The ROC curve visualizes the trade-off between sensitivity (TPR) and specificity (which equals 1 − FPR) across different threshold settings.
Understanding AUC
Area Under the Curve (AUC) quantifies the overall ability of the model to discriminate between positive and negative classes. The AUC value ranges from 0 to 1:
- AUC = 1: Perfect classifier.
- AUC = 0.5: No discrimination (equivalent to random guessing).
- AUC < 0.5: Inversely predictive (worse than random).
A higher AUC indicates a better performing model.
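Both quantities can be computed directly with scikit-learn. Below is a minimal sketch using made-up labels and scores (the arrays are ours, purely for illustration): `roc_curve` returns the (FPR, TPR) pairs traced out as the decision threshold varies, and `roc_auc_score` summarizes them as a single number.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy example: true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)

# The area under that curve, as a single scalar (~0.89 for these scores)
print(roc_auc_score(y_true, y_scores))
```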
Why AUC Over Accuracy?
While accuracy measures the proportion of correct predictions out of all predictions made, it can be misleading, especially in cases of class imbalance. For instance, if 95% of the data belongs to one class, a model predicting only that class will achieve 95% accuracy but fail to capture the minority class.
AUC, on the other hand, provides a more nuanced evaluation by considering the model’s performance across all classification thresholds, making it a more reliable metric for imbalanced datasets.
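To make this concrete, here is a small sketch (the synthetic labels are ours): a majority-class baseline on a 95/5 split reaches roughly 95% accuracy, yet its AUC is 0.5, exposing that it has no discriminative power at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic imbalanced labels: about 95% negative, 5% positive
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.05).astype(int)
X = rng.random((1000, 3))  # features are irrelevant to this baseline

# A baseline that always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)

print(accuracy_score(y, baseline.predict(X)))             # ~0.95
print(roc_auc_score(y, baseline.predict_proba(X)[:, 1]))  # 0.5: random-level
```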
Dataset Overview
For our analysis, we’ll utilize the Weather Dataset from Kaggle. This dataset contains various weather-related attributes recorded daily across different Australian locations.
Objective: Predict whether it will rain tomorrow (`RainTomorrow`) based on today’s weather conditions.
Type: Binary classification (`Yes`/`No`).
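Because the case for AUC rests on class imbalance, it is worth checking how skewed the target actually is before modeling. A quick sketch (assuming the CSV has been downloaded as `weatherAUS.csv`):

```python
import pandas as pd

data = pd.read_csv('weatherAUS.csv')

# Relative frequency of each class in the target column
print(data['RainTomorrow'].value_counts(normalize=True))
# 'No' days heavily outnumber 'Yes' days, so the target is imbalanced
```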
Data Preprocessing
Effective data preprocessing is the cornerstone of building robust machine learning models. Here’s a step-by-step breakdown:
1. Importing Libraries and Data
```python
import pandas as pd
import seaborn as sns

# Load the dataset
data = pd.read_csv('weatherAUS.csv')
data.tail()
```
2. Separating Features and Target
```python
# Features (all columns except the last one)
X = data.iloc[:, :-1]

# Target variable (the last column, RainTomorrow)
y = data.iloc[:, -1]
```
3. Handling Missing Data
a. Numeric Features
```python
import numpy as np
from sklearn.impute import SimpleImputer

# Identify numeric columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Impute missing values with the column mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X.iloc[:, numerical_cols] = imp_mean.fit_transform(X.iloc[:, numerical_cols])
```
b. Categorical Features
```python
# Identify object (categorical) columns
string_cols = list(np.where(X.dtypes == object)[0])

# Impute missing values with the most frequent value
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X.iloc[:, string_cols] = imp_mode.fit_transform(X.iloc[:, string_cols])
```
4. Encoding Categorical Variables
a. Label Encoding for Target
```python
from sklearn.preprocessing import LabelEncoder

# Encode the 'Yes'/'No' target labels as 1/0
le = LabelEncoder()
y = le.fit_transform(y)
```
b. Encoding Features
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the given column indices, passing the rest through
def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)],
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)

# Choose an encoding per column based on its number of unique categories:
# label-encode binary and high-cardinality columns, one-hot encode the rest
def EncodingSelection(X, threshold=10):
    string_cols = list(np.where(X.dtypes == object)[0])
    one_hot_encoding_indices = []
    for col in string_cols:
        unique_vals = len(pd.unique(X[X.columns[col]]))
        if unique_vals == 2 or unique_vals > threshold:
            # Reuses the LabelEncoder created in the previous step
            X[X.columns[col]] = le.fit_transform(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

X = EncodingSelection(X)
```
5. Feature Selection
To reduce model complexity and improve performance, we’ll select the top 10 features using the Chi-Squared (Chi2) test.
```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Initialize SelectKBest with the chi-squared test
kbest = SelectKBest(score_func=chi2, k=10)
scaler = MinMaxScaler()

# Chi2 requires non-negative features, so scale to [0, 1] before scoring
X_scaled = scaler.fit_transform(X)

# Fit SelectKBest
kbest.fit(X_scaled, y)

# Get the indices of the 10 highest-scoring features
best_features = np.argsort(kbest.scores_)[-10:]

# Select the top features
X = X[:, best_features]
```
6. Splitting the Dataset
```python
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)
```
7. Feature Scaling
Standardizing the features ensures that no single feature dominates models that are sensitive to scale, such as KNN and SVM. We pass `with_mean=False` because the one-hot encoding step can produce a sparse matrix, which cannot be mean-centered without densifying it.
```python
from sklearn.preprocessing import StandardScaler

# with_mean=False: scale variances only, keeping sparse matrices sparse
sc = StandardScaler(with_mean=False)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```
Model Training and Evaluation
We’ll train several classification models and evaluate their performance using both Accuracy and AUC.
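Every model below follows the same pattern: fit, predict, print accuracy, plot the ROC curve. If you prefer to keep that pattern in one place, a small helper such as the following works; this is our own sketch, not part of the original notebook (`RocCurveDisplay.from_estimator` is scikit-learn's current ROC plotting API):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, RocCurveDisplay

def evaluate(model, name, X_test, y_test):
    """Print test accuracy and plot the ROC curve for an already-fitted classifier."""
    y_pred = model.predict(X_test)
    print(f'{name} Accuracy: {accuracy_score(y_test, y_pred):.2f}')
    RocCurveDisplay.from_estimator(model, X_test, y_test)
    plt.title(f'{name} ROC Curve')
    plt.show()
```

For clarity, each model section below still spells the steps out explicitly.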
K-Nearest Neighbors (KNN)
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, RocCurveDisplay
import matplotlib.pyplot as plt

# Initialize and train KNN
knnClassifier = KNeighborsClassifier(n_neighbors=3)
knnClassifier.fit(X_train, y_train)

# Predict and evaluate
y_pred_knn = knnClassifier.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f'KNN Accuracy: {accuracy_knn:.2f}')

# Plot the ROC curve (metrics.plot_roc_curve was removed in scikit-learn 1.2)
RocCurveDisplay.from_estimator(knnClassifier, X_test, y_test)
plt.title('KNN ROC Curve')
plt.show()
```
Output:
```
KNN Accuracy: 0.82
```

Logistic Regression
```python
from sklearn.linear_model import LogisticRegression

# Initialize and train Logistic Regression
LRM = LogisticRegression(random_state=0, max_iter=200)
LRM.fit(X_train, y_train)

# Predict and evaluate
y_pred_lr = LRM.predict(X_test)
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f'Logistic Regression Accuracy: {accuracy_lr:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(LRM, X_test, y_test)
plt.title('Logistic Regression ROC Curve')
plt.show()
```
Output:
```
Logistic Regression Accuracy: 0.84
```

Note: If you encounter a convergence warning, consider increasing `max_iter` or standardizing your data.
Gaussian Naive Bayes
```python
from sklearn.naive_bayes import GaussianNB

# Initialize and train GaussianNB
model_GNB = GaussianNB()
model_GNB.fit(X_train, y_train)

# Predict and evaluate
y_pred_gnb = model_GNB.predict(X_test)
accuracy_gnb = accuracy_score(y_test, y_pred_gnb)
print(f'Gaussian Naive Bayes Accuracy: {accuracy_gnb:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_GNB, X_test, y_test)
plt.title('Gaussian Naive Bayes ROC Curve')
plt.show()
```
Output:
```
Gaussian Naive Bayes Accuracy: 0.81
```

Support Vector Machine (SVM)
```python
from sklearn.svm import SVC

# Initialize and train SVM (probability=True enables predict_proba)
model_SVC = SVC(probability=True)
model_SVC.fit(X_train, y_train)

# Predict and evaluate
y_pred_svc = model_SVC.predict(X_test)
accuracy_svc = accuracy_score(y_test, y_pred_svc)
print(f'SVM Accuracy: {accuracy_svc:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_SVC, X_test, y_test)
plt.title('SVM ROC Curve')
plt.show()
```
Output:
```
SVM Accuracy: 0.84
```

Decision Tree
```python
from sklearn.tree import DecisionTreeClassifier

# Initialize and train Decision Tree
model_DTC = DecisionTreeClassifier()
model_DTC.fit(X_train, y_train)

# Predict and evaluate
y_pred_dtc = model_DTC.predict(X_test)
accuracy_dtc = accuracy_score(y_test, y_pred_dtc)
print(f'Decision Tree Accuracy: {accuracy_dtc:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_DTC, X_test, y_test)
plt.title('Decision Tree ROC Curve')
plt.show()
```
Output:
```
Decision Tree Accuracy: 0.78
```

Random Forest
```python
from sklearn.ensemble import RandomForestClassifier

# Initialize and train Random Forest
model_RFC = RandomForestClassifier(n_estimators=500, max_depth=5)
model_RFC.fit(X_train, y_train)

# Predict and evaluate
y_pred_rfc = model_RFC.predict(X_test)
accuracy_rfc = accuracy_score(y_test, y_pred_rfc)
print(f'Random Forest Accuracy: {accuracy_rfc:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_RFC, X_test, y_test)
plt.title('Random Forest ROC Curve')
plt.show()
```
Output:
```
Random Forest Accuracy: 0.84
```

AdaBoost
```python
from sklearn.ensemble import AdaBoostClassifier

# Initialize and train AdaBoost
model_ABC = AdaBoostClassifier()
model_ABC.fit(X_train, y_train)

# Predict and evaluate
y_pred_abc = model_ABC.predict(X_test)
accuracy_abc = accuracy_score(y_test, y_pred_abc)
print(f'AdaBoost Accuracy: {accuracy_abc:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_ABC, X_test, y_test)
plt.title('AdaBoost ROC Curve')
plt.show()
```
Output:
```
AdaBoost Accuracy: 0.84
```

XGBoost
```python
import warnings

import xgboost as xgb
from sklearn.exceptions import ConvergenceWarning

# Suppress warnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Initialize and train XGBoost
# (use_label_encoder is deprecated and ignored in recent XGBoost releases)
model_xgb = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
model_xgb.fit(X_train, y_train)

# Predict and evaluate
y_pred_xgb = model_xgb.predict(X_test)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print(f'XGBoost Accuracy: {accuracy_xgb:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_xgb, X_test, y_test)
plt.title('XGBoost ROC Curve')
plt.show()
```
Output:
```
XGBoost Accuracy: 0.85
```

Choosing the Best Model
After evaluating all the models, we observe the following accuracies and AUC values:
| Model | Accuracy | AUC |
|---|---|---|
| K-Nearest Neighbors | 0.82 | 0.80 |
| Logistic Regression | 0.84 | 0.86 |
| Gaussian Naive Bayes | 0.81 | 0.81 |
| SVM | 0.84 | 0.86 |
| Decision Tree | 0.78 | 0.89 |
| Random Forest | 0.84 | 0.85 |
| AdaBoost | 0.84 | 0.86 |
| XGBoost | 0.85 | 0.87 |
Key Observations:
- XGBoost emerges as the top performer with the highest accuracy (85%) and a strong AUC (0.87).
- Logistic Regression, SVM, and AdaBoost also demonstrate commendable performance with accuracies around 84% and AUCs of 0.86.
- Decision Tree shows the lowest accuracy (78%) but a relatively high AUC (0.89), suggesting that its scores separate the classes well even though its default-threshold predictions are less accurate.
Takeaway: While accuracy provides a straightforward metric, AUC offers deeper insight into the model’s performance across various thresholds. In this scenario, XGBoost stands out as the most reliable model, balancing high accuracy with strong discriminative ability.
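A side-by-side view often makes this comparison easier than eight separate plots. Here is a sketch that overlays every ROC curve on one axes, assuming the fitted models from the previous section are still in scope:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

models = {
    'KNN': knnClassifier,
    'Logistic Regression': LRM,
    'Gaussian Naive Bayes': model_GNB,
    'SVM': model_SVC,
    'Decision Tree': model_DTC,
    'Random Forest': model_RFC,
    'AdaBoost': model_ABC,
    'XGBoost': model_xgb,
}

# Overlay every model's ROC curve on the same axes
fig, ax = plt.subplots(figsize=(8, 6))
for name, model in models.items():
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
ax.plot([0, 1], [0, 1], 'k--', label='Chance (AUC = 0.5)')
ax.legend(loc='lower right')
plt.show()
```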
Conclusion
Evaluating machine learning models requires a multifaceted approach. Relying solely on accuracy can be misleading, especially in datasets with class imbalances. ROC curves and AUC provide a more comprehensive assessment of a model’s performance, highlighting its ability to distinguish between classes effectively.
In this guide, we explored how to preprocess data, train multiple classification models, and evaluate them using ROC curves and AUC. The practical implementation using a Jupyter Notebook showcased the strengths of each model, ultimately demonstrating that XGBoost was the superior choice for predicting rainfall based on the provided dataset.
Resources
- ROC Curve Wikipedia
- AUC Explained
- Kaggle Weather Dataset
- Scikit-Learn Documentation
- XGBoost Documentation
By understanding and utilizing ROC curves and AUC, data scientists and machine learning practitioners can make more informed decisions when selecting models, ensuring higher performance and reliability in their predictive tasks.