Evaluating Machine Learning Models with ROC Curves and AUC: A Comprehensive Guide
In the realm of machine learning, selecting the right model for your dataset is crucial for achieving accurate and reliable predictions. One of the most effective ways to evaluate and compare models is through the Receiver Operating Characteristic (ROC) Curve and the Area Under the Curve (AUC). This guide delves deep into understanding ROC curves, calculating AUC, and leveraging these metrics to choose the best-performing model for your binary classification tasks. We’ll walk through a practical example using a Jupyter Notebook, demonstrating how to implement these concepts using various machine learning algorithms.
Table of Contents
- Introduction to ROC Curve and AUC
- Why AUC Over Accuracy?
- Dataset Overview
- Data Preprocessing
- Model Training and Evaluation
- Choosing the Best Model
- Conclusion
- Resources
Introduction to ROC Curve and AUC
What is a ROC Curve?
A Receiver Operating Characteristic (ROC) Curve is a graphical representation that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold varies. The ROC curve plots two parameters:
- True Positive Rate (TPR): Also known as sensitivity or recall, it measures the proportion of actual positives correctly identified: TPR = TP / (TP + FN).
- False Positive Rate (FPR): It measures the proportion of actual negatives incorrectly identified as positive: FPR = FP / (FP + TN).
The ROC curve visualizes the trade-off between sensitivity (TPR) and specificity (which equals 1 − FPR) across different threshold settings.
Understanding AUC
Area Under the Curve (AUC) quantifies the overall ability of the model to discriminate between positive and negative classes. The AUC value ranges from 0 to 1:
- AUC = 1: Perfect classifier.
- AUC = 0.5: No discrimination (equivalent to random guessing).
- AUC < 0.5: Inversely predictive (worse than random).
A higher AUC indicates a better performing model.
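Both quantities can be computed directly with scikit-learn. Below is a minimal sketch using made-up labels and scores (the arrays are ours, purely for illustration): `roc_curve` returns the (FPR, TPR) pairs traced out as the decision threshold varies, and `roc_auc_score` summarizes them as a single number.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy example: true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)

# The area under that curve, as a single scalar (~0.89 for these scores)
print(roc_auc_score(y_true, y_scores))
```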
Why AUC Over Accuracy?
While accuracy measures the proportion of correct predictions out of all predictions made, it can be misleading, especially in cases of class imbalance. For instance, if 95% of the data belongs to one class, a model predicting only that class will achieve 95% accuracy but fail to capture the minority class.
AUC, on the other hand, provides a more nuanced evaluation by considering the model’s performance across all classification thresholds, making it a more reliable metric for imbalanced datasets.
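To make this concrete, here is a small sketch (the synthetic labels are ours): a majority-class baseline on a 95/5 split reaches roughly 95% accuracy, yet its AUC is 0.5, exposing that it has no discriminative power at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic imbalanced labels: about 95% negative, 5% positive
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.05).astype(int)
X = rng.random((1000, 3))  # features are irrelevant to this baseline

# A baseline that always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)

print(accuracy_score(y, baseline.predict(X)))             # ~0.95
print(roc_auc_score(y, baseline.predict_proba(X)[:, 1]))  # 0.5: random-level
```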
Dataset Overview
For our analysis, we’ll utilize the Weather Dataset from Kaggle. This dataset contains various weather-related attributes recorded daily across different Australian locations.
Objective: Predict whether it will rain tomorrow (`RainTomorrow`) based on today’s weather conditions.
Type: Binary classification (`Yes`/`No`).
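Because the case for AUC rests on class imbalance, it is worth checking how skewed the target actually is before modeling. A quick sketch (assuming the CSV has been downloaded as `weatherAUS.csv`):

```python
import pandas as pd

data = pd.read_csv('weatherAUS.csv')

# Relative frequency of each class in the target column
print(data['RainTomorrow'].value_counts(normalize=True))
# 'No' days heavily outnumber 'Yes' days, so the target is imbalanced
```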
Data Preprocessing
Effective data preprocessing is the cornerstone of building robust machine learning models. Here’s a step-by-step breakdown:
1. Importing Libraries and Data
```python
import pandas as pd
import seaborn as sns

# Load the dataset
data = pd.read_csv('weatherAUS.csv')
data.tail()
```
2. Separating Features and Target
```python
# Features (all columns except the last one)
X = data.iloc[:, :-1]

# Target variable (the last column, RainTomorrow)
y = data.iloc[:, -1]
```
3. Handling Missing Data
a. Numeric Features
```python
import numpy as np
from sklearn.impute import SimpleImputer

# Identify numeric columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Impute missing values with the column mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X.iloc[:, numerical_cols] = imp_mean.fit_transform(X.iloc[:, numerical_cols])
```
b. Categorical Features
```python
# Identify object (categorical) columns
string_cols = list(np.where(X.dtypes == object)[0])

# Impute missing values with the most frequent value
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X.iloc[:, string_cols] = imp_mode.fit_transform(X.iloc[:, string_cols])
```
4. Encoding Categorical Variables
a. Label Encoding for Target
```python
from sklearn.preprocessing import LabelEncoder

# Encode the 'Yes'/'No' target labels as 1/0
le = LabelEncoder()
y = le.fit_transform(y)
```
b. Encoding Features
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the given column indices, passing the rest through
def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)],
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)

# Choose an encoding per column based on its number of unique categories:
# label-encode binary and high-cardinality columns, one-hot encode the rest
def EncodingSelection(X, threshold=10):
    string_cols = list(np.where(X.dtypes == object)[0])
    one_hot_encoding_indices = []
    for col in string_cols:
        unique_vals = len(pd.unique(X[X.columns[col]]))
        if unique_vals == 2 or unique_vals > threshold:
            # Reuses the LabelEncoder created in the previous step
            X[X.columns[col]] = le.fit_transform(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

X = EncodingSelection(X)
```
5. Feature Selection
To reduce model complexity and improve performance, we’ll select the top 10 features using the Chi-Squared (Chi2) test.
```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Initialize SelectKBest with the chi-squared test
kbest = SelectKBest(score_func=chi2, k=10)
scaler = MinMaxScaler()

# Chi2 requires non-negative features, so scale to [0, 1] before scoring
X_scaled = scaler.fit_transform(X)

# Fit SelectKBest
kbest.fit(X_scaled, y)

# Get the indices of the 10 highest-scoring features
best_features = np.argsort(kbest.scores_)[-10:]

# Select the top features
X = X[:, best_features]
```
6. Splitting the Dataset
```python
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)
```
7. Feature Scaling
Standardizing the features ensures that no single feature dominates models that are sensitive to scale, such as KNN and SVM. We pass `with_mean=False` because the one-hot encoding step can produce a sparse matrix, which cannot be mean-centered without densifying it.
```python
from sklearn.preprocessing import StandardScaler

# with_mean=False: scale variances only, keeping sparse matrices sparse
sc = StandardScaler(with_mean=False)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```
Model Training and Evaluation
We’ll train several classification models and evaluate their performance using both Accuracy and AUC.
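Every model below follows the same pattern: fit, predict, print accuracy, plot the ROC curve. If you prefer to keep that pattern in one place, a small helper such as the following works; this is our own sketch, not part of the original notebook (`RocCurveDisplay.from_estimator` is scikit-learn's current ROC plotting API):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, RocCurveDisplay

def evaluate(model, name, X_test, y_test):
    """Print test accuracy and plot the ROC curve for an already-fitted classifier."""
    y_pred = model.predict(X_test)
    print(f'{name} Accuracy: {accuracy_score(y_test, y_pred):.2f}')
    RocCurveDisplay.from_estimator(model, X_test, y_test)
    plt.title(f'{name} ROC Curve')
    plt.show()
```

For clarity, each model section below still spells the steps out explicitly.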
K-Nearest Neighbors (KNN)
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, RocCurveDisplay
import matplotlib.pyplot as plt

# Initialize and train KNN
knnClassifier = KNeighborsClassifier(n_neighbors=3)
knnClassifier.fit(X_train, y_train)

# Predict and evaluate
y_pred_knn = knnClassifier.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
print(f'KNN Accuracy: {accuracy_knn:.2f}')

# Plot the ROC curve (metrics.plot_roc_curve was removed in scikit-learn 1.2)
RocCurveDisplay.from_estimator(knnClassifier, X_test, y_test)
plt.title('KNN ROC Curve')
plt.show()
```
Output:
```
KNN Accuracy: 0.82
```

Logistic Regression
```python
from sklearn.linear_model import LogisticRegression

# Initialize and train Logistic Regression
LRM = LogisticRegression(random_state=0, max_iter=200)
LRM.fit(X_train, y_train)

# Predict and evaluate
y_pred_lr = LRM.predict(X_test)
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print(f'Logistic Regression Accuracy: {accuracy_lr:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(LRM, X_test, y_test)
plt.title('Logistic Regression ROC Curve')
plt.show()
```
Output:
```
Logistic Regression Accuracy: 0.84
```

Note: If you encounter a convergence warning, consider increasing `max_iter` or standardizing your data.
Gaussian Naive Bayes
```python
from sklearn.naive_bayes import GaussianNB

# Initialize and train GaussianNB
model_GNB = GaussianNB()
model_GNB.fit(X_train, y_train)

# Predict and evaluate
y_pred_gnb = model_GNB.predict(X_test)
accuracy_gnb = accuracy_score(y_test, y_pred_gnb)
print(f'Gaussian Naive Bayes Accuracy: {accuracy_gnb:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_GNB, X_test, y_test)
plt.title('Gaussian Naive Bayes ROC Curve')
plt.show()
```
Output:
```
Gaussian Naive Bayes Accuracy: 0.81
```

Support Vector Machine (SVM)
```python
from sklearn.svm import SVC

# Initialize and train SVM (probability=True enables predict_proba)
model_SVC = SVC(probability=True)
model_SVC.fit(X_train, y_train)

# Predict and evaluate
y_pred_svc = model_SVC.predict(X_test)
accuracy_svc = accuracy_score(y_test, y_pred_svc)
print(f'SVM Accuracy: {accuracy_svc:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_SVC, X_test, y_test)
plt.title('SVM ROC Curve')
plt.show()
```
Output:
```
SVM Accuracy: 0.84
```

Decision Tree
```python
from sklearn.tree import DecisionTreeClassifier

# Initialize and train Decision Tree
model_DTC = DecisionTreeClassifier()
model_DTC.fit(X_train, y_train)

# Predict and evaluate
y_pred_dtc = model_DTC.predict(X_test)
accuracy_dtc = accuracy_score(y_test, y_pred_dtc)
print(f'Decision Tree Accuracy: {accuracy_dtc:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_DTC, X_test, y_test)
plt.title('Decision Tree ROC Curve')
plt.show()
```
Output:
```
Decision Tree Accuracy: 0.78
```

Random Forest
```python
from sklearn.ensemble import RandomForestClassifier

# Initialize and train Random Forest
model_RFC = RandomForestClassifier(n_estimators=500, max_depth=5)
model_RFC.fit(X_train, y_train)

# Predict and evaluate
y_pred_rfc = model_RFC.predict(X_test)
accuracy_rfc = accuracy_score(y_test, y_pred_rfc)
print(f'Random Forest Accuracy: {accuracy_rfc:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_RFC, X_test, y_test)
plt.title('Random Forest ROC Curve')
plt.show()
```
Output:
```
Random Forest Accuracy: 0.84
```

AdaBoost
```python
from sklearn.ensemble import AdaBoostClassifier

# Initialize and train AdaBoost
model_ABC = AdaBoostClassifier()
model_ABC.fit(X_train, y_train)

# Predict and evaluate
y_pred_abc = model_ABC.predict(X_test)
accuracy_abc = accuracy_score(y_test, y_pred_abc)
print(f'AdaBoost Accuracy: {accuracy_abc:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_ABC, X_test, y_test)
plt.title('AdaBoost ROC Curve')
plt.show()
```
Output:
```
AdaBoost Accuracy: 0.84
```

XGBoost
```python
import warnings

import xgboost as xgb
from sklearn.exceptions import ConvergenceWarning

# Suppress warnings
warnings.filterwarnings("ignore", category=ConvergenceWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# Initialize and train XGBoost
# (use_label_encoder is deprecated and ignored in recent XGBoost releases)
model_xgb = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
model_xgb.fit(X_train, y_train)

# Predict and evaluate
y_pred_xgb = model_xgb.predict(X_test)
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print(f'XGBoost Accuracy: {accuracy_xgb:.2f}')

# Plot the ROC curve
RocCurveDisplay.from_estimator(model_xgb, X_test, y_test)
plt.title('XGBoost ROC Curve')
plt.show()
```
Output:
```
XGBoost Accuracy: 0.85
```

Choosing the Best Model
After evaluating all the models, we observe the following accuracies and AUC values:
| Model | Accuracy | AUC |
|---|---|---|
| K-Nearest Neighbors | 0.82 | 0.80 |
| Logistic Regression | 0.84 | 0.86 |
| Gaussian Naive Bayes | 0.81 | 0.81 |
| SVM | 0.84 | 0.86 |
| Decision Tree | 0.78 | 0.89 |
| Random Forest | 0.84 | 0.85 |
| AdaBoost | 0.84 | 0.86 |
| XGBoost | 0.85 | 0.87 |
Key Observations:
- XGBoost emerges as the top performer with the highest accuracy (85%) and a strong AUC (0.87).
- Logistic Regression, SVM, and AdaBoost also demonstrate commendable performance with accuracies around 84% and AUCs of 0.86.
- Decision Tree shows the lowest accuracy (78%) but a relatively high AUC (0.89), suggesting that its scores separate the classes well even though its default-threshold predictions are less accurate.
Takeaway: While accuracy provides a straightforward metric, AUC offers deeper insight into the model’s performance across various thresholds. In this scenario, XGBoost stands out as the most reliable model, balancing high accuracy with strong discriminative ability.
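A side-by-side view often makes this comparison easier than eight separate plots. Here is a sketch that overlays every ROC curve on one axes, assuming the fitted models from the previous section are still in scope:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

models = {
    'KNN': knnClassifier,
    'Logistic Regression': LRM,
    'Gaussian Naive Bayes': model_GNB,
    'SVM': model_SVC,
    'Decision Tree': model_DTC,
    'Random Forest': model_RFC,
    'AdaBoost': model_ABC,
    'XGBoost': model_xgb,
}

# Overlay every model's ROC curve on the same axes
fig, ax = plt.subplots(figsize=(8, 6))
for name, model in models.items():
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
ax.plot([0, 1], [0, 1], 'k--', label='Chance (AUC = 0.5)')
ax.legend(loc='lower right')
plt.show()
```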
Conclusion
Evaluating machine learning models requires a multifaceted approach. Relying solely on accuracy can be misleading, especially in datasets with class imbalances. ROC curves and AUC provide a more comprehensive assessment of a model’s performance, highlighting its ability to distinguish between classes effectively.
In this guide, we explored how to preprocess data, train multiple classification models, and evaluate them using ROC curves and AUC. The practical implementation using a Jupyter Notebook showcased the strengths of each model, ultimately demonstrating that XGBoost was the superior choice for predicting rainfall based on the provided dataset.
Resources
- ROC Curve Wikipedia
- AUC Explained
- Kaggle Weather Dataset
- Scikit-Learn Documentation
- XGBoost Documentation
By understanding and utilizing ROC curves and AUC, data scientists and machine learning practitioners can make more informed decisions when selecting models, ensuring higher performance and reliability in their predictive tasks.