Mastering Model Comparison with CAP Curves in Python: A Comprehensive Guide

In machine learning, choosing the best-performing model for a dataset is rarely obvious when many algorithms are available. Cumulative Accuracy Profile (CAP) curves are a visual tool that makes it easier to compare multiple models side by side. In this guide, we’ll explain what CAP curves are, implement them in Python, and apply them to both binary and multiclass classification. Whether you’re a data enthusiast or a seasoned practitioner, the goal is to add one more practical technique to your model evaluation toolkit.

Understanding CAP Curves
Setting Up Your Environment
Data Preprocessing
Building and Evaluating Models
Generating CAP Curves
Multiclass Classification with CAP Curves
Best Practices and Tips
Conclusion

Understanding CAP Curves

Cumulative Accuracy Profile (CAP) curves are graphical tools used to evaluate the performance of classification models. They show how quickly a model captures positive instances compared with a random model. The curve plots the cumulative number of actual positives captured (y-axis) against the number of observations considered (x-axis), with observations ordered from the model’s most confident positive prediction to its least. The further a curve rises above the random baseline, the better the model separates positives from negatives, which makes CAP curves convenient for comparing different models.
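
As a rough illustration of the construction (a sketch with made-up scores and labels, not data from this guide): observations are ranked from the highest to the lowest model score, and the curve tracks how many actual positives have been captured after each one.

```python
import numpy as np

# Hypothetical model scores and true labels for six observations.
scores = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])

# Rank observations from most to least likely positive.
order = np.argsort(-scores, kind="stable")

# Cumulative count of actual positives captured, starting at zero.
cap_y = np.concatenate(([0], np.cumsum(labels[order])))
cap_x = np.arange(len(labels) + 1)

print(cap_y.tolist())  # [0, 1, 2, 2, 3, 3, 3]
```

A random ordering would, on average, climb along the straight line from (0, 0) to (6, 3), so this toy curve sits above the random baseline.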

Why Use CAP Curves?

Intuitive Visualization: Offers a clear visual comparison between models.
Performance Metrics: Highlights differences in identifying positive instances.
Versatility: Applicable to both binary and multiclass classification problems.

Setting Up Your Environment

Before diving into CAP curves, make sure your Python environment has the necessary libraries installed. We’ll be using pandas, numpy, scikit-learn, matplotlib, seaborn, and xgboost.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (LabelEncoder, OneHotEncoder, StandardScaler,
                                   MinMaxScaler)
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
import xgboost as xgb

Data Preprocessing

Data preprocessing is a critical step in machine learning workflows. It ensures that the data is clean, well-structured, and suitable for modeling.

Handling Missing Data

Missing data can skew results and reduce model accuracy. Here’s how to handle both numerical and categorical missing values:

# For numerical columns
import numpy as np
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns
imp_mean.fit(X[numerical_cols])
X[numerical_cols] = imp_mean.transform(X[numerical_cols])

# For categorical columns
from sklearn.impute import SimpleImputer

imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
categorical_cols = X.select_dtypes(include=['object']).columns
imp_mode.fit(X[categorical_cols])
X[categorical_cols] = imp_mode.transform(X[categorical_cols])

Encoding Categorical Variables

Most machine learning algorithms require numerical input. Encoding converts categorical variables into a numerical format.

One-Hot Encoding

Suitable for variables with more than two categories.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), indices)], remainder='passthrough')
    return columnTransformer.fit_transform(data)

X = OneHotEncoderMethod(categorical_cols, X)

Label Encoding

Suitable for categorical variables with two categories or variables with many categories where one-hot encoding may not be feasible.

from sklearn import preprocessing

def LabelEncoderMethod(series):
    le = preprocessing.LabelEncoder()
    le.fit(series)
    return le.transform(series)

# Apply label encoding to target variable
y = LabelEncoderMethod(y)

Feature Selection

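Feature selection can reduce overfitting, improve accuracy, and shorten training time. The chi-squared test used here requires non-negative feature values, which is why the features are first rescaled with MinMaxScaler.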

from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

# Scaling features
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Selecting top 5 features based on chi-squared test
kbest = SelectKBest(score_func=chi2, k=5)
kbest.fit(X_scaled, y)
best_features = kbest.get_support(indices=True)
X = X[:, best_features]

Feature Scaling

Scaling puts all features on a comparable range, so that no feature dominates distance-based or gradient-based models simply because of its units.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler(with_mean=False)
X = sc.fit_transform(X)
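
The model code in the next section relies on X_train, X_test, y_train, and y_test, which have not been created at this point. A minimal sketch of the split follows. The 80/20 ratio, the random_state value, and the stratify option are illustrative assumptions rather than values from this guide, and synthetic stand-in data is used so the snippet runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical): 100 rows, 5 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Hold out 20% of the rows for testing; stratify keeps the class
# proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)
```

With the real data, X and y would be the preprocessed features and the encoded target from the steps above.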

Building and Evaluating Models

With preprocessed data, it’s time to build various classification models and evaluate their performance.

K-Nearest Neighbors (KNN)

from sklearn.neighbors import KNeighborsClassifier

knnClassifier = KNeighborsClassifier(n_neighbors=3)
knnClassifier.fit(X_train, y_train)
y_pred_knn = knnClassifier.predict(X_test)
accuracy_knn = accuracy_score(y_pred_knn, y_test)
print(f'KNN Accuracy: {accuracy_knn}')

Logistic Regression

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(random_state=0, max_iter=200)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
accuracy_logreg = accuracy_score(y_pred_logreg, y_test)
print(f'Logistic Regression Accuracy: {accuracy_logreg}')

Note: You might encounter a ConvergenceWarning. To resolve this, consider increasing max_iter or selecting a different solver.

Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)
accuracy_gnb = accuracy_score(y_pred_gnb, y_test)
print(f'Gaussian Naive Bayes Accuracy: {accuracy_gnb}')

Support Vector Machine (SVM)

from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, y_train)
y_pred_svc = svc.predict(X_test)
accuracy_svc = accuracy_score(y_pred_svc, y_test)
print(f'SVM Accuracy: {accuracy_svc}')

Decision Tree

from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred_dtc = dtc.predict(X_test)
accuracy_dtc = accuracy_score(y_pred_dtc, y_test)
print(f'Decision Tree Accuracy: {accuracy_dtc}')

Random Forest

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=500, max_depth=5)
rfc.fit(X_train, y_train)
y_pred_rfc = rfc.predict(X_test)
accuracy_rfc = accuracy_score(y_pred_rfc, y_test)
print(f'Random Forest Accuracy: {accuracy_rfc}')

AdaBoost

from sklearn.ensemble import AdaBoostClassifier

abc = AdaBoostClassifier()
abc.fit(X_train, y_train)
y_pred_abc = abc.predict(X_test)
accuracy_abc = accuracy_score(y_pred_abc, y_test)
print(f'AdaBoost Accuracy: {accuracy_abc}')

XGBoost

import xgboost as xgb

xgb_classifier = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
xgb_classifier.fit(X_train, y_train)
y_pred_xgb = xgb_classifier.predict(X_test)
accuracy_xgb = accuracy_score(y_pred_xgb, y_test)
print(f'XGBoost Accuracy: {accuracy_xgb}')

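Note: Older XGBoost versions emit warnings about label encoding and the default evaluation metric; the parameters shown above suppress them. In recent XGBoost releases, use_label_encoder is deprecated and can be omitted. For a binary target, 'logloss' is the binary counterpart of 'mlogloss'.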
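
The per-model blocks above repeat the same fit, predict, and score steps. As a sketch, they could be collapsed into a loop over a dictionary of models. The example below uses synthetic data from make_classification and only three of the models so that it stays short and self-contained; the dataset, the split, and the random_state values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the snippet runs without the guide's dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "GaussianNB": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # accuracy_score takes the true labels first, then the predictions.
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from most to least accurate.
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Keeping the fitted models in one dictionary also makes it straightforward to loop over them again when drawing one CAP curve per model.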

Generating CAP Curves

CAP curves provide a visual means to compare the performance of different models. Here’s how to generate them:

Defining the CAP Generation Function

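def CAP_gen(model, X_test, y_test):
    # Rank test observations by predicted label, highest first.
    # Sorting (prediction, actual) tuples breaks ties on the actual label,
    # so within each predicted class the higher actual labels come first.
    pred = model.predict(X_test)
    ranked = sorted(zip(pred, y_test), reverse=True)
    # Accumulate the actual labels in ranked order.
    actual = [o for _, o in ranked]
    y_values = np.append([0], np.cumsum(actual))
    x_values = np.arange(0, len(y_test) + 1)
    return x_values, y_values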
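
CAP curves are commonly drawn from a ranking by predicted probability rather than by hard class labels, since labels leave large blocks of tied observations. A minimal sketch of that variant for the binary case follows. The function name cap_from_scores is hypothetical, and it assumes the scores come from something like model.predict_proba(X_test)[:, 1].

```python
import numpy as np

def cap_from_scores(scores, y_true):
    """Return CAP x/y values for binary labels ranked by descending score."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true)
    # Stable sort keeps the original order among tied scores.
    order = np.argsort(-scores, kind="stable")
    y_values = np.concatenate(([0], np.cumsum(y_true[order])))
    x_values = np.arange(len(y_true) + 1)
    return x_values, y_values

# Toy example with made-up scores.
x_vals, y_vals = cap_from_scores([0.2, 0.9, 0.5, 0.7], [0, 1, 0, 1])
print(y_vals.tolist())  # [0, 1, 2, 2, 2]
```

Note that SVC() as configured in this guide does not expose predict_proba unless it is created with probability=True.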

Plotting the CAP Curves

import matplotlib.pyplot as plt

total = len(y_test)
sum_count = np.sum(y_test)

plt.figure(figsize=(10, 6))

# Generate CAP for GaussianNB
x_gnb, y_gnb = CAP_gen(gnb, X_test, y_test)
plt.plot(x_gnb, y_gnb, linewidth=3, label='GaussianNB')

# Generate CAP for XGBoost
x_xgb, y_xgb = CAP_gen(xgb_classifier, X_test, y_test)
plt.plot(x_xgb, y_xgb, linewidth=3, label='XGBoost')

# Optional: Add more models
# x_abc, y_abc = CAP_gen(abc, X_test, y_test)
# plt.plot(x_abc, y_abc, linewidth=3, label='AdaBoost')

# x_rfc, y_rfc = CAP_gen(rfc, X_test, y_test)
# plt.plot(x_rfc, y_rfc, linewidth=3, label='Random Forest')

# Random Model line
plt.plot([0, total], [0, sum_count], linestyle='--', label='Random Model')

# Plot aesthetics
plt.xlabel('Total Observations', fontsize=16)
plt.ylabel('CAP Values', fontsize=16)
plt.title('Cumulative Accuracy Profile', fontsize=16)
plt.legend(loc='lower right', fontsize=16)
plt.show()
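
CAP plots often also include a perfect-model line as an upper bound, which the plot above does not draw. A sketch of its coordinates, assuming binary 0/1 labels (the helper name perfect_cap_points is hypothetical):

```python
def perfect_cap_points(total, positives):
    """Corner points of the perfect-model CAP line.

    A perfect model ranks every positive first, so the line gains one
    positive per observation until all are found, then stays flat.
    """
    return [0, positives, total], [0, positives, positives]

# Example: 100 test observations, 30 of them positive.
x_perfect, y_perfect = perfect_cap_points(100, 30)
print(x_perfect, y_perfect)  # [0, 30, 100] [0, 30, 30]
```

In the plotting code, this could be drawn with plt.plot(*perfect_cap_points(total, sum_count), linestyle=':', label='Perfect Model') before the call to plt.legend.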

Interpreting CAP Curves

Diagonal Line: Represents the Random Model. A good model should stay above this line.
Model Curves: The curve closer to the top-left corner indicates a better-performing model.
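Area Under the Curve (AUC): Here this means the area under the CAP curve. The larger the area between a model’s curve and the random line, the better the model performs.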
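
One common way to turn the area comparison into a single number is the accuracy ratio: the area between the model curve and the random line, divided by the area between the perfect curve and the random line. The sketch below assumes binary 0/1 labels and CAP arrays shaped like those returned by CAP_gen; the function names area_under and accuracy_ratio are hypothetical.

```python
import numpy as np

def area_under(x, y):
    """Trapezoidal area under a piecewise-linear curve."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))

def accuracy_ratio(x_values, y_values):
    """(model area - random area) / (perfect area - random area)."""
    total = x_values[-1]
    positives = y_values[-1]
    a_model = area_under(x_values, y_values)
    a_random = area_under([0, total], [0, positives])
    a_perfect = area_under([0, positives, total], [0, positives, positives])
    return (a_model - a_random) / (a_perfect - a_random)

# A perfectly ranked toy curve: 2 positives among 4 observations.
x_vals = np.arange(5)
y_vals = np.array([0, 1, 2, 2, 2])
print(accuracy_ratio(x_vals, y_vals))  # 1.0
```

A value near 1 means the curve is close to the perfect model and a value near 0 means it is close to random. The ratio is undefined when the test set contains no positives or only positives, because the perfect and random areas then coincide.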

Multiclass Classification with CAP Curves

While CAP curves are traditionally used for binary classification, they can be adapted for multiclass problems. Here’s how to implement CAP curves in a multiclass setting using a Bengali music genre dataset (bangla.csv).

Data Overview

The bangla.csv dataset comprises 31 features representing various audio characteristics and a target variable label indicating the music genre. The genres include categories like rabindra, adhunik, and others.

Preprocessing Steps

The preprocessing steps remain largely similar to binary classification, with emphasis on encoding the multiclass target variable.

# Label Encoding for multiclass target
y = LabelEncoderMethod(y)

# Proceed with feature selection, feature scaling, and splitting as before

Building Multiclass Models

The same models used for binary classification are applicable here. The key difference lies in evaluating their performance across multiple classes.

# Example with XGBoost
xgb_classifier = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')
xgb_classifier.fit(X_train, y_train)
y_pred_xgb = xgb_classifier.predict(X_test)
accuracy_xgb = accuracy_score(y_pred_xgb, y_test)
print(f'XGBoost Multiclass Accuracy: {accuracy_xgb}')

Generating CAP Curves for Multiclass Models

The CAP generation function remains unchanged. However, the interpretation differs slightly, because the target now takes more than two values.

# Recompute the totals for the multiclass test set
total = len(y_test)
sum_count = np.sum(y_test)  # sum of encoded labels, the end point of CAP_gen's cumulative sum

plt.figure(figsize=(10, 6))

# Generate CAP for GaussianNB
x_gnb, y_gnb = CAP_gen(gnb, X_test, y_test)
plt.plot(x_gnb, y_gnb, linewidth=3, label='GaussianNB')

# Generate CAP for XGBoost
x_xgb, y_xgb = CAP_gen(xgb_classifier, X_test, y_test)
plt.plot(x_xgb, y_xgb, linewidth=3, label='XGBoost')

# Random Model line
plt.plot([0, total], [0, sum_count], linestyle='--', label='Random Model')

# Plot aesthetics
plt.xlabel('Total Observations', fontsize=16)
plt.ylabel('CAP Values', fontsize=16)
plt.title('Cumulative Accuracy Profile for Multiclass Classification', fontsize=16)
plt.legend(loc='lower right', fontsize=16)
plt.show()

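Note: In multiclass scenarios, CAP curves are harder to interpret than in binary classification. Because CAP_gen accumulates the integer-encoded labels, the curve sums class indices rather than counting positives, so its shape depends on the arbitrary order in which the classes were encoded. The curves can still help compare models evaluated on the same test set, but they should be read as a rough guide.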
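
One way to make multiclass curves easier to read is a one-vs-rest view: pick a class, treat it as the positive class, and rank observations by the predicted probability of that class. This is a sketch of that idea, not the approach used in the code above. The function name cap_one_vs_rest is hypothetical, and proba is assumed to be an array like the output of model.predict_proba(X_test), with columns in label-encoded class order.

```python
import numpy as np

def cap_one_vs_rest(proba, y_true, class_index):
    """CAP x/y values treating `class_index` as the positive class."""
    proba = np.asarray(proba, dtype=float)
    # 1 where the observation belongs to the chosen class, else 0.
    is_positive = (np.asarray(y_true) == class_index).astype(int)
    # Rank by predicted probability of the chosen class, highest first.
    order = np.argsort(-proba[:, class_index], kind="stable")
    y_values = np.concatenate(([0], np.cumsum(is_positive[order])))
    x_values = np.arange(len(is_positive) + 1)
    return x_values, y_values

# Made-up probabilities for 4 observations and 3 classes.
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])
y_true = np.array([0, 1, 2, 0])
x_vals, y_vals = cap_one_vs_rest(proba, y_true, class_index=0)
print(y_vals.tolist())  # [0, 1, 2, 2, 2]
```

For each class, the matching random line runs from (0, 0) to (number of test observations, number of observations in that class), so every genre gets its own curve and its own baseline.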

Best Practices and Tips

Data Quality: Ensure your data is clean and well-preprocessed to avoid misleading CAP curves.
Model Diversity: Compare models with different underlying algorithms to identify the best performer.
Multiclass Considerations: Be cautious when interpreting CAP curves in multiclass settings; consider supplementing with other evaluation metrics like confusion matrices or F1 scores.
Avoid Overfitting: Use techniques like cross-validation and regularization to ensure your models generalize well to unseen data.
Stay Updated: Machine learning is an ever-evolving field. Stay abreast of the latest tools and best practices to refine your model evaluation strategies.

Conclusion

Comparing multiple machine learning models can be challenging, but tools like CAP curves simplify the process by providing clear visual insights into model performance. Whether you’re dealing with binary or multiclass classification, implementing CAP curves in Python equips you with a robust method to evaluate and select the best model for your data. Remember to prioritize data quality, understand the nuances of different models, and interpret CAP curves judiciously to harness their full potential in your machine learning endeavors.

Happy modeling!
