Implementing Cumulative Accuracy Profile (CAP) Curves in Python: A Comprehensive Guide

In the realm of machine learning and data science, evaluating the performance of classification models is paramount. Among various evaluation metrics, the Cumulative Accuracy Profile (CAP) Curve stands out for its intuitive visualization of model performance, especially in binary and multi-class classification problems. This comprehensive guide delves into the concept of CAP Curves, their significance, and a step-by-step implementation using Python. Whether you’re a seasoned data scientist or a budding enthusiast, this article will equip you with the knowledge to harness CAP Curves effectively.
Table of Contents
- Introduction to CAP Curves
- Understanding the Importance of CAP Curves
- Data Preparation for CAP Curve Implementation
- Handling Missing Data
- Encoding Categorical Variables
- Feature Selection and Scaling
- Building and Evaluating Classification Models
- Generating the CAP Curve
- Comparing Multiple Models Using CAP Curves
- Conclusion
- References
1. Introduction to CAP Curves
The Cumulative Accuracy Profile (CAP) Curve is a graphical tool for evaluating the performance of classification models. With observations ordered from the highest to the lowest predicted likelihood of being positive, it plots the cumulative number of positive instances captured against the number of observations considered, giving a visual picture of how well the model prioritizes true positives.
Key Features of CAP Curves:
- Intuitive Visualization: Offers a clear depiction of model performance compared to random selection.
- Model Comparison: Facilitates the comparison of multiple models on the same dataset.
- Performance Metric: The area under the CAP Curve, typically normalized into the accuracy ratio (AR), serves as a single-number summary of model performance (a minimal computation sketch follows this list).
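To make this concrete, here is a minimal, self-contained sketch of the computation, using made-up scores and labels purely for illustration: instances are sorted by predicted score from highest to lowest, and the actual positives are accumulated in that order.

```python
import numpy as np

# Hypothetical predicted scores and true binary labels, for illustration only
scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7, 0.2])
y_true = np.array([1, 0, 1, 0, 1, 0])

# Sort instances by predicted score, highest first
order = np.argsort(scores)[::-1]

# CAP coordinates: x = observations inspected so far, y = positives captured so far
x_cap = np.arange(0, len(y_true) + 1)
y_cap = np.append([0], np.cumsum(y_true[order]))

print(x_cap)  # [0 1 2 3 4 5 6]
print(y_cap)  # [0 1 2 3 3 3 3]
```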
2. Understanding the Importance of CAP Curves
CAP Curves are particularly beneficial in scenarios where the order of predictions matters, such as in customer targeting or fraud detection. By visualizing how quickly a model accumulates positive instances, stakeholders can assess the model’s effectiveness in prioritizing high-value predictions.
Advantages of Using CAP Curves:
- Assessing Model Performance: Quickly gauges how well a model performs relative to a random model.
- Decision-Making Tool: Aids in selecting the optimal model based on visual performance.
- Versatility: Applicable to both binary and multi-class classification problems.
3. Data Preparation for CAP Curve Implementation
Proper data preparation is crucial for accurate model evaluation and CAP Curve generation. Here’s a walkthrough of the data preprocessing steps using Python’s Pandas and Scikit-learn libraries.
Step-by-Step Data Preparation:
- Importing Libraries:
```python
import pandas as pd
import seaborn as sns
```
- Loading the Dataset:
```python
data = pd.read_csv('bangla.csv')
data.tail()
```
Sample Output:
```
                                              file_name  zero_crossing  ...
1737  Tumi Robe Nirobe, Artist - DWIJEN MUKHOPADHYA...          78516  ...
1738  TUMI SANDHYAR MEGHMALA Srikanta Acharya Rabi...          176887  ...
```
- Separating Features and Target:
```python
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
```
4. Handling Missing Data
Missing data can skew model performance. It’s essential to address missing values before training.
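Before imputing, it helps to see how much data is actually missing. A quick check (a small sketch, assuming `X` is the feature DataFrame created above):

```python
# Count missing values per column and show only the columns that have any
missing_counts = X.isnull().sum()
print(missing_counts[missing_counts > 0])
```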
Handling Numeric Missing Values:
```python
import numpy as np
from sklearn.impute import SimpleImputer

# Mean-impute the numeric columns
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
```
Handling Categorical Missing Values:
```python
# Mode-impute the categorical (string) columns
string_cols = list(np.where(X.dtypes == object)[0])
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_mode.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_mode.transform(X.iloc[:, string_cols])
```
5. Encoding Categorical Variables
Machine learning models require numerical input. Encoding categorical variables is pivotal for model training.
One-Hot Encoding Method:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    # One-hot encode the columns at the given indices; pass the rest through unchanged
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)], remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)
```
Label Encoding Method:
```python
from sklearn import preprocessing

def LabelEncoderMethod(series):
    # Map each category to an integer label
    le = preprocessing.LabelEncoder()
    le.fit(series)
    return le.transform(series)
```
Applying Encoding:
```python
y = LabelEncoderMethod(y)

def EncodingSelection(X, threshold=10):
    # Label-encode binary and high-cardinality columns; one-hot encode the rest
    string_cols = list(np.where(X.dtypes == object)[0])
    one_hot_encoding_indices = []
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

X = EncodingSelection(X)
X.shape  # Output: (1742, 30)
```
6. Feature Selection and Scaling
Selecting relevant features and scaling ensures model efficiency and accuracy.
Feature Selection:
```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

K_features = 10
kbest = SelectKBest(score_func=chi2, k=K_features)

# chi2 requires non-negative inputs, so scale to [0, 1] before scoring
MMS = preprocessing.MinMaxScaler()
x_temp = MMS.fit_transform(X)
x_temp = kbest.fit(x_temp, y)

# Keep the K highest-scoring features and delete the rest
best_features = np.argsort(x_temp.scores_)[-K_features:]
features_to_delete = np.argsort(x_temp.scores_)[:-K_features]
X = np.delete(X, features_to_delete, axis=1)
X.shape  # Output: (1742, 10)
del x_temp
```
Feature Scaling:
Note that the scaler is fitted on the training split only, so the train-test split shown in the next section must already be in place:

```python
from sklearn import preprocessing

# Fit the scaler on the training data only, then apply the same transform to both splits
sc = preprocessing.StandardScaler(with_mean=False)
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
```
7. Building and Evaluating Classification Models
Multiple classification models are trained to evaluate their performance using CAP Curves.
Train-Test Split:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
```
Building Models:
- K-Nearest Neighbors (KNN):
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knnClassifier = KNeighborsClassifier(n_neighbors=3)
knnClassifier.fit(X_train, y_train)
y_pred_knn = knnClassifier.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_pred_knn)  # Output: 0.6475
```
- Logistic Regression:
```python
from sklearn.linear_model import LogisticRegression

LRM = LogisticRegression(random_state=0, max_iter=200)
LRM.fit(X_train, y_train)
y_pred_lr = LRM.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)  # Output: ~0.63
```
- Gaussian Naive Bayes:
```python
from sklearn.naive_bayes import GaussianNB

model_GNB = GaussianNB()
model_GNB.fit(X_train, y_train)
y_pred_gnb = model_GNB.predict(X_test)
gnb_accuracy = accuracy_score(y_test, y_pred_gnb)  # Output: 0.831
```
- Support Vector Machine (SVC):
```python
from sklearn.svm import SVC

model_SVC = SVC()
model_SVC.fit(X_train, y_train)
y_pred_svc = model_SVC.predict(X_test)
svc_accuracy = accuracy_score(y_test, y_pred_svc)  # Output: 0.8765
```
- Decision Tree Classifier:
```python
from sklearn.tree import DecisionTreeClassifier

model_DTC = DecisionTreeClassifier()
model_DTC.fit(X_train, y_train)
y_pred_dtc = model_DTC.predict(X_test)
dtc_accuracy = accuracy_score(y_test, y_pred_dtc)  # Output: 0.8175
```
- Random Forest Classifier:
```python
from sklearn.ensemble import RandomForestClassifier

model_RFC = RandomForestClassifier(n_estimators=500, max_depth=5)
model_RFC.fit(X_train, y_train)
y_pred_rfc = model_RFC.predict(X_test)
rfc_accuracy = accuracy_score(y_test, y_pred_rfc)  # Output: 0.8725
```
- AdaBoost Classifier:
```python
from sklearn.ensemble import AdaBoostClassifier

model_ABC = AdaBoostClassifier()
model_ABC.fit(X_train, y_train)
y_pred_abc = model_ABC.predict(X_test)
abc_accuracy = accuracy_score(y_test, y_pred_abc)  # Output: 0.8725
```
- XGBoost Classifier:
```python
import xgboost as xgb

model_xgb = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
model_xgb.fit(X_train, y_train)
y_pred_xgb = model_xgb.predict(X_test)
xgb_accuracy = accuracy_score(y_test, y_pred_xgb)  # Output: 0.8715
```
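As an optional convenience (not part of the original walkthrough), the accuracy scores computed above can be collected in one place for a quick side-by-side look before turning to the CAP Curves:

```python
# Hypothetical summary of the accuracy scores computed above
accuracies = {
    'KNN': knn_accuracy,
    'Logistic Regression': lr_accuracy,
    'GaussianNB': gnb_accuracy,
    'SVC': svc_accuracy,
    'Decision Tree': dtc_accuracy,
    'Random Forest': rfc_accuracy,
    'AdaBoost': abc_accuracy,
    'XGBoost': xgb_accuracy,
}
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {acc:.4f}')
```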
8. Generating the CAP Curve
The CAP Curve is plotted to visualize model performance against a random model.
Plotting the Random Model:
```python
import matplotlib.pyplot as plt

# Total number of test samples
total = len(y_test)

# Total number of positive instances
sum_count = np.sum(y_test)

plt.figure(figsize=(10, 6))

# Plotting the random model
plt.plot([0, total], [0, sum_count], color='blue', linestyle='--', label='Random Model')
plt.legend()
plt.xlabel('Total Observations', fontsize=16)
plt.ylabel('CAP Values', fontsize=16)
plt.title('Cumulative Accuracy Profile', fontsize=16)
plt.show()
```
Plotting the Logistic Regression Model:
```python
# Predicting using Logistic Regression
pred_lr = LRM.predict(X_test)

x_values = np.arange(0, total + 1)

# Sort the test instances by predicted value, highest first
sorted_zip = sorted(zip(pred_lr, y_test), reverse=True)

# Accumulate the actual target values in that order to get the CAP values
cap = [o for p, o in sorted_zip]
y_values = np.append([0], np.cumsum(cap))

# Plotting the CAP Curve
plt.figure(figsize=(10, 6))
plt.plot(x_values, y_values, color='blue', linewidth=3, label='Logistic Regression')
plt.plot([0, total], [0, sum_count], linestyle='--', label='Random Model')
plt.xlabel('Total Observations', fontsize=16)
plt.ylabel('CAP Values', fontsize=16)
plt.title('Cumulative Accuracy Profile', fontsize=16)
plt.legend(loc='lower right', fontsize=16)
plt.show()
```
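For reference, a perfect model would capture every positive instance first, so its CAP rises with slope one until all `sum_count` positives are found and then runs flat. This extra line is not part of the original walkthrough, but as a sketch it can be added to the same figure before `plt.show()`:

```python
# Hypothetical addition: the perfect model finds all positives within the
# first sum_count observations, then stays flat at sum_count
plt.plot([0, sum_count, total], [0, sum_count, sum_count],
         color='grey', linewidth=2, label='Perfect Model')
```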

9. Comparing Multiple Models Using CAP Curves
By plotting CAP Curves for multiple models, one can visually assess and compare their performance.
Defining a CAP Generation Function:
```python
def CAP_gen(model, X_test=X_test, y_test=y_test):
    # Sort test instances by the model's predictions, highest first,
    # and accumulate the actual target values to build the CAP curve
    pred = model.predict(X_test)
    sorted_zip = sorted(zip(pred, y_test), reverse=True)
    cap = [o for p, o in sorted_zip]
    y_values = np.append([0], np.cumsum(cap))
    x_values = np.arange(0, len(y_test) + 1)
    return (x_values, y_values)
```
Plotting Multiple CAP Curves:
```python
plt.figure(figsize=(10, 6))

# Plot CAP for Gaussian Naive Bayes
x_gnb, y_gnb = CAP_gen(model_GNB)
plt.plot(x_gnb, y_gnb, linewidth=3, label='GaussianNB')

# Plot CAP for XGBoost
x_xgb, y_xgb = CAP_gen(model_xgb)
plt.plot(x_xgb, y_xgb, linewidth=3, label='XGBoost')

# Plot CAP for AdaBoost
x_abc, y_abc = CAP_gen(model_ABC)
plt.plot(x_abc, y_abc, linewidth=3, label='AdaBoost')

# Plotting the random model
plt.plot([0, total], [0, sum_count], linestyle='--', label='Random Model')

plt.xlabel('Total Observations', fontsize=16)
plt.ylabel('CAP Values', fontsize=16)
plt.title('Cumulative Accuracy Profile', fontsize=16)
plt.legend(loc='lower right', fontsize=16)
plt.show()
```

From the CAP Curves, models like XGBoost and SVM (SVC) showcase superior performance with larger areas under their respective curves, indicating higher efficacy in prioritizing true positive predictions compared to the random model.
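To back the visual comparison with a number, the area under each CAP curve can be approximated and normalized against the random and perfect baselines, giving an accuracy-ratio-style score between 0 and 1. The sketch below assumes `CAP_gen`, `total`, and `sum_count` from the code above; the helper name `cap_area_ratio` is ours, and any fitted model (including `model_SVC`) can be passed to it in the same way:

```python
def cap_area_ratio(model):
    # Area between the model's CAP curve and the random diagonal,
    # normalized by the area between the perfect and random curves
    x_values, y_values = CAP_gen(model)
    model_area = np.trapz(y_values, x_values)
    random_area = 0.5 * total * sum_count
    perfect_area = np.trapz([0, sum_count, sum_count], [0, sum_count, total])
    return (model_area - random_area) / (perfect_area - random_area)

for name, model in [('GaussianNB', model_GNB), ('XGBoost', model_xgb), ('AdaBoost', model_ABC)]:
    print(f'{name}: {cap_area_ratio(model):.3f}')
```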
10. Conclusion
The Cumulative Accuracy Profile (CAP) Curve is a potent tool for evaluating and comparing classification models. Its ability to provide a clear visualization of model performance relative to a random baseline makes it invaluable in decision-making processes, especially in business-critical applications like fraud detection and customer segmentation.
By following the steps outlined in this guide—from data preprocessing and handling missing values to encoding categorical variables and building robust models—you can effectively implement CAP Curves in Python to gain deeper insights into your models’ performance.
Embracing CAP Curves not only enhances your model evaluation strategy but also elevates the interpretability of complex machine learning models, bridging the gap between data science and actionable business intelligence.
11. References
- Scikit-learn Documentation on Imputing Missing Values
- Scikit-learn Documentation on Feature Selection
- Understanding Cumulative Accuracy Profile (CAP) Curves
- XGBoost Documentation