Implementing Cumulative Accuracy Profile (CAP) Curves in Python: A Comprehensive Guide

In the realm of machine learning and data science, evaluating the performance of classification models is paramount. Among various evaluation metrics, the Cumulative Accuracy Profile (CAP) Curve stands out for its intuitive visualization of model performance, especially in binary and multi-class classification problems. This comprehensive guide delves into the concept of CAP Curves, their significance, and a step-by-step implementation using Python. Whether you’re a seasoned data scientist or a budding enthusiast, this article will equip you with the knowledge to harness CAP Curves effectively.
Table of Contents
- Introduction to CAP Curves
- Understanding the Importance of CAP Curves
- Data Preparation for CAP Curve Implementation
- Handling Missing Data
- Encoding Categorical Variables
- Feature Selection and Scaling
- Building and Evaluating Classification Models
- Generating the CAP Curve
- Comparing Multiple Models Using CAP Curves
- Conclusion
- References
1. Introduction to CAP Curves
The Cumulative Accuracy Profile (CAP) Curve is a graphical tool for evaluating the performance of classification models. With observations ordered from the highest to the lowest predicted likelihood of being positive, it plots the cumulative number of positive instances captured against the number of observations considered, giving a visual picture of how well the model prioritizes true positives.
Key Features of CAP Curves:
- Intuitive Visualization: Offers a clear depiction of model performance compared to random selection.
- Model Comparison: Facilitates the comparison of multiple models on the same dataset.
- Performance Metric: The area under the CAP Curve, typically normalized into the accuracy ratio (AR), serves as a single-number summary of model performance (a minimal computation sketch follows this list).
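To make this concrete, here is a minimal, self-contained sketch of the computation, using made-up scores and labels purely for illustration: instances are sorted by predicted score from highest to lowest, and the actual positives are accumulated in that order.

```python
import numpy as np

# Hypothetical predicted scores and true binary labels, for illustration only
scores = np.array([0.9, 0.1, 0.8, 0.4, 0.7, 0.2])
y_true = np.array([1, 0, 1, 0, 1, 0])

# Sort instances by predicted score, highest first
order = np.argsort(scores)[::-1]

# CAP coordinates: x = observations inspected so far, y = positives captured so far
x_cap = np.arange(0, len(y_true) + 1)
y_cap = np.append([0], np.cumsum(y_true[order]))

print(x_cap)  # [0 1 2 3 4 5 6]
print(y_cap)  # [0 1 2 3 3 3 3]
```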
2. Understanding the Importance of CAP Curves
CAP Curves are particularly beneficial in scenarios where the order of predictions matters, such as in customer targeting or fraud detection. By visualizing how quickly a model accumulates positive instances, stakeholders can assess the model’s effectiveness in prioritizing high-value predictions.
Advantages of Using CAP Curves:
- Assessing Model Performance: Quickly gauges how well a model performs relative to a random model.
- Decision-Making Tool: Aids in selecting the optimal model based on visual performance.
- Versatility: Applicable to both binary and multi-class classification problems.
3. Data Preparation for CAP Curve Implementation
Proper data preparation is crucial for accurate model evaluation and CAP Curve generation. Here’s a walkthrough of the data preprocessing steps using Python’s Pandas and Scikit-learn libraries.
Step-by-Step Data Preparation:
- Importing Libraries:
```python
import pandas as pd
import seaborn as sns
```
- Loading the Dataset:
```python
data = pd.read_csv('bangla.csv')
data.tail()
```
Sample Output:
```
                                              file_name  zero_crossing  ...
1737  Tumi Robe Nirobe, Artist - DWIJEN MUKHOPADHYA...          78516  ...
1738  TUMI SANDHYAR MEGHMALA Srikanta Acharya Rabi...          176887  ...
```
- Separating Features and Target:
```python
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
```
4. Handling Missing Data
Missing data can skew model performance. It’s essential to address missing values before training.
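Before imputing, it helps to see how much data is actually missing. A quick check (a small sketch, assuming `X` is the feature DataFrame created above):

```python
# Count missing values per column and show only the columns that have any
missing_counts = X.isnull().sum()
print(missing_counts[missing_counts > 0])
```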
Handling Numeric Missing Values:
```python
import numpy as np
from sklearn.impute import SimpleImputer

# Mean-impute the numeric columns
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
```
Handling Categorical Missing Values:
```python
# Mode-impute the categorical (string) columns
string_cols = list(np.where(X.dtypes == object)[0])
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_mode.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_mode.transform(X.iloc[:, string_cols])
```
5. Encoding Categorical Variables
Machine learning models require numerical input. Encoding categorical variables is pivotal for model training.
One-Hot Encoding Method:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    # One-hot encode the columns at the given indices; pass the rest through unchanged
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)], remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)
```
Label Encoding Method:
```python
from sklearn import preprocessing

def LabelEncoderMethod(series):
    # Map each category to an integer label
    le = preprocessing.LabelEncoder()
    le.fit(series)
    return le.transform(series)
```
Applying Encoding:
```python
y = LabelEncoderMethod(y)

def EncodingSelection(X, threshold=10):
    # Label-encode binary and high-cardinality columns; one-hot encode the rest
    string_cols = list(np.where(X.dtypes == object)[0])
    one_hot_encoding_indices = []
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

X = EncodingSelection(X)
X.shape  # Output: (1742, 30)
```
6. Feature Selection and Scaling
Selecting relevant features and scaling ensures model efficiency and accuracy.
Feature Selection:
```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

K_features = 10
kbest = SelectKBest(score_func=chi2, k=K_features)

# chi2 requires non-negative inputs, so scale to [0, 1] before scoring
MMS = preprocessing.MinMaxScaler()
x_temp = MMS.fit_transform(X)
x_temp = kbest.fit(x_temp, y)

# Keep the K highest-scoring features and delete the rest
best_features = np.argsort(x_temp.scores_)[-K_features:]
features_to_delete = np.argsort(x_temp.scores_)[:-K_features]
X = np.delete(X, features_to_delete, axis=1)
X.shape  # Output: (1742, 10)
del x_temp
```
Feature Scaling:
Note that the scaler is fitted on the training split only, so the train-test split shown in the next section must already be in place:

```python
from sklearn import preprocessing

# Fit the scaler on the training data only, then apply the same transform to both splits
sc = preprocessing.StandardScaler(with_mean=False)
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
```
7. Building and Evaluating Classification Models
Multiple classification models are trained to evaluate their performance using CAP Curves.
Train-Test Split:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
```
Building Models:
- K-Nearest Neighbors (KNN):
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knnClassifier = KNeighborsClassifier(n_neighbors=3)
knnClassifier.fit(X_train, y_train)
y_pred_knn = knnClassifier.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_pred_knn)  # Output: 0.6475
```
- Logistic Regression:
```python
from sklearn.linear_model import LogisticRegression

LRM = LogisticRegression(random_state=0, max_iter=200)
LRM.fit(X_train, y_train)
y_pred_lr = LRM.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)  # Output: ~0.63
```
- Gaussian Naive Bayes:
```python
from sklearn.naive_bayes import GaussianNB

model_GNB = GaussianNB()
model_GNB.fit(X_train, y_train)
y_pred_gnb = model_GNB.predict(X_test)
gnb_accuracy = accuracy_score(y_test, y_pred_gnb)  # Output: 0.831
```
- Support Vector Machine (SVC):
```python
from sklearn.svm import SVC

model_SVC = SVC()
model_SVC.fit(X_train, y_train)
y_pred_svc = model_SVC.predict(X_test)
svc_accuracy = accuracy_score(y_test, y_pred_svc)  # Output: 0.8765
```
- Decision Tree Classifier:
```python
from sklearn.tree import DecisionTreeClassifier

model_DTC = DecisionTreeClassifier()
model_DTC.fit(X_train, y_train)
y_pred_dtc = model_DTC.predict(X_test)
dtc_accuracy = accuracy_score(y_test, y_pred_dtc)  # Output: 0.8175
```
- Random Forest Classifier:
```python
from sklearn.ensemble import RandomForestClassifier

model_RFC = RandomForestClassifier(n_estimators=500, max_depth=5)
model_RFC.fit(X_train, y_train)
y_pred_rfc = model_RFC.predict(X_test)
rfc_accuracy = accuracy_score(y_test, y_pred_rfc)  # Output: 0.8725
```
- AdaBoost Classifier:
```python
from sklearn.ensemble import AdaBoostClassifier

model_ABC = AdaBoostClassifier()
model_ABC.fit(X_train, y_train)
y_pred_abc = model_ABC.predict(X_test)
abc_accuracy = accuracy_score(y_test, y_pred_abc)  # Output: 0.8725
```
- XGBoost Classifier:
```python
import xgboost as xgb

model_xgb = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
model_xgb.fit(X_train, y_train)
y_pred_xgb = model_xgb.predict(X_test)
xgb_accuracy = accuracy_score(y_test, y_pred_xgb)  # Output: 0.8715
```
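As an optional convenience (not part of the original walkthrough), the accuracy scores computed above can be collected in one place for a quick side-by-side look before turning to the CAP Curves:

```python
# Hypothetical summary of the accuracy scores computed above
accuracies = {
    'KNN': knn_accuracy,
    'Logistic Regression': lr_accuracy,
    'GaussianNB': gnb_accuracy,
    'SVC': svc_accuracy,
    'Decision Tree': dtc_accuracy,
    'Random Forest': rfc_accuracy,
    'AdaBoost': abc_accuracy,
    'XGBoost': xgb_accuracy,
}
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {acc:.4f}')
```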
8. Generating the CAP Curve
The CAP Curve is plotted to visualize model performance against a random model.
Plotting the Random Model:
```python
import matplotlib.pyplot as plt

# Total number of test samples
total = len(y_test)

# Total number of positive instances
sum_count = np.sum(y_test)

plt.figure(figsize=(10, 6))

# Plotting the random model
plt.plot([0, total], [0, sum_count], color='blue', linestyle='--', label='Random Model')
plt.legend()
plt.xlabel('Total Observations', fontsize=16)
plt.ylabel('CAP Values', fontsize=16)
plt.title('Cumulative Accuracy Profile', fontsize=16)
plt.show()
```
Plotting the Logistic Regression Model:
```python
# Predicting using Logistic Regression
pred_lr = LRM.predict(X_test)

x_values = np.arange(0, total + 1)

# Sort the test instances by predicted value, highest first
sorted_zip = sorted(zip(pred_lr, y_test), reverse=True)

# Accumulate the actual target values in that order to get the CAP values
cap = [o for p, o in sorted_zip]
y_values = np.append([0], np.cumsum(cap))

# Plotting the CAP Curve
plt.figure(figsize=(10, 6))
plt.plot(x_values, y_values, color='blue', linewidth=3, label='Logistic Regression')
plt.plot([0, total], [0, sum_count], linestyle='--', label='Random Model')
plt.xlabel('Total Observations', fontsize=16)
plt.ylabel('CAP Values', fontsize=16)
plt.title('Cumulative Accuracy Profile', fontsize=16)
plt.legend(loc='lower right', fontsize=16)
plt.show()
```
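For reference, a perfect model would capture every positive instance first, so its CAP rises with slope one until all `sum_count` positives are found and then runs flat. This extra line is not part of the original walkthrough, but as a sketch it can be added to the same figure before `plt.show()`:

```python
# Hypothetical addition: the perfect model finds all positives within the
# first sum_count observations, then stays flat at sum_count
plt.plot([0, sum_count, total], [0, sum_count, sum_count],
         color='grey', linewidth=2, label='Perfect Model')
```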

9. Comparing Multiple Models Using CAP Curves
By plotting CAP Curves for multiple models, one can visually assess and compare their performance.
Defining a CAP Generation Function:
```python
def CAP_gen(model, X_test=X_test, y_test=y_test):
    # Sort test instances by the model's predictions, highest first,
    # and accumulate the actual target values to build the CAP curve
    pred = model.predict(X_test)
    sorted_zip = sorted(zip(pred, y_test), reverse=True)
    cap = [o for p, o in sorted_zip]
    y_values = np.append([0], np.cumsum(cap))
    x_values = np.arange(0, len(y_test) + 1)
    return (x_values, y_values)
```
Plotting Multiple CAP Curves:
```python
plt.figure(figsize=(10, 6))

# Plot CAP for Gaussian Naive Bayes
x_gnb, y_gnb = CAP_gen(model_GNB)
plt.plot(x_gnb, y_gnb, linewidth=3, label='GaussianNB')

# Plot CAP for XGBoost
x_xgb, y_xgb = CAP_gen(model_xgb)
plt.plot(x_xgb, y_xgb, linewidth=3, label='XGBoost')

# Plot CAP for AdaBoost
x_abc, y_abc = CAP_gen(model_ABC)
plt.plot(x_abc, y_abc, linewidth=3, label='AdaBoost')

# Plotting the random model
plt.plot([0, total], [0, sum_count], linestyle='--', label='Random Model')

plt.xlabel('Total Observations', fontsize=16)
plt.ylabel('CAP Values', fontsize=16)
plt.title('Cumulative Accuracy Profile', fontsize=16)
plt.legend(loc='lower right', fontsize=16)
plt.show()
```

From the CAP Curves, models like XGBoost and SVM (SVC) showcase superior performance with larger areas under their respective curves, indicating higher efficacy in prioritizing true positive predictions compared to the random model.
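To back the visual comparison with a number, the area under each CAP curve can be approximated and normalized against the random and perfect baselines, giving an accuracy-ratio-style score between 0 and 1. The sketch below assumes `CAP_gen`, `total`, and `sum_count` from the code above; the helper name `cap_area_ratio` is ours, and any fitted model (including `model_SVC`) can be passed to it in the same way:

```python
def cap_area_ratio(model):
    # Area between the model's CAP curve and the random diagonal,
    # normalized by the area between the perfect and random curves
    x_values, y_values = CAP_gen(model)
    model_area = np.trapz(y_values, x_values)
    random_area = 0.5 * total * sum_count
    perfect_area = np.trapz([0, sum_count, sum_count], [0, sum_count, total])
    return (model_area - random_area) / (perfect_area - random_area)

for name, model in [('GaussianNB', model_GNB), ('XGBoost', model_xgb), ('AdaBoost', model_ABC)]:
    print(f'{name}: {cap_area_ratio(model):.3f}')
```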
10. Conclusion
The Cumulative Accuracy Profile (CAP) Curve is a potent tool for evaluating and comparing classification models. Its ability to provide a clear visualization of model performance relative to a random baseline makes it invaluable in decision-making processes, especially in business-critical applications like fraud detection and customer segmentation.
By following the steps outlined in this guide—from data preprocessing and handling missing values to encoding categorical variables and building robust models—you can effectively implement CAP Curves in Python to gain deeper insights into your models’ performance.
Embracing CAP Curves not only enhances your model evaluation strategy but also elevates the interpretability of complex machine learning models, bridging the gap between data science and actionable business intelligence.
11. References
- Scikit-learn Documentation on Imputing Missing Values
- Scikit-learn Documentation on Feature Selection
- Understanding Cumulative Accuracy Profile (CAP) Curves
- XGBoost Documentation