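Mastering Classification Models: A Comprehensive Guide to Evaluation Techniques and Dataset Handling

Introduction

In machine learning, classification models play a key role in predicting categorical outcomes. Whether the task is separating spam from legitimate email, diagnosing disease, or gauging customer satisfaction, classification algorithms provide the basis for informed decisions. This article walks through building robust classification models with Python's ecosystem, covering data preprocessing, model training, evaluation, and the handling of varied datasets. It follows a Jupyter Notebook that serves as a master template for classification tasks, complete with evaluation metrics and adaptable to different datasets.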

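Understanding the Dataset

Before building any model, it is essential to understand the dataset at hand. This guide uses the Airline Passenger Satisfaction dataset from Kaggle. It covers a range of factors that influence passenger satisfaction, which makes it well suited to a classification task.

Loading the Data

We start by importing the necessary libraries and loading the dataset into a pandas DataFrame.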

import pandas as pd
import seaborn as sns

# Load datasets
data1 = pd.read_csv('Airline1.csv')
data2 = pd.read_csv('Airline2.csv')

# Concatenate datasets
data = pd.concat([data1, data2])
print(data.shape)

Output:

(129880, 25)

This shows that we have 129,880 records with 25 columns each.
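
The preprocessing code that follows operates on a feature matrix X and a target y, which the text does not define. A minimal sketch of that split on a toy DataFrame is shown here; the column names, including the label column satisfaction, are assumptions for illustration and should be replaced with the names in your copy of the dataset.

```python
import pandas as pd

# Toy stand-in for the concatenated airline data; all column names are assumed.
data = pd.DataFrame({
    "Age": [25, 41, 33, 58],
    "Class": ["Eco", "Business", "Eco Plus", "Business"],
    "satisfaction": ["neutral or dissatisfied", "satisfied",
                     "neutral or dissatisfied", "satisfied"],
})

target_col = "satisfaction"          # assumed name of the label column
X = data.drop(columns=[target_col])  # feature matrix (DataFrame)
y = data[target_col]                 # target vector (Series)

print(X.shape, y.shape)
```

Keeping X as a DataFrame at this stage matters because the later steps inspect X.dtypes to tell numeric columns from string columns.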

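Data Preprocessing

Data preprocessing is the foundation of good model performance. It includes cleaning the data, handling missing values, encoding categorical variables, selecting relevant features, and scaling the data for consistency.

Handling Missing Data

Numerical data:

For numerical columns, we fill missing values with the column mean.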

import numpy as np
from sklearn.impute import SimpleImputer

# Identify numerical columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Initialize imputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])

Categorical data:

For categorical columns, we fill missing values with the most frequent value.

# Identify string/object columns
# (np.object was removed in NumPy 1.24; the built-in object is equivalent)
string_cols = list(np.where(X.dtypes == object)[0])

# Initialize imputer
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit and transform
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
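
To see what the two strategies actually do, here is a small sketch on made-up arrays (the values are illustrative and not drawn from the airline data). It fills a numeric gap with the column mean and a categorical gap with the most frequent value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric column with one gap: the mean of 1.0 and 3.0 is 2.0.
nums = np.array([[1.0], [np.nan], [3.0]])
nums_filled = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(nums)

# Categorical column with one gap: "Eco" is the most frequent value.
cats = np.array([["Eco"], [np.nan], ["Eco"], ["Business"]], dtype=object)
cats_filled = SimpleImputer(missing_values=np.nan, strategy="most_frequent").fit_transform(cats)

print(nums_filled.ravel())
print(cats_filled.ravel())
```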

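Encoding Categorical Variables

Machine learning models require numerical input, so categorical variables must be encoded appropriately.

Label encoding:

Label encoding is efficient for binary categorical variables and for variables with a large number of categories, because it keeps each variable in a single column.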

from sklearn import preprocessing

def LabelEncoderMethod(series):
    le = preprocessing.LabelEncoder()
    le.fit(series)
    return le.transform(series)

# Encode target variable
y = LabelEncoderMethod(y)
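
As a quick illustration of what LabelEncoder produces, the sketch below encodes a few assumed label strings. Classes are stored in sorted order, so the integer codes follow alphabetical order rather than order of appearance. Unlike the helper above, this version keeps the fitted encoder, which makes it possible to map predictions back to the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label strings, for illustration only.
labels = ["satisfied", "neutral or dissatisfied", "satisfied"]

le = LabelEncoder()
codes = le.fit_transform(labels)

print(list(le.classes_))                  # classes are stored in sorted order
print(codes.tolist())                     # integer code for each input label
print(list(le.inverse_transform(codes)))  # round trip back to the strings
```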

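One-hot encoding:

For categorical variables with a limited number of categories, one-hot encoding prevents the model from inferring a numeric order between categories that does not exist.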

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), indices )], remainder='passthrough')
    return columnTransformer.fit_transform(data)
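
The layout of the transformed array is worth seeing once. In this sketch on a made-up two-column array, the one-hot columns come first and the passthrough columns follow. The sparse_threshold=0 argument is an addition for readability (it forces a dense result) and is not part of the helper above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Column 0 is categorical, column 1 is numeric (toy values).
toy = np.array([["Eco", 30], ["Business", 45], ["Eco", 52]], dtype=object)

ct = ColumnTransformer(
    [("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # force a dense array so the result prints cleanly
)
encoded = ct.fit_transform(toy)

# Columns: one-hot for "Business", one-hot for "Eco", then the numeric column.
print(encoded)
```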

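Encoding selection:

To choose the encoding strategy from the number of categories, we implement a selection mechanism: columns with exactly two categories, or with more categories than the threshold (10 by default), are label encoded, and the remaining columns are one-hot encoded.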

def EncodingSelection(X, threshold=10):
    # np.object was removed in NumPy 1.24; the built-in object is equivalent
    string_cols = list(np.where(X.dtypes == object)[0])
    one_hot_encoding_indices = []

    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            # Binary or high-cardinality column: label encode in place
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            # Low-cardinality column: defer to one-hot encoding
            one_hot_encoding_indices.append(col)

    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

X = EncodingSelection(X)
print(X.shape)

Output:

(129880, 26)
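
The decision rule inside EncodingSelection can be checked in isolation. The helper below is hypothetical (it is not part of the notebook) and simply mirrors the condition applied to each string column.

```python
def choose_encoding(n_unique, threshold=10):
    """Mirror the per-column rule in EncodingSelection (hypothetical helper)."""
    if n_unique == 2 or n_unique > threshold:
        return "label"
    return "one-hot"

# Try the rule at and around its boundaries.
for n in (2, 3, 10, 11):
    print(n, choose_encoding(n))
```

Note that the boundary is strict: a column with exactly 10 categories is still one-hot encoded, and only 11 or more switches to label encoding.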

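Feature Selection

Selecting the most relevant features can improve model performance and reduce computational cost. We use the chi-squared test for feature selection.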

from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

# Initialize
kbest = SelectKBest(score_func=chi2, k='all')
MMS = preprocessing.MinMaxScaler()
K_features = 10

# Apply transformations
x_temp = MMS.fit_transform(X)
x_temp = kbest.fit(x_temp, y)

# Select top K features
best_features = np.argsort(x_temp.scores_)[-K_features:]
features_to_delete = np.argsort(x_temp.scores_)[:-K_features]
X = np.delete(X, features_to_delete, axis=1)
print(X.shape)

Output:

(129880, 10)
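
The index arithmetic in the selection step is easy to misread, so here is a minimal sketch with made-up scores. It shows how np.argsort splits the columns into the k best and the rest, and that np.delete keeps the surviving columns in their original order. The MinMaxScaler pass in the code above is there because chi2 requires non-negative feature values.

```python
import numpy as np

scores = np.array([0.5, 12.0, 3.0, 7.5])  # made-up chi-squared scores, one per column
X_demo = np.arange(8).reshape(2, 4)       # 2 rows, 4 feature columns
k = 2

order = np.argsort(scores)  # column indices from lowest to highest score
keep = order[-k:]           # the k best-scoring columns
drop = order[:-k]           # every other column
X_reduced = np.delete(X_demo, drop, axis=1)

print(sorted(keep.tolist()), X_reduced.tolist())
```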

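Feature Scaling

Scaling puts all features on a comparable scale, so that features with large numeric ranges do not dominate distance-based or gradient-based models.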

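from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Split before scaling so the scaler is fitted on training data only.
# Assumption: an 80/20 split, which is consistent with the shapes printed
# below; the random_state value is illustrative, not from the original.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Initialize scaler
sc = preprocessing.StandardScaler(with_mean=False)
sc.fit(X_train)

# Transform features
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

print(X_train.shape)
print(X_test.shape)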

Output:

(103904, 10)
(25976, 10)
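
The scaler above is created with with_mean=False, which changes what "standardizing" means. A small check on made-up numbers: each column is divided by its standard deviation but is not centered, so the result has unit variance and a non-zero mean.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])  # toy values

sc_demo = StandardScaler(with_mean=False).fit(X_demo)
scaled = sc_demo.transform(X_demo)

# With with_mean=False each column is only divided by its standard deviation.
manual = X_demo / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.std(axis=0))   # unit variance per column
print(scaled.mean(axis=0))  # means are not zero because nothing was subtracted
```

Skipping the centering step keeps zero entries at zero, which is useful when the encoded matrix is sparse.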

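Building and Evaluating Classification Models

With the data preprocessed, we can now build and evaluate a range of classification models and compare their performance.

K-Nearest Neighbors (KNN) Classifier

KNN is a simple yet effective algorithm that classifies a data point by the majority label among its nearest neighbors.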

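from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize and train
knnClassifier = KNeighborsClassifier(n_neighbors=10)
knnClassifier.fit(X_train, y_train)

# Predict and evaluate
y_pred = knnClassifier.predict(X_test)
# scikit-learn's metric functions take (y_true, y_pred). The report printed
# below came from the original (y_pred, y_test) order, so its precision and
# recall columns are swapped and its support column counts predictions.
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))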

Output:

0.932668617185094
              precision    recall  f1-score   support

           No       0.96      0.92      0.94     15395
          Yes       0.90      0.94      0.92     10581

    accuracy                           0.93     25976
   macro avg       0.93      0.93      0.93     25976
weighted avg       0.93      0.93      0.93     25976

Interpretation:

The KNN classifier reaches an accuracy of 93.27%, a strong result for predicting passenger satisfaction.
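
To make the majority-vote idea concrete, here is a from-scratch sketch in plain Python. It is an illustration of the principle on made-up points, not how scikit-learn implements KNeighborsClassifier, and the function name knn_predict is hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2-D points forming two loose groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = ["No", "No", "No", "Yes", "Yes"]

print(knn_predict(points, labels, (0.15, 0.15)))
print(knn_predict(points, labels, (0.95, 1.0)))
```

Because the vote depends on Euclidean distance, KNN is sensitive to feature scale, which is one reason the scaling step comes before model training.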

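Logistic Regression

Logistic regression models the probability of a binary outcome, which makes it a natural fit for binary classification tasks.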

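from sklearn.linear_model import LogisticRegression

# Initialize and train
LRM = LogisticRegression()
LRM.fit(X_train, y_train)

# Predict and evaluate
y_pred = LRM.predict(X_test)
# Metrics take (y_true, y_pred); the report below came from the original
# (y_pred, y_test) order, so its precision and recall columns are swapped.
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))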

Output:

0.8557129658145981
              precision    recall  f1-score   support

           No       0.88      0.87      0.87     15068
          Yes       0.82      0.84      0.83     10908

    accuracy                           0.86     25976
   macro avg       0.85      0.85      0.85     25976
weighted avg       0.86      0.86      0.86     25976

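Interpretation:

Logistic regression reaches an accuracy of 85.57%, about 7.7 percentage points below KNN, but it remains a useful baseline for comparison.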
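
For intuition about how logistic regression turns a linear score into a class, the sketch below applies the sigmoid function by hand. The weights, bias, and feature vector are made up for illustration; a fitted model would learn its own coefficients.

```python
import math

def sigmoid(z):
    """Map a linear score z = w.x + b to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

weights, bias = [0.8, -0.4], 0.1  # made-up coefficients for illustration
x = [1.5, 2.0]                    # one scaled feature vector

z = sum(w * xi for w, xi in zip(weights, x)) + bias
p = sigmoid(z)
prediction = int(p >= 0.5)        # decision threshold of 0.5

print(round(z, 3), round(p, 3), prediction)
```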

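Gaussian Naive Bayes (GaussianNB)

GaussianNB is a probabilistic classifier based on Bayes' theorem. It assumes that the features are conditionally independent given the class and that each feature follows a Gaussian distribution within a class.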

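from sklearn.naive_bayes import GaussianNB

# Initialize and train
model_GNB = GaussianNB()
model_GNB.fit(X_train, y_train)

# Predict and evaluate
y_pred = model_GNB.predict(X_test)
# Metrics take (y_true, y_pred); the report below came from the original
# (y_pred, y_test) order, so its precision and recall columns are swapped.
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))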

Output:

0.828688019710502
              precision    recall  f1-score   support

           No       0.84      0.85      0.85     14662
          Yes       0.81      0.80      0.80     11314

    accuracy                           0.83     25976
   macro avg       0.83      0.82      0.83     25976
weighted avg       0.83      0.83      0.83     25976

Interpretation:

GaussianNB reaches an accuracy of 82.87%, showing that it is reasonably effective despite its simple underlying assumptions.
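
What GaussianNB learns is compact: a mean and variance per feature per class, plus class priors. The sketch below fits it on two made-up, well-separated clusters to show those learned quantities; the data is illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two made-up, well-separated clusters in a 2-feature space.
X_demo = np.array([[1.0, 1.2], [0.8, 1.0], [1.2, 0.9],
                   [4.0, 4.1], [3.9, 4.3], [4.2, 3.8]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

gnb = GaussianNB().fit(X_demo, y_demo)

print(gnb.theta_)        # per-class feature means
print(gnb.class_prior_)  # class priors estimated from label frequencies
print(gnb.predict([[1.1, 1.0], [4.1, 4.0]]))
```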

Support Vector Machine (SVM)

An SVM separates the classes with a hyperplane chosen to maximize the margin between them.

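from sklearn.svm import SVC

# Initialize and train
model_SVC = SVC()
model_SVC.fit(X_train, y_train)

# Predict and evaluate
y_pred = model_SVC.predict(X_test)
# Metrics take (y_true, y_pred); the report below came from the original
# (y_pred, y_test) order, so its precision and recall columns are swapped.
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))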

Output:

0.9325916230366492
              precision    recall  f1-score   support

           No       0.95      0.93      0.94     15033
          Yes       0.91      0.93      0.92     10943

    accuracy                           0.93     25976
   macro avg       0.93      0.93      0.93     25976
weighted avg       0.93      0.93      0.93     25976

Interpretation:

SVM performs on par with KNN at 93.26% accuracy, underscoring its robustness on classification tasks.
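
The hyperplane idea can be observed through decision_function, whose sign tells which side of the boundary a point falls on. This sketch uses made-up points and a linear kernel purely for readability; the article's SVC() call uses the default RBF kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up points forming two separable groups.
X_demo = np.array([[0.0, 0.0], [0.5, 0.3], [0.2, 0.6],
                   [2.0, 2.0], [2.4, 1.8], [1.9, 2.5]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

# kernel="linear" is chosen here only so the boundary is a plain hyperplane.
svm_demo = SVC(kernel="linear").fit(X_demo, y_demo)

queries = np.array([[0.3, 0.2], [2.1, 2.2]])
scores = svm_demo.decision_function(queries)  # negative: class 0, positive: class 1
preds = svm_demo.predict(queries)

print(scores, preds)
print(len(svm_demo.support_vectors_), "support vectors define the margin")
```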

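Decision Tree Classifier

A decision tree partitions the data according to feature values, forming a tree-shaped model of decisions.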

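from sklearn.tree import DecisionTreeClassifier

# Initialize and train
model_DTC = DecisionTreeClassifier(max_leaf_nodes=25, min_samples_split=4, random_state=42)
model_DTC.fit(X_train, y_train)

# Predict and evaluate
y_pred = model_DTC.predict(X_test)
# Metrics take (y_true, y_pred); the report below came from the original
# (y_pred, y_test) order, so its precision and recall columns are swapped.
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))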

Output:

0.9256621496766245
              precision    recall  f1-score   support

           No       0.95      0.92      0.94     15213
          Yes       0.90      0.93      0.91     10763

    accuracy                           0.93     25976
   macro avg       0.92      0.93      0.92     25976
weighted avg       0.93      0.93      0.93     25976

Interpretation:

The decision tree classifier reaches an accuracy of 92.57%, showing its ability to capture non-linear patterns in the data.
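
One advantage of a decision tree is that its learned rules can be printed. The sketch below fits a tree with the same hyperparameters on made-up data where the label depends only on the first feature, then prints the rules with export_text. The feature names f0 and f1 are placeholders.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: the label is 1 exactly when the first feature is 3 or more.
X_demo = [[1, 7], [2, 3], [3, 8], [4, 1], [1, 2], [4, 9]]
y_demo = [0, 0, 1, 1, 0, 1]

tree_demo = DecisionTreeClassifier(max_leaf_nodes=25, min_samples_split=4, random_state=42)
tree_demo.fit(X_demo, y_demo)

# Text rendering of the learned splits; f0 and f1 are placeholder names.
rules = export_text(tree_demo, feature_names=["f0", "f1"])
print(rules)
```

The max_leaf_nodes and min_samples_split settings cap how far the tree can grow, which limits overfitting on larger datasets.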

Random Forest Classifier

A random forest builds many decision trees and aggregates their predictions to improve accuracy and robustness.

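from sklearn.ensemble import RandomForestClassifier

# Initialize and train
model_RFC = RandomForestClassifier(n_estimators=500, max_depth=5)
model_RFC.fit(X_train, y_train)

# Predict and evaluate
y_pred = model_RFC.predict(X_test)
# Metrics take (y_true, y_pred); the report below came from the original
# (y_pred, y_test) order, so its precision and recall columns are swapped.
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))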

Output:

0.9181937172774869
              precision    recall  f1-score   support

           No       0.93      0.93      0.93     14837
          Yes       0.90      0.91      0.90     11139

    accuracy                           0.92     25976
   macro avg       0.92      0.92      0.92     25976
weighted avg       0.92      0.92      0.92     25976

Interpretation:

The random forest reaches an accuracy of 91.82%, using ensemble learning to balance bias and variance.
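
The aggregation step can be verified directly: in scikit-learn, the forest's predicted class probabilities are the average of the probabilities from its individual trees. The sketch below checks this on synthetic data; the dataset and the smaller n_estimators value are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))       # synthetic features
y_demo = (X_demo[:, 0] > 0).astype(int)  # label driven by the first feature

rf_demo = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=0)
rf_demo.fit(X_demo, y_demo)

sample = X_demo[:5]
forest_proba = rf_demo.predict_proba(sample)
# Average the per-tree probabilities by hand.
tree_avg = np.mean([tree.predict_proba(sample) for tree in rf_demo.estimators_], axis=0)

print(np.allclose(forest_proba, tree_avg))
```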

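AdaBoost Classifier

AdaBoost combines multiple weak classifiers into a strong one, with each round focusing on the instances that were previously misclassified.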

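from sklearn.ensemble import AdaBoostClassifier

# Initialize and train
model_ABC = AdaBoostClassifier()
model_ABC.fit(X_train, y_train)

# Predict and evaluate
y_pred = model_ABC.predict(X_test)
# Metrics take (y_true, y_pred); the report below came from the original
# (y_pred, y_test) order, so its precision and recall columns are swapped.
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))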

Output:

0.9101863258392362
              precision    recall  f1-score   support

           No       0.93      0.92      0.92     14977
          Yes       0.89      0.90      0.89     10999

    accuracy                           0.91     25976
   macro avg       0.91      0.91      0.91     25976
weighted avg       0.91      0.91      0.91     25976

Interpretation:

AdaBoost reaches an accuracy of 91.02%, showing how boosting can improve model performance.

XGBoost Classifier

XGBoost is a highly optimized gradient boosting framework known for its performance and speed.

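import xgboost as xgb

# Initialize and train
model_xgb = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
model_xgb.fit(X_train, y_train)

# Predict and evaluate
y_pred = model_xgb.predict(X_test)
# Metrics take (y_true, y_pred); the report below came from the original
# (y_pred, y_test) order, so its precision and recall columns are swapped.
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))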

Output:

0.9410994764397905
              precision    recall  f1-score   support

           No       0.96      0.94      0.95     15122
          Yes       0.92      0.94      0.93     10854

    accuracy                           0.94     25976
   macro avg       0.94      0.94      0.94     25976
weighted avg       0.94      0.94      0.94     25976

Interpretation:

XGBoost leads with an accuracy of 94.11%, the highest of the eight models evaluated here.
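
Collecting the accuracies reported in the sections above makes the comparison easier to scan. The sketch below only sorts and prints those reported numbers; it does not retrain anything.

```python
# Test-set accuracies as reported in the sections above.
accuracies = {
    "KNN": 0.932668617185094,
    "Logistic Regression": 0.8557129658145981,
    "GaussianNB": 0.828688019710502,
    "SVM": 0.9325916230366492,
    "Decision Tree": 0.9256621496766245,
    "Random Forest": 0.9181937172774869,
    "AdaBoost": 0.9101863258392362,
    "XGBoost": 0.9410994764397905,
}

# Rank the models from highest to lowest accuracy.
ranking = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<20} {acc:.2%}")
```

KNN and SVM differ by less than 0.01 percentage points, so their ordering should not be read as meaningful without repeated runs or cross-validation.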

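Conclusion

Building effective classification models depends on careful data preprocessing, sound feature selection, and choosing an algorithm suited to the task. Using the Jupyter Notebook master template, we explored a range of classification algorithms, each with its own strengths. From K-Nearest Neighbors and logistic regression to ensemble techniques such as random forests and XGBoost, the toolkit is rich and adapts to varied datasets.

By following this guide, data scientists and enthusiasts can streamline their machine learning workflow and obtain robust models with informative evaluations. Remember that the foundation of any successful model is understanding and preparing the data before turning to algorithmic complexity.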

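Key takeaways:

Data quality matters: handling missing data effectively and encoding categorical variables correctly are critical to model accuracy.
Feature selection improves performance: identifying and keeping the most relevant features can noticeably improve results and reduce computational overhead.
Different algorithms offer different strengths: trying several classifiers supports an informed choice based on model strengths and dataset characteristics.
Continuous evaluation is essential: regularly assessing models with metrics such as accuracy, precision, recall, and F1 score keeps them aligned with project goals.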
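
The four metrics named in the last takeaway all derive from the confusion-matrix counts. As a worked sketch, the hypothetical helper below computes them from made-up counts (not taken from the airline results).

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up counts for illustration, not taken from the airline results.
acc, prec, rec, f1 = binary_metrics(tp=90, fp=10, fn=30, tn=70)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Precision and recall can diverge even when accuracy looks healthy, which is why the per-class report is worth reading alongside the single accuracy figure.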

Use these techniques to build predictive models that not only perform well but also yield meaningful insights into your data.
