S29L04 – ROC, AUC – Calculating the Optimal Threshold (Best-Accuracy Method)

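Optimizing Binary Classification Models: A Comprehensive Guide to ROC, AUC, and Threshold Analysis
Get more out of your machine learning models by understanding ROC curves, the AUC metric, and optimal threshold selection. This guide walks through preprocessing, logistic regression modeling, and performance tuning on a real-world weather dataset.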

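Introduction
In machine learning, and in binary classification in particular, evaluating and tuning model performance is essential. Metrics such as the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) show how well a model separates the two classes. Adjusting the classification threshold can also change accuracy, F1 score, and other metrics. This article explains these concepts and applies them in a Jupyter Notebook example built on a real-world weather dataset.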

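Understanding the ROC Curve and AUC
What Is the ROC Curve?
The ROC curve is a plot of a binary classifier's diagnostic ability as its decision threshold varies. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR), one point per threshold setting.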

  True Positive Rate (TPR): also called recall or sensitivity, the proportion of actual positives that the model correctly identifies.
  
    \[
    \text{TPR} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
    \]
  
  False Positive Rate (FPR): the proportion of actual negatives that the model incorrectly labels as positive.
  
    \[
    \text{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}
    \]
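The two rates can be worked out by hand from the four confusion-matrix counts. The sketch below does so in plain Python; the helper names `confusion_counts` and `tpr_fpr` and the toy labels are made up for illustration and are not part of the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def tpr_fpr(y_true, y_pred):
    """Apply the TPR and FPR formulas to the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical toy labels: four actual positives and four actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

tpr, fpr = tpr_fpr(y_true, y_pred)
print(tpr, fpr)  # 0.75 0.25
```

Three of the four positives are caught (TPR 0.75) and one of the four negatives is flagged by mistake (FPR 0.25); this pair is a single point on the ROC curve.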
  

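What Is AUC?
The Area Under the Curve (AUC) summarizes in a single number how well the model separates the positive and negative classes across all thresholds. A higher AUC indicates better separation: 0.5 means no discriminative ability, equivalent to random guessing, and 1.0 means perfect separation.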
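One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python; `auc_by_ranking` and the toy scores are illustrative names and values, not part of the notebook.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one positive (0.35) is scored below one negative (0.4).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_ranking(y_true, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. A model that gives every example the same score gets 0.5, and one that ranks every positive above every negative gets 1.0.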

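Dataset Overview: Weather in Australia
This guide uses the Australian weather dataset, which contains a range of meteorological attributes. It has been reduced to 10,000 records so that it stays manageable while still illustrating the concepts.
Data source: the Australian weather dataset on Kaggle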

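Data Preprocessing
Sound preprocessing is essential for building robust machine learning models. The steps below outline the preprocessing applied to the Australian weather dataset.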
1. Importing Libraries and Data

import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve, auc, classification_report
			
data = pd.read_csv('weatherAUS - tiny.csv')
data.tail()

Sample output:

Date        Location      MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  ...  RainToday  RISK_MM  RainTomorrow
05/01/2012  CoffsHarbour  21.3     26.5     0.6       7.6          6.4       ...  No         0.0      No

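2. Selecting Features and Target
Split the dataset into features (X) and target (y).

# The last column (RainTomorrow) is the target; the columns before it are candidate features.
# RISK_MM is dropped: in the Kaggle dataset it is the amount of rain recorded for the next day,
# so keeping it would leak the target.
X = data.iloc[:, :-1].drop(columns='RISK_MM')
y = data.iloc[:, -1]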


3. Handling Missing Data
a. Numerical Features
Fill missing values in the numerical columns with the column mean.

# Positions of the integer and float columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X.iloc[:, numerical_cols])
# Replace each NaN with the mean of its column
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])


b. Categorical Features
Fill missing values in the categorical columns with the most frequent value.

# Positions of the string (object dtype) columns
string_cols = list(np.where((X.dtypes == object))[0])
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_freq.fit(X.iloc[:, string_cols])
# Replace each missing entry with the most common value of its column
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
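To make the two imputation strategies concrete, here is a minimal plain-Python sketch of what "mean" and "most frequent" filling do to a single column. The helper names and the sample values are invented for illustration; `SimpleImputer` is what the notebook actually uses.

```python
import math
from collections import Counter
from statistics import mean


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_mean(column):
    """Fill missing entries with the mean of the observed values."""
    fill = mean(v for v in column if not is_missing(v))
    return [fill if is_missing(v) else v for v in column]


def impute_most_frequent(column):
    """Fill missing entries with the most common observed value."""
    counts = Counter(v for v in column if not is_missing(v))
    fill = counts.most_common(1)[0][0]
    return [fill if is_missing(v) else v for v in column]


# Hypothetical columns with one gap each.
max_temp = [20.0, float("nan"), 18.0, 22.0]
rain_today = ["No", None, "Yes", "No"]

print(impute_mean(max_temp))             # [20.0, 20.0, 18.0, 22.0]
print(impute_most_frequent(rain_today))  # ['No', 'No', 'Yes', 'No']
```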
					
				
			
		


4. Encoding Categorical Variables
a. Label Encoding
Convert the categorical labels of the target variable to numeric values.

def LabelEncoderMethod(series):
    # Map each distinct label to an integer code
    le = LabelEncoder()
    return le.fit_transform(series)

y = LabelEncoderMethod(y)


b. One-Hot Encoding
Apply one-hot encoding to categorical features with more than two and at most `threshold` (10 by default) unique values. Binary and high-cardinality columns are label-encoded instead.

def OneHotEncoderMethod(indices, data):
    # One-hot encode the listed columns and pass the remaining columns through unchanged
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), indices)], remainder='passthrough')
    return columnTransformer.fit_transform(data)

def EncodingSelection(X, threshold=10):
    string_cols = list(np.where((X.dtypes == object))[0])
    one_hot_encoding_indices = []

    for col in string_cols:
        unique_values = len(pd.unique(X[X.columns[col]]))
        # Binary or high-cardinality columns get integer codes; the rest are one-hot encoded
        if unique_values == 2 or unique_values > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

X = EncodingSelection(X)
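The decision rule inside `EncodingSelection` can be isolated into a small function for a quick check. This is only a sketch of the branching logic; `choose_encoding` and the cardinalities in the dictionary are hypothetical and not taken from the dataset.

```python
def choose_encoding(n_unique, threshold=10):
    """Mirror the branch in EncodingSelection for a column with n_unique values."""
    if n_unique == 2 or n_unique > threshold:
        return "label"
    return "one-hot"


# Hypothetical column cardinalities, for illustration only.
cardinalities = {"binary_flag": 2, "season": 4, "wind_direction": 16}

for name, n_unique in cardinalities.items():
    print(name, n_unique, choose_encoding(n_unique))
# binary_flag 2 label
# season 4 one-hot
# wind_direction 16 label
```

Label-encoding the high-cardinality columns keeps the feature matrix narrow, at the cost of imposing an arbitrary ordering on their categories.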
					
				
			
		


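5. Feature Scaling and Selection
a. Feature Scaling
Standardize the features so that they share a common scale.

# with_mean=False skips centering; the one-hot encoded output can be a sparse matrix,
# and a sparse matrix cannot be centered
sc = StandardScaler(with_mean=False)
X = sc.fit_transform(X)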
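As a rough illustration of what scaling with `with_mean=False` amounts to, the sketch below divides a column by its population standard deviation without subtracting the mean. The function name and the sample column are made up; `StandardScaler` also copes with zero-variance columns, which this sketch does not.

```python
from statistics import pstdev


def scale_without_centering(column):
    """Divide by the population standard deviation; do not subtract the mean."""
    sd = pstdev(column)  # assumes the column is not constant (sd > 0)
    return [v / sd for v in column]


# Hypothetical mostly-zero column, as a one-hot or rainfall feature might be.
column = [0.0, 0.0, 2.0, 0.0, 4.0]
scaled = scale_without_centering(column)

print(scaled)          # zeros stay exactly zero
print(pstdev(scaled))  # approximately 1.0
```

Because zeros stay zero, a sparse matrix remains sparse after scaling, which is why centering is skipped.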
					
				
			
		


b. Feature Selection
Select the top 10 features according to the chi-squared (chi2) statistical test.

# chi2 expects non-negative feature values
kbest = SelectKBest(score_func=chi2, k=10)
X = kbest.fit_transform(X, y)
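For intuition about the score behind `SelectKBest`, here is a plain-Python sketch of a chi-squared statistic for one non-negative feature. To my understanding it follows the same idea as scikit-learn's `chi2` (compare per-class sums of the feature with the sums expected if feature and class were independent), but treat it as an approximation rather than the library's implementation. `chi2_score` and the toy data are hypothetical.

```python
def chi2_score(feature, labels):
    """Chi-squared statistic between one non-negative feature and the class labels."""
    total = sum(feature)
    n = len(labels)
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: how much of the feature's total mass falls in this class
        observed = sum(x for x, y in zip(feature, labels) if y == cls)
        # Expected under independence: class share of samples times the total mass
        expected = total * sum(1 for y in labels if y == cls) / n
        score += (observed - expected) ** 2 / expected
    return score


labels = [1, 0, 1, 0]
informative = [1, 0, 3, 0]     # non-zero only for class 1
uninformative = [2, 2, 2, 2]   # identical for both classes

print(chi2_score(informative, labels))    # 4.0
print(chi2_score(uninformative, labels))  # 0.0
```

A feature whose mass concentrates in one class scores high, while a feature spread evenly across classes scores zero, so `k=10` keeps the ten features with the largest such scores.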
					
				
			
		


6. Train/Test Split
Split the dataset into training and test sets so that the model can be evaluated on data it has not seen.

# Hold out 20% of the records for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)



Building and Evaluating the Logistic Regression Model
With preprocessing complete, we build a logistic regression model, evaluate it, and then tune it using ROC and AUC.
1. Training the Model

LRM = LogisticRegression(random_state=0, max_iter=500)
LRM.fit(X_train, y_train)
# predict() applies the default 0.5 probability cutoff
y_pred = LRM.predict(X_test)
# accuracy_score takes the true labels first
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")


Output:

Accuracy: 0.872


2. ROC Curve and AUC Calculation
Computing the ROC curve and its AUC gives a threshold-independent view of the model's performance.

# Column 1 of predict_proba holds the probability of the positive class
predicted_probabilities = LRM.predict_proba(X_test)
fpr, tpr, thresholds = roc_curve(y_test, predicted_probabilities[:, 1])
roc_auc = auc(fpr, tpr)
print(f"AUC: {roc_auc:.3f}")


Output:

AUC: 0.884
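For intuition about what `roc_curve` and `auc` compute, the sketch below builds the ROC points by sweeping every distinct score as a threshold and then integrates them with the trapezoidal rule. It is a simplified plain-Python version under the assumption of 0/1 labels; the function names and the four toy samples are illustrative and unrelated to the weather data.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, starting at (0, 0), for thresholds from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thresh in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thresh and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thresh and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
print(points)                 # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(trapezoid_auc(points))  # 0.75
```

Each threshold contributes one point, which is why the `thresholds` array returned by `roc_curve` is a natural list of candidate cutoffs for the tuning step that follows.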
					
				
			
		


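3. Optimizing the Classification Threshold
The default threshold of 0.5 does not always give the best performance. Adjusting it can improve accuracy and other metrics.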
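Changing the threshold only changes how probabilities are turned into labels; the fitted model itself stays the same. A minimal sketch, with a made-up helper `predict_with_threshold` and hypothetical probabilities:

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Label a sample 1 when its positive-class probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical positive-class probabilities for five samples.
probs = [0.12, 0.47, 0.52, 0.55, 0.91]

print(predict_with_threshold(probs))        # [0, 0, 1, 1, 1]
print(predict_with_threshold(probs, 0.54))  # [0, 0, 0, 1, 1]
```

Raising the cutoff from 0.5 to 0.54 flips the borderline sample at 0.52 from positive to negative, which is the kind of change that moves accuracy, precision, and recall.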
a. Computing Accuracy at Each Threshold

accuracies = []
for thresh in thresholds:
    # Label a sample positive when its class-1 probability reaches the candidate threshold
    _predictions = [1 if i >= thresh else 0 for i in predicted_probabilities[:, 1]]
    accuracies.append(accuracy_score(y_test, _predictions, normalize=True))

# Pair each threshold with its accuracy and put the most accurate first
accuracies = pd.concat([pd.Series(thresholds), pd.Series(accuracies)], axis=1)
accuracies.columns = ['threshold', 'accuracy']
accuracies.sort_values(by='accuracy', ascending=False, inplace=True)
print(accuracies.head())


Sample output:

     threshold  accuracy
78    0.547545    0.8760
76    0.560424    0.8755
114   0.428764    0.8755
112   0.432886    0.8755
110   0.433176    0.8755


b. Selecting the Optimal Threshold

# After sorting, the first row holds the threshold with the highest accuracy
optimal_proba_cutoff = accuracies['threshold'].iloc[0]
roc_predictions = [1 if i >= optimal_proba_cutoff else 0 for i in predicted_probabilities[:, 1]]


c. Evaluating with the Optimal Threshold

print("Classification Report with Optimal Threshold:")
# classification_report expects (y_true, y_pred), in that order
print(classification_report(y_test, roc_predictions))

Output:
The report below comes from a run that passed the predictions as the first argument, so its per-class precision and recall columns are interchanged and its support column counts predicted rather than actual labels.

              precision    recall  f1-score   support

           0       0.97      0.89      0.93      1770
           1       0.48      0.77      0.59       230

    accuracy                           0.88      2000
   macro avg       0.72      0.83      0.76      2000
weighted avg       0.91      0.88      0.89      2000


Comparison with the default threshold:

print("Classification Report with Default Threshold (0.5):")
# classification_report expects (y_true, y_pred), in that order
print(classification_report(y_test, y_pred))

Output:
As with the previous report, this output comes from a run that passed the predictions first, so per-class precision and recall are interchanged and support counts predicted labels.

              precision    recall  f1-score   support

           0       0.96      0.89      0.92      1740
           1       0.51      0.73      0.60       260

    accuracy                           0.87      2000
   macro avg       0.73      0.81      0.76      2000
weighted avg       0.90      0.87      0.88      2000


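Insights:

  Accuracy: the tuned threshold (about 0.548) raises test accuracy only slightly, from 87.2% to 87.6%.
  F1 score: the F1 score for class 1 does not improve; it moves from 0.60 to 0.59, so the gain in accuracy does not carry over to the positive class.
  Precision and recall: the class 1 figures shift by a few points between the two reports (0.51 and 0.73 at the default threshold, 0.48 and 0.77 at the tuned one), which indicates a trade between the two rather than an improvement in both.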
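The F1 figures can be checked by hand, since F1 is the harmonic mean of precision and recall. The short sketch below plugs in the class 1 values from the two reports; because those inputs are already rounded to two decimals, the results are approximate. `f1_from` is a helper name made up for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Class 1 figures as printed in the two classification reports.
default_f1 = f1_from(0.51, 0.73)
tuned_f1 = f1_from(0.48, 0.77)

print(round(default_f1, 2), round(tuned_f1, 2))  # 0.6 0.59
```

Because F1 is symmetric in its two inputs, it is unaffected by which of the two columns is labeled precision and which recall.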


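Best Practices for Threshold Optimization

  Understand the trade-offs: moving the threshold changes both sensitivity and specificity, so choose it according to the goals of the application.
  Use relevant metrics: depending on the problem, prioritize F1 score, precision, or recall rather than accuracy alone.
  Automate threshold selection: manual inspection is useful, but automated selection or cross-validation makes the choice more robust.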
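One simple way to automate the choice is to pick the threshold on a separate validation split and report the result on the test split, so that the test score is not flattered by tuning on the same data. The sketch below shows the idea in plain Python; the helper names and all labels and probabilities are hypothetical, and the walkthrough above tunes directly on the test set instead.

```python
def predict(probabilities, threshold):
    """Label a sample 1 when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def best_threshold(y_true, probabilities):
    """Try every distinct probability as a cutoff; ties go to the lowest cutoff."""
    candidates = sorted(set(probabilities))
    return max(candidates, key=lambda t: accuracy(y_true, predict(probabilities, t)))


# Hypothetical validation and test splits (labels and positive-class probabilities).
val_y = [0, 0, 1, 0, 1, 1]
val_p = [0.1, 0.3, 0.45, 0.6, 0.7, 0.9]
test_y = [0, 1, 0, 1]
test_p = [0.2, 0.5, 0.4, 0.8]

chosen = best_threshold(val_y, val_p)
print(chosen)                                    # 0.45
print(accuracy(test_y, predict(test_p, chosen)))  # 1.0
```

The same selection function could be run inside each fold of a cross-validation loop and the chosen thresholds averaged, which is one reading of the cross-validation suggestion above.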


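Conclusion
Optimizing a binary classification model is about more than reaching high accuracy. With ROC curves, the AUC metric, and deliberate threshold adjustment, practitioners can tune a model to meet specific performance criteria, so that it is reliable for the application at hand and not merely accurate on average.
Key takeaways:

  ROC and AUC give a view of model performance across all thresholds.
  Threshold optimization can shift model metrics to better match the needs of a specific application.
  Thorough preprocessing is the foundation of a robust and effective machine learning model.

Use these strategies to refine your models and get more actionable results from them.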

