S17L03 – K-Fold Cross-Validation Without GridSearchCV

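Mastering K-Fold Cross-Validation Without GridSearchCV: A Comprehensive Guide
Robust, reliable models are central to machine learning, and K-fold cross-validation is one of the basic techniques for checking that a model generalizes. Popular libraries such as Scikit-Learn provide tools like GridSearchCV that combine cross-validation with hyperparameter tuning, but there are cases where you may want to implement K-fold cross-validation by hand. This guide walks through how to understand and implement K-fold cross-validation in Python and a Jupyter notebook without relying on GridSearchCV.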

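Table of Contents

    Introduction to K-Fold Cross-Validation
    Understanding the Dataset
    Data Preprocessing
        Handling Missing Data
        Feature Selection
        Encoding Categorical Variables
        Feature Scaling
    Building Machine Learning Models
    Implementing K-Fold Cross-Validation Without GridSearchCV
    Best Practices and Tips
    Conclusion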



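Introduction to K-Fold Cross-Validation
K-fold cross-validation is a resampling technique for evaluating machine learning models on a limited data sample. The original dataset is partitioned into K non-overlapping subsets (folds). The model is trained on K-1 folds and validated on the remaining fold. This is repeated K times so that each fold serves as the validation set exactly once. The final performance metric is usually the mean of the K validation scores.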
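
The partitioning described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `k_fold_indices`, the seed, and the sample count are hypothetical and do not come from the guide.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k non-overlapping folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)      # shuffle the row indices once
    folds = np.array_split(shuffled, k)        # k roughly equal, disjoint folds
    for i in range(k):
        val_idx = folds[i]                     # fold i is the validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: train={len(train_idx)} samples, validate={sorted(val_idx.tolist())}")
```

Each of the 10 rows lands in exactly one validation fold, and every training set holds the remaining 8 rows.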

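Why Use K-Fold Cross-Validation?

    Robust evaluation: gives a more reliable estimate of model performance than a single train-test split.
    Overfitting detection: every score comes from data the model did not see during training, so a model that merely memorizes its training folds shows up as poor or unstable validation scores.
    Efficient use of data: every sample is used for both training and validation, which is especially useful when data is limited.

GridSearchCV bundles cross-validation with hyperparameter tuning, but knowing how to implement K-fold cross-validation by hand offers more flexibility and a clearer view of what happens during model training.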


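Understanding the Dataset
This guide uses the Car Price Prediction dataset from Kaggle. It describes various attributes of cars, with the goal of predicting their market price.

Dataset Overview

    Features: 25 (excluding the target variable)
        Numerical: engine size, horsepower, peak RPM, city MPG, highway MPG, and others.
        Categorical: car brand, fuel type, aspiration, number of doors, body style, drive-wheel configuration, and others.
    Target variable: price (continuous)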


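Initial Data Inspection
Before any preprocessing, it is important to inspect the dataset:

import pandas as pd

# Load the dataset
data = pd.read_csv('CarPrice.csv')
print(data.head())

# Assumption: 'price' is the target and every other column is a feature;
# the later snippets rely on X (features) and y (target) being defined this way
X = data.drop(columns=['price'])
y = data['price']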
					
				
			
		


Sample output:

car_ID  symboling  CarName                   fueltype  aspiration  doornumber  carbody      ...  highwaympg  price
1       3          alfa-romero giulia        gas       std         two         convertible  ...  27          13495.0
2       3          alfa-romero stelvio       gas       std         two         convertible  ...  27          16500.0
3       1          alfa-romero Quadrifoglio  gas       std         two         hatchback    ...  26          16500.0
4       2          audi 100 ls               gas       std         four        sedan        ...  30          13950.0
5       2          audi 100ls                gas       std         four        sedan        ...  22          17450.0
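
Beyond `head()`, a quick look at column types and missing-value counts helps plan the preprocessing steps. The sketch below uses a tiny stand-in frame with made-up values (it is not the Kaggle file), so the column contents are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the values are made up for illustration
data = pd.DataFrame({
    "car_ID": [1, 2, 3, 4],
    "fueltype": ["gas", "gas", None, "diesel"],
    "horsepower": [111.0, np.nan, 154.0, 102.0],
    "price": [13495.0, 16500.0, 16500.0, 13950.0],
})

print(data.dtypes)                       # which columns are numeric and which hold strings
missing_per_column = data.isnull().sum()
print(missing_per_column)                # number of missing entries in each column
```

Columns with a non-zero count are the ones the imputation step needs to handle.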
    



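Data Preprocessing
Effective preprocessing is essential for building accurate, efficient machine learning models. This section covers handling missing data, feature selection, encoding categorical variables, and feature scaling.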

Handling Missing Data

Numerical Features
Missing values in numerical features can be imputed with a strategy such as the mean, the median, or the most frequent value:

import numpy as np
from sklearn.impute import SimpleImputer

# Initialize imputer with mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Positional indices of the numerical (int64 / float64) columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Fit on the numerical columns and replace missing values with the column means
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
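
To see how the three strategies mentioned above differ, here is a small sketch on a single made-up column that contains one missing entry and one outlier. The values are assumptions chosen only to make the contrast visible.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numeric column with a missing entry and an outlier (made-up values)
values = np.array([[1.0], [2.0], [2.0], [np.nan], [100.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Row 3 is the missing entry; record what each strategy puts there
    filled[strategy] = float(imputer.fit_transform(values)[3, 0])
    print(strategy, "->", filled[strategy])
# mean -> 26.25, median -> 2.0, most_frequent -> 2.0
```

The mean is pulled toward the outlier, while the median and the most frequent value are not, which is one reason to consider the median for skewed columns.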
					
				
			
		



Categorical Features
For categorical data, missing entries can be replaced with the most frequent value:

import numpy as np
from sklearn.impute import SimpleImputer

# Positional indices of the string (object dtype) columns
string_cols = list(np.where((X.dtypes == object))[0])

# Initialize imputer with most frequent strategy
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit on the categorical columns and fill missing values with each column's mode
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
					
				
			
		



Feature Selection
Removing irrelevant or redundant features can improve model performance:

# Drop the 'car_ID' column: it is a row identifier, not a predictive feature
X.drop('car_ID', axis=1, inplace=True)
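
One simple way to spot candidates for removal is to count distinct values per column. The sketch below is a heuristic, not the guide's method, and it runs on a made-up stand-in frame: a column with one distinct value per row behaves like an identifier, and a column with a single value carries no information.

```python
import pandas as pd

# Made-up stand-in for the feature frame
X_demo = pd.DataFrame({
    "car_ID": [1, 2, 3, 4, 5],
    "fueltype": ["gas", "gas", "diesel", "gas", "gas"],
    "enginelocation": ["front"] * 5,
    "horsepower": [111, 111, 154, 102, 115],
})

n_unique = X_demo.nunique()
id_like = n_unique[n_unique == len(X_demo)].index.tolist()   # one distinct value per row
constant = n_unique[n_unique == 1].index.tolist()            # same value in every row
print("ID-like columns:", id_like)
print("Constant columns:", constant)
```

A continuous numeric feature can also be unique in every row, so flagged columns should be reviewed by hand rather than dropped automatically.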
					
				
			
		



Encoding Categorical Variables
Machine learning models require numerical input, so categorical variables must be encoded.

One-Hot Encoding
One-hot encoding converts each categorical variable into a set of binary indicator columns, one per category:

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Positional indices of the string columns to encode
string_cols = list(np.where((X.dtypes == object))[0])

# One-hot encode the string columns and pass the remaining columns through unchanged
columnTransformer = ColumnTransformer(
    [('encoder', OneHotEncoder(), string_cols)],
    remainder='passthrough'
)

# Apply the transformation; the result is an array or sparse matrix, not a DataFrame
X = columnTransformer.fit_transform(X)
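
The binary matrix is easier to picture on a tiny input. The following sketch encodes a single made-up fuel-type column; the data is an assumption for illustration, and the result is densified only so it can be printed.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up single categorical column
fuel = np.array([["gas"], ["diesel"], ["gas"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(fuel).toarray()   # default output is sparse; densify to print

print(encoder.categories_[0])   # categories in sorted order: ['diesel' 'gas']
print(encoded)                  # one row per sample, one column per category
```

Each row contains a single 1 in the column of its category, so "gas" becomes `[0, 1]` and "diesel" becomes `[1, 0]`.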
					
				
			
		



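Feature Scaling
Scaling puts numerical features on a comparable scale so that no feature dominates training simply because of its units.

Standardization
Standardization rescales each feature to zero mean and unit standard deviation. The code below passes with_mean=False, which skips the centering step and only divides by the standard deviation. This is required if the one-hot encoded matrix is sparse, because StandardScaler cannot center sparse input:

from sklearn.preprocessing import StandardScaler

# with_mean=False: scale to unit variance without centering (works on sparse input)
sc = StandardScaler(with_mean=False)

# X_train / X_test (and y_train / y_test used later) are assumed to come from an
# earlier train/test split of X and y that is not shown here.
# Fit on the training data only, then apply the same scaling to both sets
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)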
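
The scaling code relies on a train/test split that the guide does not show. A minimal sketch of split-then-scale on a made-up sparse matrix follows; `test_size`, `random_state`, and the synthetic data are illustrative assumptions, not values taken from the guide.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up sparse feature matrix (50 rows, 6 binary columns) and target
X_demo = sparse.csr_matrix(rng.integers(0, 2, size=(50, 6)).astype(float))
y_demo = rng.normal(size=50)

# test_size and random_state are illustrative choices
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train)   # statistics come from training rows only
X_test_scaled = scaler.transform(X_test)         # the same statistics are reused on held-out rows

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training rows only keeps information about the test rows out of the preprocessing step.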
					
				
			
		




Building Machine Learning Models
With the data preprocessed, several regression models can be built and evaluated.

Decision Tree Regressor

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Initialize the model
model = DecisionTreeRegressor(max_depth=4)

# Train on the training set and score on the held-out test set
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))

R² score: 0.884

Random Forest Regressor

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Initialize the model
model = RandomForestRegressor(n_estimators=25, random_state=10)

# Train and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))

R² score: 0.911

AdaBoost Regressor

from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score

# Initialize the model
model = AdaBoostRegressor(random_state=0, n_estimators=100)

# Train and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))

R² score: 0.881

XGBoost Regressor

import xgboost as xgb
from sklearn.metrics import r2_score

# Initialize the model
model = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3,
    learning_rate=0.05
)

# Train and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))

R² score: 0.895

Support Vector Regressor (SVR)

from sklearn.svm import SVR
from sklearn.metrics import r2_score

# Initialize the model with default hyperparameters
model = SVR()

# Train and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(r2_score(y_test, y_pred))

R² score: -0.027
Note: an R² score below 0 means the model does worse than a horizontal line, that is, worse than always predicting the mean of the target.
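
The note above follows directly from the definition R² = 1 - SS_res / SS_tot. The sketch below works it through on four made-up target values; the helper `r2_manual` is a hypothetical name introduced here and is checked against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Baseline: always predict the mean of the target (a horizontal line)
mean_pred = np.full_like(y_true, y_true.mean())
# A predictor that is worse than the baseline (the values in reverse order)
bad_pred = np.array([40.0, 30.0, 20.0, 10.0])

def r2_manual(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)        # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)     # total sum of squares around the mean
    return 1.0 - ss_res / ss_tot

print(r2_manual(y_true, mean_pred), r2_score(y_true, mean_pred))   # 0.0 for the baseline
print(r2_manual(y_true, bad_pred), r2_score(y_true, bad_pred))     # -3.0: worse than the baseline
```

Here SS_tot is 500; the mean predictor has SS_res = 500 (R² = 0) and the reversed predictor has SS_res = 2000 (R² = -3).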


Implementing K-Fold Cross-Validation Without GridSearchCV
Implementing K-fold cross-validation by hand gives fine-grained control over training and evaluation. The steps are as follows.

Step 1: Initialize KFold

from sklearn.model_selection import KFold

# 5 splits, shuffled, with a fixed random state for reproducibility
kf = KFold(n_splits=5, random_state=42, shuffle=True)
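
To see what this `KFold` object produces, the sketch below runs `kf.split` on 10 made-up samples and prints the row indices assigned to each fold. The demo array is an assumption; only the `KFold` settings match the guide.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)   # 10 made-up samples, 2 features each
kf = KFold(n_splits=5, random_state=42, shuffle=True)

validation_rows = []
for fold, (train_index, test_index) in enumerate(kf.split(X_demo)):
    # Each iteration yields positional row indices, not the rows themselves
    print(f"fold {fold}: train={train_index}, validate={test_index}")
    validation_rows.extend(test_index.tolist())
```

Because `split` returns positions, the same indices can be used on a NumPy array with `X[idx]` and on a pandas object with `.iloc[idx]`.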
					
				
			
		



Step 2: Define a Model-Building Function
Wrap model training and evaluation in a function so it can be reused for every fold:

from sklearn.metrics import r2_score

def build_model(X_train, X_test, y_train, y_test, model):
    # Fit on the training fold, predict on the validation fold, return its R² score
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return r2_score(y_test, y_pred)
					
				
			
		



Step 3: Run K-Fold Cross-Validation
Iterate over the folds, train the model on each, and collect the R² scores:

scores = []
for train_index, test_index in kf.split(X):
    # X is an array (or sparse matrix) after encoding, so it is indexed directly;
    # y is still a pandas Series, so it needs .iloc
    X_train_fold, X_test_fold = X[train_index], X[test_index]
    y_train_fold, y_test_fold = y.iloc[train_index], y.iloc[test_index]
    # `model` is whichever estimator was assigned last in the notebook
    score = build_model(X_train_fold, X_test_fold, y_train_fold, y_test_fold, model)
    scores.append(score)

print(scores)
					
				
			
		


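Sample output:

[-0.10198885010286984,
 -0.05769313782320418,
 -0.1910165707884004,
 -0.09880100338491071,
 -0.260272529471554]

Interpreting the scores: the R² score is negative in every fold, so the model does worse than simply predicting the mean price on each validation set. This points to problems such as overfitting, an unsuitable model, or unsuitable preprocessing (for example, unscaled inputs fed to a scale-sensitive estimator). If the notebook cells were run in order, `model` is still the default SVR from the previous section, whose single-split R² of -0.027 was already negative.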
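
One way to investigate scores like these is to move preprocessing inside the loop, so that each fold fits its own scaler on the training rows only. The sketch below does this for an SVR on synthetic data with features on very different scales. The data, the helper `cv_scores`, and the pipeline are assumptions for illustration and do not reproduce the car-price results.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Made-up regression data: three features on very different scales
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = X_demo[:, 0] + X_demo[:, 1] / 100.0 + X_demo[:, 2] / 10000.0

kf = KFold(n_splits=5, shuffle=True, random_state=42)

def cv_scores(make_model):
    """Run the manual K-fold loop with a fresh estimator per fold."""
    scores = []
    for train_index, test_index in kf.split(X_demo):
        model = make_model()
        model.fit(X_demo[train_index], y_demo[train_index])
        scores.append(r2_score(y_demo[test_index], model.predict(X_demo[test_index])))
    return scores

raw_scores = cv_scores(lambda: SVR())
scaled_scores = cv_scores(lambda: make_pipeline(StandardScaler(), SVR()))

print("SVR on raw features:  ", np.mean(raw_scores))
print("scaler + SVR per fold:", np.mean(scaled_scores))
```

Wrapping the scaler and the estimator in one pipeline means the scaler is refit on every training fold, so the validation fold never influences the scaling statistics.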

Step 4: Analyze the Results
Summarizing the cross-validation scores gives insight into the model's stability and generalization.

import numpy as np

# Mean and (population) standard deviation of the fold scores
mean_score = np.mean(scores)
std_score = np.std(scores)

print(f"Mean R² Score: {mean_score:.4f}")
print(f"Standard Deviation: {std_score:.4f}")

Sample output (for the five fold scores listed in Step 3):

Mean R² Score: -0.1420
Standard Deviation: 0.0734

Insights:

    The negative mean R² score shows that the model performs poorly: on average it is worse than predicting the mean price.
    The standard deviation of roughly 0.07 shows that the score varies from fold to fold (from about -0.06 to -0.26), yet every fold is negative, so the model is consistently poor rather than occasionally unlucky.
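
A useful sanity check on a hand-written loop is to compare it with scikit-learn's `cross_val_score` using the same `KFold` object. The sketch below does so on synthetic data from `make_regression`; the data and the decision-tree settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the car-price features
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: one fresh, identically seeded model per fold
manual_scores = []
for train_index, test_index in kf.split(X_demo):
    model = DecisionTreeRegressor(max_depth=4, random_state=0)
    model.fit(X_demo[train_index], y_demo[train_index])
    manual_scores.append(r2_score(y_demo[test_index], model.predict(X_demo[test_index])))

# Library helper using the same splitter and the same metric
library_scores = cross_val_score(
    DecisionTreeRegressor(max_depth=4, random_state=0), X_demo, y_demo, cv=kf, scoring="r2"
)

print(np.round(manual_scores, 4))
print(np.round(library_scores, 4))
```

With a fixed `random_state` on both the splitter and the model, the two lists should agree fold by fold.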



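Best Practices and Tips

    Stratified K-fold for classification: this guide focuses on regression, but for classification tasks use stratified K-fold so that the class distribution stays consistent across folds.
    Feature importance analysis: after training, examining feature importances helps show which features influence the target variable most.
    Hyperparameter tuning: even without GridSearchCV, you can tune hyperparameters manually by running the K-fold loop for each candidate setting and comparing the mean scores.
    Handling imbalanced datasets: in classification tasks, make sure the training and test splits preserve the class balance.
    Model selection: try several models to find the one that best suits the characteristics of the dataset.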
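
The manual tuning tip can be sketched as an outer loop over candidate values wrapped around the K-fold loop. The example below searches over `max_depth` for a decision tree on synthetic data; the candidate grid, the data, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=15.0, random_state=1)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

candidate_depths = [1, 2, 4, 8]   # illustrative grid of hyperparameter values
mean_scores = {}
for depth in candidate_depths:
    fold_scores = []
    for train_index, test_index in kf.split(X_demo):
        model = DecisionTreeRegressor(max_depth=depth, random_state=0)
        model.fit(X_demo[train_index], y_demo[train_index])
        fold_scores.append(r2_score(y_demo[test_index], model.predict(X_demo[test_index])))
    mean_scores[depth] = float(np.mean(fold_scores))   # average R² across the 5 folds

best_depth = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best max_depth:", best_depth)
```

Every candidate is evaluated on the same folds, so the mean scores are directly comparable. This is in essence what GridSearchCV automates.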



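Conclusion
K-fold cross-validation is an indispensable technique in the machine learning toolkit and a robust way to evaluate model performance. Implementing it by hand, as this guide does, gives you a clear view of the training process and full control over every evaluation step. Automated tools such as GridSearchCV are convenient, but understanding the underlying mechanics lets you handle more complex scenarios and tailor the validation process to your needs.
Used well, K-fold cross-validation makes the performance estimates for your predictive models more trustworthy, which in turn supports better-informed, data-driven decisions.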


Keywords: K-fold cross-validation, GridSearchCV, machine learning, model evaluation, Python, Jupyter Notebook, data preprocessing, regression models, cross-validation techniques, Scikit-Learn