S16L02 – Master Template for Regression Models – Models and Evaluation

Mastering Car Price Prediction with Advanced Regression Models: A Comprehensive Guide

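Table of Contents

Introduction
Dataset Overview
Data Import and Initial Exploration
Data Cleaning and Preprocessing
    Handling Missing Numerical Data
    Handling Missing Categorical Data
Feature Selection and Encoding
    Dropping Irrelevant Features
    One-Hot Encoding of Categorical Variables
Train-Test Split
Feature Scaling
Building and Evaluating Regression Models
    1. Linear Regression
    2. Polynomial Linear Regression
    3. Decision Tree Regression
    4. Random Forest Regression
    5. AdaBoost Regression
    6. XGBoost Regression
    7. Support Vector Regression (SVR)
Model Performance Comparison
Conclusion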




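Introduction

Predictive analytics enables businesses to anticipate future trends, optimize operations, and improve decision-making. Car price prediction is a typical example: machine learning models can predict a vehicle's price from attributes such as make, engine specifications, and fuel type. This guide walks you through building a complete regression workflow, from data preprocessing to evaluating multiple regression algorithms.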

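Dataset Overview

The Car Price Prediction dataset on Kaggle contains 205 records with 26 columns each. These columns describe various aspects of a car, such as the number of doors, engine size, horsepower, and fuel type, all of which influence its market price.

Key features:

CarName: name of the car (make and model)
fueltype: type of fuel used (for example, gas or diesel)
aspiration: engine aspiration type
doornumber: number of doors (two or four)
enginesize: engine size
horsepower: engine power
price: market price of the car (target variable)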


Data Import and Initial Exploration

First, we import the dataset with pandas and take a first look at its structure.





		
		
			
			
import pandas as pd

# Load the dataset
data = pd.read_csv('CarPrice.csv')

# Display the first five rows
print(data.head())
					
				
			
		



Sample output:




		
		
			
			
   car_ID  symboling                   CarName fueltype aspiration doornumber  \
0       1          3        alfa-romero giulia      gas        std        two
1       2          3       alfa-romero stelvio      gas        std        two
2       3          1  alfa-romero Quadrifoglio      gas        std        two
3       4          2               audi 100 ls      gas        std       four
4       5          2                audi 100ls      gas        std       four

      carbody drivewheel enginelocation  wheelbase  ...  horsepower  peakrpm citympg  \
0  convertible        rwd          front       88.6  ...       111.0     5000      21
1  convertible        rwd          front       88.6  ...       111.0     5000      21
2    hatchback        rwd          front       94.5  ...       154.0     5000      19
3        sedan        fwd          front       99.8  ...       102.0     5500      24
4        sedan        4wd          front       99.4  ...       115.0     5500      18

   highwaympg    price
0          27  13495.0
1          27  16500.0
2          26  16500.0
3          30  13950.0
4          22  17450.0
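The snippets in the following sections operate on `X` and `Y`, which the walkthrough never defines. A minimal sketch of one way to create them, assuming the target `price` is the last column (as in the sample output) and every other column is a feature. The small DataFrame is a made-up stand-in for `data`, not the real file.

```python
import pandas as pd

# Made-up stand-in for the loaded dataset (three rows, four columns).
data = pd.DataFrame({
    "car_ID": [1, 2, 3],
    "fueltype": ["gas", "gas", "diesel"],
    "horsepower": [111.0, 154.0, 102.0],
    "price": [13495.0, 16500.0, 13950.0],
})

# Assumption: the target is the last column and everything else is a feature.
X = data.iloc[:, :-1].copy()  # .copy() so later in-place edits do not touch `data`
Y = data.iloc[:, -1]

print(X.shape, Y.shape)
```

With the real file, the same two slicing lines would give `X` 25 columns, which becomes the 24 reported later once `car_ID` is dropped.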
					
				
			
		



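Data Cleaning and Preprocessing

Handling Missing Numerical Data

Missing values can significantly degrade the performance of machine learning models. We first handle missing numerical data by filling it with the column mean.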





		
		
			
			
import numpy as np
from sklearn.impute import SimpleImputer

# X is assumed to be the feature DataFrame (every column except price)

# Positions of the numerical (int64 / float64) columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Fit a mean imputer on the numerical columns
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X.iloc[:, numerical_cols])

# Replace missing numerical values with the column means
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
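To make the effect of mean imputation concrete, here is a small sketch on a made-up frame with one missing value. It uses `select_dtypes` to pick the numeric columns by name, which is an alternative to the positional lookup used in the walkthrough and not the author's method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame with one missing horsepower value.
toy = pd.DataFrame({
    "horsepower": [100.0, np.nan, 140.0],
    "fueltype": ["gas", "diesel", "gas"],
})

# select_dtypes picks numeric columns without comparing dtypes by hand.
num_cols = toy.select_dtypes(include="number").columns

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy[num_cols] = imputer.fit_transform(toy[num_cols])

# The gap is filled with the mean of the observed values: (100 + 140) / 2 = 120.
print(toy["horsepower"].tolist())
```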
					
				
			
		



Handling Missing Categorical Data

For categorical variables, missing values are filled using the most-frequent strategy.





		
		
			
			
# Positions of the categorical (object dtype) columns
string_cols = list(np.where(X.dtypes == object)[0])

# Fit a most-frequent imputer on the categorical columns
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_freq.fit(X.iloc[:, string_cols])

# Replace missing categorical values with the most frequent category
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
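The most-frequent strategy can be checked on a tiny made-up column. This is only an illustration of what `SimpleImputer` does with `strategy='most_frequent'`; the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up categorical column with one missing entry; "four" is the most common value.
toy = pd.DataFrame({
    "doornumber": pd.Series(["four", np.nan, "two", "four"], dtype=object),
})

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(toy)

# The missing entry takes the most common category.
print(filled.ravel().tolist())
```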
					
				
			
		



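Feature Selection and Encoding

Dropping Irrelevant Features

The car_ID column is a unique identifier and contributes nothing to the model's predictive power, so we remove it.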





		
		
			
			
# Drop the 'car_ID' identifier column
X.drop('car_ID', axis=1, inplace=True)
					
				
			
		



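One-Hot Encoding of Categorical Variables

Machine learning algorithms require numerical input, so we convert the categorical variables using one-hot encoding.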





		
		
			
			
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Re-identify the categorical columns, since positions shifted after dropping 'car_ID'
string_cols = list(np.where(X.dtypes == object)[0])

# One-hot encode the categorical columns and pass the numerical ones through unchanged
columnTransformer = ColumnTransformer(
    [('encoder', OneHotEncoder(), string_cols)],
    remainder='passthrough'
)
X = columnTransformer.fit_transform(X)
					
				
			
		



Before encoding:

Shape: (205, 24)

After encoding:

Shape: (205, 199)
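The jump from 24 to 199 columns follows from how one-hot encoding works: every distinct category value becomes its own column, so a high-cardinality column such as CarName likely accounts for most of the growth. A minimal sketch on made-up rows, checking the encoded width against the count of distinct categories plus the untouched numeric columns:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: two categorical columns and one numeric column.
toy = pd.DataFrame({
    "CarName": ["audi 100 ls", "bmw x1", "bmw x3", "audi 100 ls"],
    "fueltype": ["gas", "gas", "diesel", "gas"],
    "horsepower": [102.0, 121.0, 182.0, 102.0],
})

cat_cols = ["CarName", "fueltype"]
encoder = ColumnTransformer(
    [("encoder", OneHotEncoder(), cat_cols)],
    remainder="passthrough",
)
encoded = encoder.fit_transform(toy)

# One output column per distinct category, plus the numeric columns passed through.
n_numeric = toy.shape[1] - len(cat_cols)
expected_width = sum(toy[c].nunique() for c in cat_cols) + n_numeric

print(encoded.shape, expected_width)
```

Here 3 car names + 2 fuel types + 1 numeric column give 6 encoded columns.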


Train-Test Split

Splitting the dataset into training and test sets is essential for evaluating model performance.





		
		
			
			
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, random_state=1
)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
					
				
			
		



Output:




		
		
			
			
Training set shape: (164, 199)
Testing set shape: (41, 199)
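The reported shapes follow directly from the split ratio: 20% of 205 rows is 41 test rows, leaving 164 for training. A quick check using placeholder arrays of the same shape (the values are zeros, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the encoded data (205 rows, 199 columns).
X_demo = np.zeros((205, 199))
y_demo = np.arange(205)  # row indices, so the two parts can be checked for overlap

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)

print(X_tr.shape, X_te.shape)
```

With only 41 test rows, a single R² value can shift noticeably with a different `random_state`.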
					
				
			
		



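Feature Scaling

Feature scaling puts all features on a comparable scale so that no single feature dominates because of its units. Here we use standardization.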





		
		
			
			
from sklearn import preprocessing

# with_mean=False skips centering, which is required when the encoded X is a sparse matrix
sc = preprocessing.StandardScaler(with_mean=False)

# Fit on the training set only, so no test-set statistics leak into the scaler
sc.fit(X_train)

# Apply the same scaling to both sets
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
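The `with_mean=False` argument matters because one-hot encoded output is typically a sparse matrix, and centering would turn its zeros into non-zeros. A minimal sketch on a small hand-built sparse matrix (a stand-in for the encoded features), showing that scikit-learn rejects centering for sparse input and that scaling alone still gives unit variance:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Small sparse matrix standing in for one-hot encoded features plus one numeric column.
X_sparse = sparse.csr_matrix(np.array([
    [1.0, 0.0, 10.0],
    [0.0, 1.0, 20.0],
    [1.0, 0.0, 30.0],
    [0.0, 1.0, 40.0],
]))

# The default scaler (with_mean=True) raises ValueError for sparse input.
try:
    StandardScaler().fit(X_sparse)
    centering_allowed = True
except ValueError:
    centering_allowed = False

# with_mean=False only divides each column by its standard deviation.
scaled = StandardScaler(with_mean=False).fit_transform(X_sparse)
col_std = scaled.toarray().std(axis=0)

print(centering_allowed, col_std)
```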
					
				
			
		



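Building and Evaluating Regression Models

We will explore several regression models and evaluate each one by its R² score on the test set.

1. Linear Regression

Linear regression serves as the baseline model.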





		
		
			
			
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Linear Regression R² Score: {r2:.2f}")
					
				
			
		



R² score: 0.097

Interpretation: The model explains only about 9.7% of the variance in car prices.
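Since every model in this guide is judged by R², it helps to see how the score is computed: one minus the ratio of the model's squared error to the squared error of always predicting the mean. A short sketch with made-up prices and predictions, compared against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up prices and predictions, only to illustrate the formula.
y_true = np.array([10000.0, 15000.0, 20000.0, 25000.0])
y_hat = np.array([12000.0, 14000.0, 21000.0, 23000.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

# Always predicting the mean scores exactly 0; anything worse than that goes negative.
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))
r2_reversed = r2_score(y_true, y_true[::-1])

print(r2_manual, r2_score(y_true, y_hat), r2_mean, r2_reversed)
```

In this toy case the manual value is 0.92, the mean predictor scores 0, and the reversed predictions score -3, which shows that R² has no lower bound.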

2. Polynomial Linear Regression

To capture non-linear relationships, we introduce polynomial features.





		
		
			
			
from sklearn.preprocessing import PolynomialFeatures

# Expand the features to all degree-2 polynomial terms
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Train the model
model = LinearRegression()
model.fit(X_train_poly, y_train)

# Predict and evaluate
y_pred = model.predict(X_test_poly)
r2 = r2_score(y_test, y_pred)
print(f"Polynomial Linear Regression R² Score: {r2:.2f}")
					
				
			
		



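R² score: -0.45

Interpretation: The model performs worse than the baseline. A negative R² means its test-set predictions are worse than always predicting the mean price, so the score should not be read as a share of variance explained.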
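One plausible reason for this collapse is sheer dimensionality. A degree-2 expansion of n features produces C(n + 2, 2) columns (including the bias term), so the 199 encoded columns grow to 20,100 against only 164 training rows. The sketch below checks that counting formula against scikit-learn on a small input; `n_poly_features` is a helper name introduced here for illustration.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_features: int, degree: int) -> int:
    """Number of monomials of total degree <= degree, including the bias term."""
    return comb(n_features + degree, degree)


# Check the formula against scikit-learn on a small input (5 features, degree 2).
poly = PolynomialFeatures(degree=2).fit(np.zeros((1, 5)))
print(poly.n_output_features_, n_poly_features(5, 2))

# The same formula applied to the 199 encoded columns.
print(n_poly_features(199, 2))
```

With far more columns than rows, a linear model can fit the training data almost perfectly while generalizing poorly, which would be consistent with the negative test score.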

3. Decision Tree Regression

Decision trees can model complex relationships by partitioning the data.





		
		
			
			
from sklearn.tree import DecisionTreeRegressor

# Initialize and train the model
model = DecisionTreeRegressor(max_depth=4)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Decision Tree Regression R² Score: {r2:.2f}")
					
				
			
		



R² score: 0.88

Interpretation: A significant improvement; the model explains 88% of the variance.
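The `max_depth=4` setting caps how finely the tree can partition the data: a binary tree of depth 4 has at most 2**4 = 16 leaves, so it can output at most 16 distinct prices. A small sketch on synthetic data (not the car dataset) comparing a depth-limited tree with an unrestricted one:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy non-linear function of one feature.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

shallow = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)
deep = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# The depth limit bounds the number of leaves; the unrestricted tree keeps splitting.
print(shallow.get_depth(), shallow.get_n_leaves(), deep.get_n_leaves())
```

Limiting depth is a simple form of regularization: the unrestricted tree can memorize the training rows, while the shallow one is forced to average over groups of them.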

4. Random Forest Regression

Random forests improve performance and reduce overfitting by aggregating many decision trees.





		
		
			
			
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the model
model = RandomForestRegressor(n_estimators=25, random_state=10)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Random Forest Regression R² Score: {r2:.2f}")
					
				
			
		



R² score: 0.91

Interpretation: Excellent performance; the model explains 91% of the variance.
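"Aggregating" has a concrete meaning for a regression forest: the prediction is the average of the individual trees' predictions. The sketch below verifies this on synthetic data from `make_regression` (a stand-in, not the car dataset) by averaging the fitted trees by hand.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data used only for illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=10).fit(X_demo, y_demo)

# Average the individual trees by hand and compare with the forest's own prediction.
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)
forest_pred = forest.predict(X_demo[:5])

print(np.allclose(manual_mean, forest_pred))
```

Because each tree is trained on a different bootstrap sample, their individual errors partly cancel in the average, which is where the reduction in overfitting comes from.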

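5. AdaBoost Regression

AdaBoost combines weak learners into a strong predictor by focusing each new learner on the errors of the previous ones.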





		
		
			
			
from sklearn.ensemble import AdaBoostRegressor

# Initialize and train the model
model = AdaBoostRegressor(random_state=0, n_estimators=100)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"AdaBoost Regression R² Score: {r2:.2f}")
					
				
			
		



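R² score: 0.88

Interpretation: Comparable to the decision tree; the model explains 88% of the variance.

6. XGBoost Regression

XGBoost is a powerful gradient boosting framework known for its efficiency and performance.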





		
		
			
			
import xgboost as xgb

# Initialize and train the model
model = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3,
    learning_rate=0.05
)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"XGBoost Regression R² Score: {r2:.2f}")
					
				
			
		



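R² score: 0.89

Interpretation: Robust performance; the model explains 89% of the variance.

7. Support Vector Regression (SVR)

SVR can work well in high-dimensional spaces, but it scales poorly to large datasets.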





		
		
			
			
from sklearn.svm import SVR

# Initialize and train the model with default hyperparameters
model = SVR()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Support Vector Regression (SVR) R² Score: {r2:.2f}")
					
				
			
		



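R² score: -0.03

Interpretation: Poor performance. The negative R² means the model predicts slightly worse than always predicting the mean price.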
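One possible contributor to this score, which the walkthrough does not examine, is that the target was left unscaled. With the default `C=1.0`, an SVR fitted on prices in the tens of thousands can barely move its predictions away from a constant. The sketch below illustrates the effect on synthetic data, using `TransformedTargetRegressor` to standardize the target for fitting. It is an assumption that the same mechanism applies to the car dataset; the data here is invented.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic "prices" in the tens of thousands, driven by two of five features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (13000 + 5000 * X_demo[:, 0] + 3000 * X_demo[:, 1]
          + rng.normal(scale=500, size=300))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Default SVR fitted directly on the raw target.
raw = SVR().fit(X_tr, y_tr)
r2_raw = r2_score(y_te, raw.predict(X_te))

# Same SVR, but the target is standardized for fitting and mapped back for prediction.
scaled = TransformedTargetRegressor(regressor=SVR(), transformer=StandardScaler())
scaled.fit(X_tr, y_tr)
r2_scaled = r2_score(y_te, scaled.predict(X_te))

print(f"raw target: {r2_raw:.2f}, standardized target: {r2_scaled:.2f}")
```

Tuning `C` and `epsilon` to the scale of the target would be another way to address the same issue.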

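Model Performance Comparison

Model                              R² Score
Linear Regression                  0.10
Polynomial Linear Regression       -0.45
Decision Tree Regression           0.88
Random Forest Regression           0.91
AdaBoost Regression                0.88
XGBoost Regression                 0.89
Support Vector Regression (SVR)    -0.03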



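Insights:

Random Forest Regression outperforms all other models with an R² score of 0.91, meaning it explains 91% of the variance in car prices on the test set.
Polynomial Linear Regression performs worst, scoring even below the baseline model, which may indicate overfitting or an unsuitable feature transformation.
Support Vector Regression (SVR) performs poorly on this dataset, possibly because of the high dimensionality after encoding and the untuned default hyperparameters.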
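The ranking above rests on a single 41-row test set, so small differences (for example 0.88 versus 0.89) may not be stable. A hedged sketch of one way to get a steadier comparison is k-fold cross-validation, which reports a mean and spread of R² per model. The data here is synthetic output from `make_regression`, and only three of the seven models are included to keep the example short.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with the same number of rows as the car dataset.
X_demo, y_demo = make_regression(n_samples=205, n_features=10, noise=15.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=25, random_state=10),
}

# Mean and spread of R² over 5 folds, instead of one fixed test set.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    results[name] = (scores.mean(), scores.std())

# Print the models from best to worst mean score.
for name, (mean, std) in sorted(results.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name}: {mean:.2f} +/- {std:.2f}")
```

The ordering on this synthetic, purely linear data says nothing about the car dataset; the point is the evaluation pattern, not the numbers.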


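Conclusion

Predictive modeling for car prices underscores the importance of choosing the right algorithm and preprocessing the data thoroughly. In our exploration:

The Decision Tree and Random Forest models performed well, with Random Forest slightly ahead.
Ensemble methods such as AdaBoost and XGBoost also delivered strong results, highlighting their effectiveness on complex datasets.
Linear models, especially when extended with polynomial features, do not always improve performance and can even degrade it.
Support Vector Regression (SVR) may not be well suited to high-dimensional datasets or to cases where non-linear patterns are not pronounced.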


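Key takeaways:

Data preprocessing: Handling missing values and encoding categorical variables are key steps that significantly affect model performance.
Feature scaling: Putting features on a comparable scale improves the efficiency of gradient-based algorithms.
Model selection: Ensemble methods such as Random Forest and XGBoost often perform very well on regression tasks.
Model evaluation: The R² score is a valuable metric for assessing how closely predictions match actual outcomes.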
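The preprocessing steps in this guide were applied one at a time, with the imputers and the encoder fitted on the full dataset before the split. A hedged alternative is to wrap the same steps in a scikit-learn `Pipeline`, so that everything is fitted on the training rows only. This is a sketch on made-up rows, not the author's workflow; the column names mirror the dataset but the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows in the spirit of the car dataset.
X_demo = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", "gas", "diesel", "gas"],
    "horsepower": [111.0, 102.0, np.nan, 154.0, 115.0, 110.0],
    "enginesize": [130, 109, 136, 152, 131, 120],
})
y_demo = np.array([13495.0, 13950.0, 17450.0, 16500.0, 15250.0, 14100.0])

# Numeric columns: mean imputation, then standardization.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
# Categorical columns: most-frequent imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, make_column_selector(dtype_include=np.number)),
    ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=25, random_state=10)),
])

# Fitting on the training rows only keeps test rows out of the imputer and scaler statistics.
model.fit(X_demo.iloc[:4], y_demo[:4])
preds = model.predict(X_demo.iloc[4:])
print(preds)
```

`handle_unknown="ignore"` lets the encoder cope with categories that appear only in the test rows, which matters for a high-cardinality column such as CarName.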


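Using advanced regression models for car price prediction not only improves predictive accuracy but also gives stakeholders actionable insight into market dynamics. As machine learning continues to evolve, staying current with the latest algorithms and techniques remains essential for data enthusiasts and professionals alike.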