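A Comprehensive Guide to Data Preprocessing and Model Building for Machine Learning

1. Introduction

Data preprocessing is a critical stage of the machine learning workflow. It converts raw data into a format suitable for modeling, which improves the performance and accuracy of predictive models. This article walks step by step through data preprocessing and model building using a weather dataset from Kaggle.

2. Importing and Exploring the Data

Before preprocessing, it is essential to load the dataset and understand it.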

import pandas as pd
import seaborn as sns

# Load the dataset
data = pd.read_csv('weatherAUS.csv')

# Display the last five rows
print(data.tail())

Sample output:

        Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  WindGustSpeed WindDir9am  ... Humidity3pm  Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  RISK_MM  RainTomorrow  
142188 2017-06-20 Uluru     3.5     21.8       0.0          NaN       NaN          E           31.0        ESE  ...        27.0       1024.7       1021.2       NaN       NaN      9.4     20.9         No      0.0            No

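Understanding the structure of the dataset is essential for effective preprocessing. Calling .info() and .describe() on the DataFrame shows the column data types, non-null counts, and summary statistics.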
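
As a minimal sketch of that inspection step, the snippet below runs .info(), .describe(), and a missing-value count on a tiny hand-made frame. The frame `sample` and its values are hypothetical stand-ins for a few weather columns, not rows from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics a few weather columns
sample = pd.DataFrame({
    "MinTemp": [3.5, 7.4, np.nan, 12.9],
    "Rainfall": [0.0, 1.2, 0.0, np.nan],
    "RainToday": ["No", "Yes", "No", None],
})

# Column dtypes and non-null counts (written to stdout)
sample.info()

# Summary statistics; by default only numeric columns are included
summary = sample.describe()
print(summary)

# Number of missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)
```

The missing-value counts are a useful preview of how much imputation the next step will have to do.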

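3. Handling Missing Data

Missing data can distort the results of an analysis, so it must be handled appropriately.

Numerical Data

For numerical columns, missing values can be imputed with a strategy such as the mean, the median, or the mode.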

import numpy as np
from sklearn.impute import SimpleImputer

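# X is assumed to be the feature DataFrame taken from `data` (its construction is not shown in this article)
# Identify numerical columns by position
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])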

# Initialize the imputer with a mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform the data
X.iloc[:, numerical_cols] = imp_mean.fit_transform(X.iloc[:, numerical_cols])

Categorical Data

For categorical columns, missing values can be imputed with the most frequent value.

# Identify string columns
string_cols = list(np.where((X.dtypes == object))[0])

# Initialize the imputer with the most frequent strategy
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit and transform the data
X.iloc[:, string_cols] = imp_freq.fit_transform(X.iloc[:, string_cols])
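
To make the effect of both imputers visible, here is a small self-contained sketch on a hypothetical three-row frame. It selects columns with `select_dtypes` instead of the positional `np.where` lookup used above; that is a substitution made for brevity, not the article's method.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one gap in each column
toy = pd.DataFrame({
    "MinTemp": [10.0, np.nan, 14.0],
    "WindGustDir": pd.Series(["E", "E", np.nan], dtype=object),
})

# Split the columns by kind (assumption: everything non-numeric is categorical)
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

# Mean for numeric gaps, most frequent value for categorical gaps
toy[num_cols] = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])
toy[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])

print(toy)
# MinTemp becomes [10.0, 12.0, 14.0]; WindGustDir becomes ["E", "E", "E"]
```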

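4. Encoding Categorical Variables

Most machine learning models require numerical input, so categorical variables must be encoded appropriately.

Label Encoding

Label encoding converts category labels into integer values. It is suited to binary or ordinal data.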

from sklearn import preprocessing

def LabelEncoderMethod(series):
    le = preprocessing.LabelEncoder()
    return le.fit_transform(series)
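
A short illustration of what the encoder returns, using a hypothetical list of RainToday-style values:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column
rain_today = ["No", "Yes", "No", "Yes", "No"]

le = LabelEncoder()
codes = le.fit_transform(rain_today)

print(codes)        # [0 1 0 1 0]
print(le.classes_)  # ['No' 'Yes']

# The mapping is reversible
original = le.inverse_transform(codes)
```

LabelEncoder numbers the classes in sorted order, so for an ordinal column the integer order matches the real order only when the labels happen to sort that way.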

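One-Hot Encoding

One-hot encoding converts a categorical variable into a binary matrix with one column per category. It is suited to nominal data with multiple categories.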

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), indices)], remainder='passthrough')
    return columnTransformer.fit_transform(data)
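
The sketch below applies the same ColumnTransformer pattern to a hypothetical two-column frame so the resulting matrix can be inspected. Passing `sparse_threshold=0` is an addition for readability: it forces a dense array, whereas the function above leaves the default in place and may return a sparse matrix.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: one nominal column, one numeric column
toy = pd.DataFrame({
    "WindGustDir": ["E", "N", "W", "E"],
    "MinTemp": [3.5, 7.4, 9.1, 12.9],
})

ct = ColumnTransformer(
    [("encoder", OneHotEncoder(), [0])],  # encode column 0 only
    remainder="passthrough",              # keep MinTemp unchanged
    sparse_threshold=0,                   # always return a dense array
)
encoded = ct.fit_transform(toy)

print(encoded)
# First row: [1.0, 0.0, 0.0, 3.5] -> indicator columns for E, N, W, then MinTemp
```

The encoded columns come first and the passthrough columns follow, so column positions change after this step.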

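Threshold-Based Encoding Selection

To streamline encoding, a function can choose the encoding method for each column based on its number of categories. In the function below, columns with exactly two categories or more than threshold categories are label encoded, and the remaining columns are one-hot encoded.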

def EncodingSelection(X, threshold=10):
    # Select string columns
    string_cols = list(np.where((X.dtypes == object))[0])
    one_hot_encoding_indices = []
    
    # Decide on encoding based on the number of unique categories
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)
                
    # Apply One-Hot Encoding to selected columns
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Apply encoding selection
X = EncodingSelection(X)
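
To see which branch each column would take without transforming anything, the decision rule can be isolated in a small helper. `choose_encoding` and the toy frame are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def choose_encoding(df, threshold=10):
    """Map each non-numeric column to 'label' or 'one-hot' using the rule above."""
    plan = {}
    for col in df.select_dtypes(exclude="number").columns:
        # Count distinct values, treating NaN as a value like pd.unique does
        n_unique = df[col].nunique(dropna=False)
        if n_unique == 2 or n_unique > threshold:
            plan[col] = "label"
        else:
            plan[col] = "one-hot"
    return plan

# Hypothetical toy frame: 12 locations, 4 wind directions, 2 rain flags
toy = pd.DataFrame({
    "Location": [f"Town{i}" for i in range(12)],
    "WindGustDir": ["E", "N", "W", "S"] * 3,
    "RainToday": ["No", "Yes"] * 6,
    "MinTemp": [float(i) for i in range(12)],
})

plan = choose_encoding(toy)
print(plan)
# {'Location': 'label', 'WindGustDir': 'one-hot', 'RainToday': 'label'}
```

Label encoding a high-cardinality nominal column such as Location keeps the matrix narrow, at the cost of giving the model an arbitrary integer order for the categories.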

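5. Feature Selection

Feature selection means choosing the features most relevant to the model. Methods such as correlation analysis, heatmaps, and SelectKBest can help identify influential features.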
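
The article does not show feature selection code, so the following is only a sketch of how SelectKBest could be used. It runs on synthetic data in which just two of five columns drive the target; the scoring function `f_regression` and `k=2` are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only columns 0 and 3 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Keep the two features with the highest univariate F-scores
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X_demo, y_demo)

selected_idx = selector.get_support(indices=True)
print(selected_idx)      # [0 3]
print(X_selected.shape)  # (200, 2)
```

As with scalers, the selector should be fitted on the training split only and then applied to the test split with `transform`.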

6. Train-Test Split

Splitting the dataset into training and test sets is essential for evaluating how the model performs on unseen data.

from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

print(X_train.shape)
# Output: (164, 199)
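
For a quick check of how `test_size` translates into row counts, here is the same call on a hypothetical ten-row array:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same split.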

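7. Feature Scaling

Feature scaling puts all features on a comparable scale so that features with large numeric ranges do not dominate the result. It also helps gradient descent converge faster.

Standardization

Standardization rescales each feature to zero mean and unit standard deviation. The code below passes with_mean=False, which skips the centering step (centering is not supported for sparse matrices, which one-hot encoding can produce), so each feature is only divided by its standard deviation.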

from sklearn import preprocessing

sc = preprocessing.StandardScaler(with_mean=False)
sc.fit(X_train)

X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

print(X_train.shape)
# Output: (164, 199)
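
The difference between full standardization and the `with_mean=False` variant used above can be checked on a small dense array. The array is hypothetical; the point is only that both variants give unit standard deviation, while just the default one centers the data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])

centered = StandardScaler().fit_transform(X_demo)                    # subtract mean, divide by std
scaled_only = StandardScaler(with_mean=False).fit_transform(X_demo)  # divide by std only

print(centered.mean(axis=0), centered.std(axis=0))        # means ~0, stds 1
print(scaled_only.mean(axis=0), scaled_only.std(axis=0))  # means non-zero, stds 1
```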

Normalization

Normalization rescales the data to a fixed range, usually 0 to 1.

from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
X_train = min_max_scaler.fit_transform(X_train)
X_test = min_max_scaler.transform(X_test)
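
A worked example of the min-max formula, (x - min) / (max - min), on hypothetical values. Because the minimum and maximum are learned from the training split only, test values outside that range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature splits
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # min=10, max=30 learned here
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [0.25 1.5 ]
```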

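8. Building Regression Models

Once preprocessing is complete, various regression models can be built and evaluated. Implementations of several popular regression algorithms follow.

Linear Regression

A foundational algorithm that models the relationship between a dependent variable and one or more independent variables.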

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
score = r2_score(y_test, y_pred)
print(f"Linear Regression R2 Score: {score}")
# Output: 0.09741670577134398

Polynomial Regression

Extends the linear model with polynomial terms to capture nonlinear relationships.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Initialize polynomial features and linear regression
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
model = LinearRegression()

# Train and predict
model.fit(X_train_poly, y_train)
X_test_poly = poly.transform(X_test)
y_pred = model.predict(X_test_poly)
score = r2_score(y_test, y_pred)
print(f"Polynomial Regression R2 Score: {score}")
# Output: -0.4531422286977287

Note: A negative R² score means the model fits the test data worse than always predicting the mean of the target.
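
It can help to look at what a degree-2 expansion does to the feature matrix. The sketch below expands one hypothetical row and then counts the columns a degree-2 expansion would produce from the 199 features reported earlier. With far more columns than the 164 training rows, overfitting is one plausible explanation for the negative score, though the article does not investigate the cause.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical row with two features a=2, b=3
poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(np.array([[2.0, 3.0]]))
print(expanded)  # [[1. 2. 3. 4. 6. 9.]] -> 1, a, b, a^2, a*b, b^2

# Column count for n features at degree 2 (bias included): C(n + 2, 2)
n_expanded = comb(199 + 2, 2)
print(n_expanded)  # 20100
```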

Decision Tree Regressor

A nonlinear model that partitions the data into subsets based on feature values.

from sklearn.tree import DecisionTreeRegressor

# Initialize and train the model
model = DecisionTreeRegressor(max_depth=4)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
score = r2_score(y_test, y_pred)
print(f"Decision Tree Regressor R2 Score: {score}")
# Output: 0.883961900453219

Random Forest Regressor

An ensemble method that combines multiple decision trees to improve performance and reduce overfitting.

from sklearn.ensemble import RandomForestRegressor

# Initialize and train the model
model = RandomForestRegressor(n_estimators=25, random_state=10)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
score = r2_score(y_test, y_pred)
print(f"Random Forest Regressor R2 Score: {score}")
# Output: 0.9107611439295349

AdaBoost Regressor

Another ensemble technique that combines weak learners into a strong predictor.

from sklearn.ensemble import AdaBoostRegressor

# Initialize and train the model
model = AdaBoostRegressor(random_state=0, n_estimators=100)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
score = r2_score(y_test, y_pred)
print(f"AdaBoost Regressor R2 Score: {score}")
# Output: 0.8806696893560713

XGBoost Regressor

A powerful gradient boosting framework optimized for speed and performance.

import xgboost as xgb

# Initialize and train the model
model = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3,
    learning_rate=0.05
)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
score = r2_score(y_test, y_pred)
print(f"XGBoost Regressor R2 Score: {score}")
# Output: 0.8947431439987505

Support Vector Machine (SVM) Regressor

SVMs can be adapted to regression tasks and can capture complex relationships.

from sklearn.svm import SVR

# Initialize and train the model
model = SVR()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
score = r2_score(y_test, y_pred)
print(f"SVM Regressor R2 Score: {score}")
# Output: -0.02713944090388254

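Note: A negative R² score means the model performs worse than a horizontal line at the mean of the target.

9. Model Evaluation

The R² score is a common metric for evaluating regression models. It represents the proportion of variance in the dependent variable that is predictable from the independent variables.

Positive R²: the model explains part of the variance.
Negative R²: the model fails to explain the variance and performs worse than a simple model that always predicts the mean.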
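
The two cases can be verified by computing R² = 1 - SS_res / SS_tot by hand and comparing it with `r2_score`. The target values and predictions below are hypothetical, and `r2_manual` is a helper written for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y, y_hat):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_mean_pred = np.full(4, y_true.mean())      # always predict the mean
y_bad_pred = np.array([4.0, 3.0, 2.0, 1.0])  # predictions in reversed order

print(r2_manual(y_true, y_mean_pred))  # 0.0
print(r2_manual(y_true, y_bad_pred))   # -3.0  (SS_res = 20, SS_tot = 5)
print(r2_score(y_true, y_bad_pred))    # -3.0
```

A mean-only predictor scores exactly 0, which is why any negative value signals a model that does worse than that baseline.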

In this guide, the Random Forest Regressor achieved the highest R² score, about 0.91, indicating strong performance on the test data.
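
When several models are compared on the same split, collecting the scores in one loop keeps the comparison consistent. The sketch below does this on synthetic data from `make_regression`, not on the weather dataset, and the three models and their settings are only examples. Because the synthetic target is linear by construction, the ranking here need not match the scores reported in this guide.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (assumption: stands in for the preprocessed data)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=1)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=25, random_state=10),
}

# Fit every model on the same training split and score it on the same test split
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))

best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Best:", best_name)
```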

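10. Conclusion

Effective data preprocessing lays the foundation for robust machine learning models. Carefully handling missing data, choosing appropriate encoding techniques, and scaling features improve data quality and, in turn, model performance. Among the regression models explored, the tree-based ensembles Random Forest (about 0.91) and XGBoost (about 0.89) scored highest on the weather dataset, with AdaBoost and a single Decision Tree close behind at about 0.88. Always evaluate your models thoroughly and choose the one that best fits the goals of your project.

Apply these preprocessing and modeling strategies to get the most out of your dataset and build effective machine learning solutions.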
