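Mastering Label Encoding in Machine Learning: A Comprehensive Guide

Introduction to Label Encoding

In machine learning, label encoding is a technique for converting categorical data into a numerical format. Because many algorithms cannot work with categorical values directly, the categories must first be encoded as numbers. Label encoding assigns each category a unique integer so that the model can interpret and process the data efficiently.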

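Key concepts:

Categorical data: variables that represent categories, such as "yes/no" or "red/blue/green".
Numerical encoding: the process of converting categorical data into numeric values.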
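
To make the idea concrete, here is a minimal sketch that builds the category-to-integer mapping by hand. The `colors` list and the choice of sorted order are illustrative assumptions, not values from the dataset used in this guide.

```python
# Toy category values (hypothetical, for illustration only)
colors = ["red", "blue", "green", "blue", "red"]

# Assign each distinct category an integer, using sorted order
mapping = {category: code for code, category in enumerate(sorted(set(colors)))}

# Replace every value with its integer code
encoded = [mapping[c] for c in colors]

print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)  # [2, 0, 1, 0, 2]
```

The integers carry no meaning beyond identity: "red" is not "greater" than "blue", even though 2 > 0.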

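Understanding the Dataset

In this guide we use the Weather AUS dataset from Kaggle. It records a range of weather-related attributes across different Australian locations and dates.

Dataset overview:

URL: Weather AUS dataset
Features: date, location, temperature measurements, rainfall, wind details, humidity, pressure, cloud cover, and more.
Target variable: RainTomorrow, which indicates whether it will rain the next day.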

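Handling Missing Data

Real-world datasets often contain missing values, which can hurt the performance of machine learning models. Handling them properly is essential for building robust models.

Numerical Data

Strategy: fill missing values with the column mean.

Implementation: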

import numpy as np
from sklearn.impute import SimpleImputer

# Initialize the imputer with mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Identify numerical columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Fit and transform the data
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
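
As a quick check of what mean imputation does, the sketch below runs the same `SimpleImputer` configuration on a tiny hand-made frame. The frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing MinTemp value
toy = pd.DataFrame({
    "MinTemp": [10.0, np.nan, 14.0],
    "Rainfall": [0.0, 2.0, 4.0],
})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(toy)  # returns a NumPy array

print(filled[1, 0])  # 12.0, the mean of 10.0 and 14.0
```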

Categorical Data

Strategy: fill missing values with the most frequent category.

Implementation:

# Identify string columns
string_cols = list(np.where((X.dtypes == object))[0])

# Initialize the imputer with the most frequent strategy
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit and transform the data
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
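
The same idea on a toy string column, as a rough sketch. The column name and values here are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical wind-direction column with one missing entry
toy = pd.DataFrame({"WindGustDir": ["N", "N", np.nan, "SW"]}, dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(toy)

# The gap is filled with "N", the most common value in the column
print(filled.ravel().tolist())  # ['N', 'N', 'N', 'SW']
```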

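Encoding Categorical Variables

Once missing data has been handled, the next step is to encode the categorical variables so they are ready for machine learning algorithms.

One-Hot Encoding

One-hot encoding replaces a categorical column with one binary indicator column per category, a format that machine learning algorithms can consume without reading any ordering into the categories.

Implementation: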

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)], 
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)
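
To see what the transformer produces, here is a small sketch on a two-column toy frame. The frame is hypothetical, and `sparse_threshold=0` is set only so that the result is a dense array that is easy to print; the function above does not set it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column, one numeric column
toy = pd.DataFrame({
    "Location": ["Sydney", "Perth", "Sydney"],
    "Rainfall": [1.2, 0.0, 3.4],
})

transformer = ColumnTransformer(
    [("encoder", OneHotEncoder(), [0])],  # one-hot encode column 0
    remainder="passthrough",              # keep Rainfall unchanged
    sparse_threshold=0,                   # always return a dense array
)
encoded = transformer.fit_transform(toy)

# Columns: Location=Perth, Location=Sydney, Rainfall
print(encoded.tolist())
# [[0.0, 1.0, 1.2], [1.0, 0.0, 0.0], [0.0, 1.0, 3.4]]
```

The encoded columns come first and the passed-through columns follow, so column positions in the output differ from those in the input.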

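Label Encoding

Label encoding converts each value in a categorical column to a unique integer. It is particularly useful for binary categorical variables.

Implementation: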

from sklearn import preprocessing

def LabelEncoderMethod(series):
    le = preprocessing.LabelEncoder()
    le.fit(series)
    return le.transform(series)
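
A short sketch of how the fitted encoder can be inspected and reversed. The `labels` list is a made-up stand-in for a Yes/No column.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column
labels = ["No", "Yes", "No", "No"]

le = LabelEncoder()
codes = le.fit_transform(labels)

print(codes.tolist())                         # [0, 1, 0, 0]
print(le.classes_.tolist())                   # ['No', 'Yes']
print(le.inverse_transform([1, 0]).tolist())  # ['Yes', 'No']
```

Keeping the fitted encoder around, rather than discarding it inside a helper, makes it possible to map predictions back to the original labels later.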

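Choosing the Right Encoding Technique

The choice between one-hot encoding and label encoding depends on the nature of the categorical data.

Guidelines:

Binary categories: label encoding is sufficient.
Multiple categories: one-hot encoding is preferable, because it avoids introducing an ordinal relationship.

Note that the implementation also label-encodes columns with more than threshold distinct values, presumably to keep the number of generated columns manageable.

Implementation: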

import numpy as np
import pandas as pd

def EncodingSelection(X, threshold=10):
    # Select string columns
    string_cols = list(np.where((X.dtypes == object))[0])
    one_hot_encoding_indices = []

    # Decide encoding method based on unique values
    for col in string_cols:
        unique_length = len(pd.unique(X[X.columns[col]]))
        # Label-encode binary and high-cardinality columns; one-hot encode the rest
        if unique_length == 2 or unique_length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    # Apply One-Hot Encoding
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

X = EncodingSelection(X)
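
The decision rule can be isolated and tried on a toy frame, as sketched below. The helper `choose_encoding`, the column names, and the lowered threshold of 4 are all hypothetical and exist only for this example.

```python
import pandas as pd

def choose_encoding(n_unique, threshold=10):
    """Mirror the rule above: label-encode binary or high-cardinality columns."""
    if n_unique == 2 or n_unique > threshold:
        return "label"
    return "one-hot"

# Hypothetical columns with 2, 3, and 5 distinct values
toy = pd.DataFrame({
    "RainToday": ["No", "Yes", "No", "Yes", "No"],
    "Season": ["Summer", "Winter", "Autumn", "Summer", "Winter"],
    "Station": ["A", "B", "C", "D", "E"],
})

plan = {col: choose_encoding(toy[col].nunique(), threshold=4) for col in toy.columns}
print(plan)
# {'RainToday': 'label', 'Season': 'one-hot', 'Station': 'label'}
```

Label-encoding a high-cardinality column does impose an arbitrary order on it, so this rule trades some modelling fidelity for a smaller feature matrix.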

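Feature Selection

Selecting the most relevant features can improve model performance and reduce computational cost.

Technique: SelectKBest with the chi-squared test (chi2) as the scoring function.

Implementation: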

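import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

# Number of features to keep. The original snippet leaves K_features undefined;
# 12 is assumed here to match the 12 columns reported after the split.
K_features = 12

# Initialize SelectKBest
kbest = SelectKBest(score_func=chi2, k=K_features)

# Initialize Min-Max Scaler
MMS = preprocessing.MinMaxScaler()

# Scale features to [0, 1] (chi2 requires non-negative values)
x_temp = MMS.fit_transform(X)

# Fit SelectKBest
x_temp = kbest.fit(x_temp, y)

# Rank features by score: keep the top K_features, delete the rest
best_features = np.argsort(x_temp.scores_)[-K_features:]
features_to_delete = np.argsort(x_temp.scores_)[:-K_features]

# Reduce dataset (assumes X is a dense NumPy array)
X = np.delete(X, features_to_delete, axis=1)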
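
As an alternative to ranking the scores by hand, the fitted selector can report which columns it kept. The sketch below uses synthetic data (one informative column among three noise columns), so every array in it is an assumption made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic data: column 1 tracks the label, the other columns are noise
y_toy = rng.integers(0, 2, size=200)
informative = y_toy + rng.normal(0, 0.1, size=200)
noise = rng.normal(0, 1, size=(200, 3))
X_toy = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

# chi2 requires non-negative inputs, hence the min-max scaling
X_scaled = MinMaxScaler().fit_transform(X_toy)

selector = SelectKBest(score_func=chi2, k=1).fit(X_scaled, y_toy)
kept = selector.get_support(indices=True)  # indices of the selected columns
X_reduced = X_toy[:, kept]

print(kept.tolist())    # [1]
print(X_reduced.shape)  # (200, 1)
```

Indexing with `kept` selects from the unscaled matrix, which matches the approach above of scoring on scaled data while keeping the original values.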

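Building and Evaluating a KNN Model

With the dataset preprocessed and the features selected, we move on to building and evaluating a K-nearest neighbors (KNN) classifier.

Train-Test Split

Splitting the dataset ensures that the model is evaluated on data it has not seen, which gives a measure of how well it generalizes.

Implementation: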

from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)

print(X_train.shape)  # Output: (113754, 12)
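
The effect of `test_size` and `random_state` is easy to verify on a small array. The arrays below are placeholders, not the weather data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Reusing the same random_state reproduces the same partition
X_tr2, _, _, _ = train_test_split(X_toy, y_toy, test_size=0.20, random_state=1)
print(np.array_equal(X_tr, X_tr2))  # True
```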

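Feature Scaling

Feature scaling puts the features on a comparable scale, which is essential for algorithms such as KNN that are sensitive to the magnitude of the data.

Implementation: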

from sklearn import preprocessing

# Initialize StandardScaler
sc = preprocessing.StandardScaler(with_mean=False)

# Fit and transform the training data
sc.fit(X_train)
X_train = sc.transform(X_train)

# Transform the test data
X_test = sc.transform(X_test)

print(X_train.shape)  # Output: (113754, 12)
print(X_test.shape)   # Output: (28439, 12)
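
A minimal sketch of what `StandardScaler(with_mean=False)` computes, using made-up numbers: the scaler learns each column's standard deviation from the training data and divides by it, without centering, and the test data is scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows with very different column magnitudes
X_tr = np.array([[1.0, 100.0], [3.0, 300.0]])
X_te = np.array([[2.0, 200.0]])

scaler = StandardScaler(with_mean=False).fit(X_tr)

print(scaler.scale_.tolist())            # [1.0, 100.0]
print(scaler.transform(X_te).tolist())   # [[2.0, 2.0]]
```

After scaling, both columns contribute comparably to the distances that KNN relies on.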

Model Training and Evaluation

Implementation:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize KNN classifier
knnClassifier = KNeighborsClassifier(n_neighbors=3)

# Train the model
knnClassifier.fit(X_train, y_train)

# Predict on test data
y_pred = knnClassifier.predict(X_test)

# Evaluate accuracy (accuracy_score expects y_true first, then y_pred)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

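Output:

Accuracy: 0.8258

An accuracy of about 82.58% suggests that the model does a reasonable job of predicting next-day rain from the provided features. Accuracy alone can flatter a model when one class dominates, so it is worth comparing the figure against a majority-class baseline.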
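
One way to put an accuracy figure in context is to compare it with the score of always predicting the most common class. The arrays below are invented to illustrate the comparison and are not results from the weather model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical imbalanced labels: eight "no rain" (0), two "rain" (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

acc = accuracy_score(y_true, y_hat)
baseline = accuracy_score(y_true, np.zeros_like(y_true))  # always predict 0
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted

print(acc)          # 0.8
print(baseline)     # 0.8
print(cm.tolist())  # [[7, 1], [1, 1]]
```

Here the model matches the baseline's accuracy while catching only one of the two positive cases, which the confusion matrix makes visible.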

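Visualizing Decision Regions

Visualizing decision regions offers insight into how the KNN model makes its predictions. The technique is most illustrative when there are only a few features; the sample snippet that follows plots the first two.

Implementation: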

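# Install mlxtend if not already installed
# pip install mlxtend

from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# plot_decision_regions needs a classifier fitted on exactly the plotted features,
# so fit a separate KNN on the first two features rather than reusing knnClassifier.
# Assumes X_train is a dense NumPy array and y_train is a NumPy array of integer labels.
X_vis = X_train[:, :2]
knn_vis = KNeighborsClassifier(n_neighbors=3)
knn_vis.fit(X_vis, y_train)

# Plotting decision regions (Example with first two features)
plot_decision_regions(X_vis, y_train, clf=knn_vis, legend=2)

# Adding axis labels
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('KNN Decision Regions')
plt.show()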

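Note: the visualization works best with two features. For datasets with more features, consider applying a dimensionality-reduction technique such as principal component analysis (PCA) before plotting.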
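
A rough sketch of the PCA route, on synthetic data: project the features onto two principal components, then fit a KNN on the projection so that it can be plotted. The arrays and the names `X_2d` and `knn_2d` are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for a scaled feature matrix with 5 features
X_toy = rng.normal(size=(100, 5))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Reduce to two components, then train a classifier on the 2-D projection
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_toy)
knn_2d = KNeighborsClassifier(n_neighbors=3).fit(X_2d, y_toy)

print(X_2d.shape)  # (100, 2)
```

`X_2d`, `y_toy`, and `knn_2d` can then be passed to `plot_decision_regions` in place of the two raw features. The resulting regions describe the classifier trained on the projection, not the full-feature model.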

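Conclusion

Label encoding is a foundational technique in the data preprocessing toolkit that lets machine learning models work with categorical data. By systematically handling missing data, selecting relevant features, and encoding categorical variables appropriately, you lay a solid foundation for building robust predictive models. Adopting these practices in your workflow improves model performance and helps keep machine learning projects scalable and efficient.

Key takeaways:

Label encoding converts categorical data into a numerical format, a necessary step for machine learning algorithms.
Handling missing data properly helps prevent biased model results.
The encoding technique should be chosen according to the nature and number of the categories.
Feature selection improves model performance by eliminating irrelevant or redundant features.
The effectiveness of a KNN model depends on correct preprocessing and feature scaling.

Master these preprocessing techniques as you begin your machine learning journey, and you will be well placed to build models that are both accurate and reliable.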

Happy coding!
