S18L08 – Brief Discussion

A Comprehensive Guide to Data Preprocessing for Classification Problems in Machine Learning

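Table of Contents

    Introduction to Classification Problems
    Importing and Exploring the Data
    Handling Missing Data
        A. Numerical Data
        B. Categorical Data
    Encoding Categorical Variables
        A. Label Encoding
        B. One-Hot Encoding
    Feature Selection
    Train-Test Split
    Feature Scaling
    Conclusion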

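Introduction to Classification Problems

Classification is a supervised learning technique for predicting categorical labels: a model learns from historical, labeled data to assign each new input to one of a set of predefined classes. Classification models range from simple algorithms such as logistic regression to more complex ones such as random forests and neural networks. How well they perform depends not only on the algorithm chosen but also, to a large extent, on how the data is prepared and preprocessed.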

Importing and Exploring the Data

Before preprocessing, import the dataset and get a sense of what it contains. This guide uses the WeatherAUS dataset from Kaggle, which contains daily weather observations from across Australia.

# Importing necessary libraries
import pandas as pd
import seaborn as sns

# Loading the dataset
data = pd.read_csv('weatherAUS.csv')

# Displaying the last five rows of the dataset
data.tail()

Output:

           Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine  WindGustDir  WindGustSpeed WindDir9am  ...  Humidity3pm  Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm  RainToday  RISK_MM  RainTomorrow
142188 2017-06-20    Uluru      3.5     21.8       0.0          NaN       NaN           E          31.0        ESE  ...        27.0       1024.7       1021.2       NaN       NaN      9.4     20.9         No      0.0            No
142189 2017-06-21    Uluru      2.8     23.4       0.0          NaN       NaN           E          31.0          SE  ...        24.0       1024.6       1020.3       NaN       NaN     10.1     22.4         No      0.0            No
142190 2017-06-22    Uluru      3.6     25.3       0.0          NaN       NaN         NNW          22.0          SE  ...        21.0       1023.5       1019.1       NaN       NaN     10.9     24.5         No      0.0            No
142191 2017-06-23    Uluru      5.4     26.9       0.0          NaN       NaN           N          37.0          SE  ...        24.0       1021.0       1016.8       NaN       NaN     12.5     26.1         No      0.0            No
142192 2017-06-24    Uluru      7.8     27.0       0.0          NaN       NaN          SE          28.0         SSE  ...        24.0       1019.4       1016.5         3.0         2.0     15.1     26.0         No      0.0            No

[5 rows x 24 columns]

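The dataset includes features such as temperature, rainfall, humidity, and wind speed, which serve as inputs for predicting whether it will rain tomorrow (RainTomorrow).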
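
Before imputing anything, it helps to see how much is missing in each column and which columns are numeric. The sketch below is only an illustration: it uses a tiny hand-made frame (the values are made up and are not taken from the real WeatherAUS file) and reports the share of missing values per column alongside a numeric/non-numeric grouping.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "MinTemp": [3.5, 2.8, np.nan, 5.4],
    "Sunshine": [np.nan, np.nan, np.nan, 7.1],
    "WindGustDir": ["E", None, "NNW", "N"],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Column names grouped by kind: numeric vs. everything else
numeric_names = toy.select_dtypes(include="number").columns.tolist()
other_names = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_share)
print(numeric_names, other_names)
```

The same two calls, `isna().mean()` and `select_dtypes`, can be run on the full `data` frame to decide which columns need which imputation strategy.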

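Handling Missing Data

Real-world datasets often contain missing or incomplete values, and handling these gaps is essential for a reliable model. We treat missing data in two groups: numerical and categorical.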

A. Numerical Data

For numerical features, a common strategy is to replace missing values with a summary statistic such as the mean, median, or mode. Here we use the mean.

import numpy as np
from sklearn.impute import SimpleImputer

# Assumption: the original does not show how X and y are created. Here X is
# every column except the target and y is the RainTomorrow column, which
# matches the 23-column shape reported later in this guide.
X = data.drop(columns='RainTomorrow')
y = data['RainTomorrow']

# Identifying numerical columns (by position)
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Initializing the imputer with mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fitting the imputer on numerical columns
imp_mean.fit(X.iloc[:, numerical_cols])

# Transforming the data
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
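
To make the effect of mean imputation concrete, here is a minimal sketch on a toy matrix (the numbers are invented for illustration). It shows that each gap is filled with the mean of the observed values in its own column, and that the fitted imputer stores those fill values in `statistics_`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric matrix with gaps (values are illustrative only)
toy = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(toy)

# statistics_ holds the per-column fill values learned during fit
print(imputer.statistics_)  # column means: 2.0 and 15.0
print(filled)
```

Swapping `strategy="mean"` for `strategy="median"` is a common alternative when a column contains outliers, since the median is less affected by extreme values.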

B. Categorical Data

For categorical features, replacing missing values with the most frequent value (the mode) is a reasonable default.

# Identifying categorical (string) columns by position.
# np.object was removed in NumPy 1.24; the built-in object works the same here.
string_cols = list(np.where(X.dtypes == object)[0])

# Initializing the imputer with most frequent strategy
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fitting the imputer on categorical columns
imp_mode.fit(X.iloc[:, string_cols])

# Transforming the data
X.iloc[:, string_cols] = imp_mode.transform(X.iloc[:, string_cols])
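
As a small illustration of the most-frequent strategy, the sketch below imputes a toy frame of string columns. The column names are borrowed from the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical frame with gaps (values are illustrative only)
toy = pd.DataFrame(
    {
        "WindGustDir": ["E", "E", np.nan, "N"],
        "RainToday": ["No", np.nan, "No", "Yes"],
    },
    dtype=object,
)

imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# fit_transform returns a NumPy array, so rebuild the frame with the column names
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

Each gap takes the mode of its own column: the missing wind direction becomes "E" and the missing RainToday value becomes "No".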

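Encoding Categorical Variables

Most machine learning models require numerical input, so categorical variables must be converted to numbers. Two common ways to do this are label encoding and one-hot encoding.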

A. Label Encoding

Label encoding assigns a distinct integer to each unique category in a feature. It is simple, but it can imply an ordering between categories where none exists.

from sklearn import preprocessing

def LabelEncoderMethod(series):
    # Fit a fresh encoder and return the integer codes for the series
    le = preprocessing.LabelEncoder()
    return le.fit_transform(series)

# Encoding the target variable
y = LabelEncoderMethod(y)
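
The helper above discards the fitted encoder, so the mapping from integers back to labels is lost. As a hedged sketch of one way to keep it, the example below holds on to the `LabelEncoder` object and uses `classes_` and `inverse_transform`. The label values are toy data.

```python
from sklearn.preprocessing import LabelEncoder

# Toy target labels (illustrative only)
labels = ["No", "Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ lists the categories in sorted order; each position is that category's code
print(encoder.classes_.tolist())  # ['No', 'Yes']
print(codes.tolist())             # [0, 1, 0, 0, 1]

# Codes can be mapped back to the original labels
restored = encoder.inverse_transform(codes)
```

Keeping the encoder around is useful later, for example to turn a model's integer predictions back into "Yes"/"No".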

B. One-Hot Encoding

One-hot encoding creates a binary column for each category, so no ordering is implied and each category is treated independently.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    # One-hot encode the columns at the given positions; pass the rest through unchanged
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), indices)], remainder='passthrough')
    return columnTransformer.fit_transform(data)
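
To show what the transformer produces, here is a small sketch on a toy frame with one categorical and one numeric column (both the column names and the values are made up). The encoded columns come first, one per category in sorted order, followed by the passed-through column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Column 0 is categorical, column 1 is numeric (values are illustrative only)
toy = pd.DataFrame({
    "WindDir": ["N", "S", "N"],
    "Rainfall": [1.0, 2.0, 3.0],
})

transformer = ColumnTransformer(
    [("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
encoded = transformer.fit_transform(toy)

# The result can be sparse depending on its density, so densify before printing
dense = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)
print(dense)
```

Here "N" and "S" become two binary columns and Rainfall is appended unchanged, so the two-column input turns into a three-column output.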

Feature Encoding Selection

Choose between label encoding and one-hot encoding based on the number of unique categories: columns with exactly two categories, or with more than threshold categories, are label-encoded, and the remaining columns are one-hot encoded.

def EncodingSelection(X, threshold=10):
    # Step 1: Select the string columns
    string_cols = list(np.where(X.dtypes == object)[0])
    one_hot_encoding_indices = []

    # Step 2: Apply Label Encoding or mark for One-Hot Encoding based on category count
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    # Step 3: Apply One-Hot Encoding where necessary
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Applying encoding selection
X = EncodingSelection(X)

# Checking the shape after encoding
print(X.shape)

Output:

(142193, 23)

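Selecting the encoding by category count keeps the feature space compact: high-cardinality columns are label-encoded instead of being expanded into many one-hot columns.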
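
The decision rule can be inspected on its own, without transforming anything. The sketch below uses a hypothetical helper, `plan_encoding` (not part of the original code), that applies the same rule to a toy frame and reports which encoding each string column would receive. A threshold of 3 is used only so the tiny example exercises both branches.

```python
import pandas as pd

def plan_encoding(frame, threshold=10):
    """Return {column: 'label' or 'one-hot'} for the non-numeric columns."""
    plan = {}
    for name in frame.select_dtypes(exclude="number").columns:
        n_unique = frame[name].nunique(dropna=False)
        # Same rule as above: binary or high-cardinality columns get label encoding
        if n_unique == 2 or n_unique > threshold:
            plan[name] = "label"
        else:
            plan[name] = "one-hot"
    return plan

# Toy frame (values are illustrative only)
toy = pd.DataFrame({
    "RainToday": ["No", "Yes", "No", "Yes"],  # 2 categories
    "WindDir": ["N", "S", "E", "N"],          # 3 categories
    "Station": ["a", "b", "c", "d"],          # 4 categories
    "MinTemp": [3.5, 2.8, 3.6, 5.4],          # numeric, ignored
})

plan = plan_encoding(toy, threshold=3)
print(plan)
```

With this threshold, RainToday and Station are marked for label encoding and WindDir for one-hot encoding.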

Feature Selection

Not all features contribute equally to the prediction task. Feature selection identifies and keeps the most informative features, which can improve model performance and reduce computational cost.

from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

# Number of features to keep
n_keep = 13

# Initializing SelectKBest with chi-squared statistic
kbest = SelectKBest(score_func=chi2, k=n_keep)

# Scaling features to [0, 1] with MinMaxScaler, because chi2 requires non-negative values
MMS = preprocessing.MinMaxScaler()
x_temp = MMS.fit_transform(X)

# Fitting SelectKBest to compute one score per feature
kbest.fit(x_temp, y)

# Ranking features by score (ascending) and splitting them into keep / drop
ranking = np.argsort(kbest.scores_)
best_features = ranking[-n_keep:]
features_to_delete = ranking[:-n_keep]

# Dropping the least important features
X = np.delete(X, features_to_delete, axis=1)

# Verifying the new shape
print(X.shape)

Output:

(142193, 13)

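This reduces the feature set from 23 to 13 columns, keeping the features with the highest chi-squared scores for our classification task.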
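
As an alternative to ranking the scores by hand, `SelectKBest` can perform the selection itself through `fit_transform`, and `get_support` reports which columns were kept. The sketch below is a toy illustration on synthetic data (not the weather dataset): one feature is built to track the label and two are pure noise.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_toy = np.array([0, 1] * 50)

# Feature 0 tracks the label closely; features 1 and 2 are noise.
# chi2 requires non-negative inputs, so all values lie in [0, 1].
X_toy = np.column_stack([
    y_toy * 0.9 + rng.random(100) * 0.1,
    rng.random(100),
    rng.random(100),
])

selector = SelectKBest(score_func=chi2, k=1)
X_selected = selector.fit_transform(X_toy, y_toy)

# Indices of the columns that survived the selection
kept = selector.get_support(indices=True)
print(kept, X_selected.shape)
```

The informative feature is the one retained, and the indices in `kept` make it straightforward to map the result back to the original column names.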

Train-Test Split

To evaluate a classification model on data it has not seen, we split the dataset into training and test subsets.

from sklearn.model_selection import train_test_split

# Splitting the data: 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Displaying the shape of training data
print(X_train.shape)

Output:

(113754, 13)
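
When the classes are imbalanced, a plain random split can leave the two subsets with different class ratios. One option, which is not used in the original code, is the `stratify` argument of `train_test_split`. The sketch below demonstrates it on toy labels with an assumed 80/20 imbalance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 negatives, 20 positives (illustrative only)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1, stratify=y_toy
)

# With stratify, both subsets keep the original 80/20 class ratio
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Both subsets end up with 20% positives, so evaluation metrics on the test set reflect the same class balance the model was trained on.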

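Feature Scaling

Feature scaling puts all features on a comparable scale so that none dominates simply because of its units or magnitude. This matters most for algorithms that are sensitive to feature magnitudes, such as support vector machines and k-nearest neighbors.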

Standardization

Standardization rescales each feature to have zero mean and unit standard deviation.

from sklearn import preprocessing

# Initializing the StandardScaler (with_mean=False scales to unit variance without centering)
sc = preprocessing.StandardScaler(with_mean=False)

# Fitting the scaler on training data only
sc.fit(X_train)

# Transforming both training and testing data with the training statistics
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

# Verifying the shape after scaling
print(X_train.shape)
print(X_test.shape)

Output:

(113754, 13)
(28439, 13)

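Note: with_mean=False skips the centering step, so features are scaled to unit variance but not shifted to zero mean. This avoids problems with the sparse matrices that one-hot encoding can produce, since centering a sparse matrix would make it dense.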
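
The difference between the two settings is easy to see on a single toy column (values invented for illustration). In this sketch both scalers are fitted on the training values only, and the same statistics are then reused for the test value.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test columns (illustrative only)
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

# Statistics come from the training data only, then get reused on the test data
centered = StandardScaler().fit(train)
uncentered = StandardScaler(with_mean=False).fit(train)

train_centered = centered.transform(train)
train_uncentered = uncentered.transform(train)
test_centered = centered.transform(test)

print(train_centered.ravel())    # zero mean, unit variance
print(train_uncentered.ravel())  # unit variance, mean not shifted
print(test_centered.ravel())
```

With the default settings the training column is centered on zero, while with `with_mean=False` each value is only divided by the training standard deviation, so the spread is the same but the mean stays positive.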

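Conclusion

Data preprocessing is a key step in building robust and accurate classification models. By methodically handling missing data, encoding categorical variables, selecting relevant features, and scaling them, we lay a solid foundation for any machine learning model. This guide has walked through a hands-on approach using Python and its libraries so that your data is ready for model training and evaluation. The adage "garbage in, garbage out" applies to machine learning as much as anywhere, so time invested in preprocessing pays off in model performance.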

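Keywords: classification problems, data preprocessing, machine learning, data cleaning, feature selection, label encoding, one-hot encoding, feature scaling, Python, Pandas, Scikit-learn, classification models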