S05L07 – Assignment Solution and One-Hot Encoding – Part 01

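A Comprehensive Guide to Data Preprocessing: One-Hot Encoding and Missing Data Handling in Python

In data science and machine learning, data preprocessing is a critical step that can significantly affect a model's performance and accuracy. This guide covers essential preprocessing techniques, including one-hot encoding, missing data handling, and feature selection, using the Python libraries pandas and scikit-learn. Each concept is illustrated with a practical example based on the Australian weather dataset.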

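Table of Contents

    Introduction
    Understanding the Dataset
    Handling Missing Data
        Numerical Data
        Categorical Data
    Feature Selection
    Label Encoding
    One-Hot Encoding
    Handling Imbalanced Data
        Oversampling
        Undersampling
    Train-Test Split
    Feature Scaling
        Standardization
        Normalization
    Conclusion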


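Introduction

Data preprocessing is the foundation of a robust machine learning model. It transforms raw data into a clean, organized format suitable for analysis. The process includes:

    Handling missing data: filling the gaps in the dataset.
    Encoding categorical variables: converting non-numeric data into a numeric format.
    Feature selection: identifying and keeping the most relevant features.
    Balancing the dataset: making sure the classes are evenly represented.
    Feature scaling: bringing features onto a comparable scale so that no single feature dominates.

Let's explore these concepts step by step using Python.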

Understanding the Dataset

Before preprocessing, it is important to understand the dataset we are working with. We use the Australian weather dataset, which contains 142,193 records and 24 columns. It includes meteorological attributes such as temperature, rainfall, and humidity, plus a target variable, RainTomorrow, that indicates whether it rains the next day.

Dataset sample

Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow
2008-12-01,Albury,13.4,22.9,0.6,NA,NA,W,44,W,WNW,20,24,71,22,1007.7,1007.1,8,NA,16.9,21.8,No,0,No
... (additional rows)
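The snippets in the rest of this guide work on a feature matrix X and a target Y without showing how they are built. Here is a minimal sketch of one way to do it, assuming X holds every column except RainTomorrow and Y holds RainTomorrow. The in-memory CSV reuses only the header and sample row shown above; the real data would be read from a file, whose name is not given here.

```python
import io

import pandas as pd

# Header and the single sample row from above, kept in memory for the demo.
csv_text = (
    "Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,"
    "WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,"
    "WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,"
    "Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RISK_MM,RainTomorrow\n"
    "2008-12-01,Albury,13.4,22.9,0.6,NA,NA,W,44,W,WNW,20,24,71,22,"
    "1007.7,1007.1,8,NA,16.9,21.8,No,0,No\n"
)

# pandas treats the literal string "NA" as a missing value by default.
data = pd.read_csv(io.StringIO(csv_text))

# Assumption: the features are all columns but the last; the target is the last.
X = data.iloc[:, :-1]
Y = data.iloc[:, -1]

print(data.shape)                    # (1, 24)
print(int(data.isna().sum().sum()))  # 3 missing values in this row
```

With the full file, the same two `iloc` slices would give a 23-column X, which matches the column indices used in the imputation steps.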
					
				
			
		



Handling Missing Data

Real-world datasets often contain missing values. Handling these gaps properly is essential to avoid biased results and to keep the model accurate.

Numerical Data

The numerical columns in our dataset include MinTemp, MaxTemp, Rainfall, and Evaporation, among others. Missing values in these columns can be filled with a statistic such as the mean, median, or mode. Here we use the mean.

import numpy as np
from sklearn.impute import SimpleImputer

# Initialize imputer with mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Positional indices of the numerical columns (Date is column 0)
numerical_cols = [2,3,4,5,6,8,11,12,13,14,15,16,17,18,19,20]

# Fit on the numerical columns and replace their missing values with the column means
X.iloc[:, numerical_cols] = imp_mean.fit_transform(X.iloc[:, numerical_cols])
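The choice of statistic matters when a column has outliers. The following sketch uses a made-up toy column (not values from the weather data) to compare the mean and median strategies of SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column: three observed values (one of them an outlier) and one gap.
values = np.array([[1.0], [2.0], [100.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(values)
median_filled = SimpleImputer(strategy="median").fit_transform(values)

print(mean_filled[3, 0])    # mean of 1, 2 and 100, pulled up by the outlier
print(median_filled[3, 0])  # median of 1, 2 and 100 -> 2.0
```

The mean is dragged toward the outlier, while the median stays near the typical values, so the median can be the safer default for skewed columns such as rainfall.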
					
				
			
		



Categorical Data

Missing values in categorical columns such as Location, WindGustDir, and WindDir9am cannot be filled with a mean or median. Instead, we fill them with the most frequent value (the mode).

import numpy as np
from sklearn.impute import SimpleImputer

# Positional indices of the categorical (string) columns
string_cols = [1,7,9,10,21]

# Initialize imputer with most frequent strategy
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit on the categorical columns and replace their missing values with the mode
X.iloc[:, string_cols] = imp_freq.fit_transform(X.iloc[:, string_cols])
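To see what the most_frequent strategy does, here is a small sketch on a made-up column of wind directions (illustrative values only):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy wind directions; one entry is missing.
wind = np.array([["W"], ["W"], [np.nan], ["N"]], dtype=object)

imp_freq = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imp_freq.fit_transform(wind)

print(filled.ravel().tolist())  # ['W', 'W', 'W', 'N']
print(imp_freq.statistics_[0])  # the fill value learned for the column: W
```

The learned fill value is stored in `statistics_`, so the same imputer can later be applied to new data with `transform`.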
					
				
			
		



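Feature Selection

Feature selection means identifying the variables that are most relevant to the prediction task. In our case, we drop two columns: Date, which is not useful as a raw feature here, and RISK_MM, which records the amount of rain on the following day and would therefore leak the target into the features.

# Dropping 'RISK_MM' and 'Date' columns
X.drop(['RISK_MM', 'Date'], axis=1, inplace=True)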
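Dropping columns shifts the positions of the remaining ones, which matters because the other steps address columns by index. The sketch below, which assumes X has the 23 feature columns from the header shown earlier, looks the positions up by name before and after the drop:

```python
import pandas as pd

# The 23 feature columns from the dataset header (RainTomorrow is the target).
columns = [
    "Date", "Location", "MinTemp", "MaxTemp", "Rainfall", "Evaporation",
    "Sunshine", "WindGustDir", "WindGustSpeed", "WindDir9am", "WindDir3pm",
    "WindSpeed9am", "WindSpeed3pm", "Humidity9am", "Humidity3pm",
    "Pressure9am", "Pressure3pm", "Cloud9am", "Cloud3pm", "Temp9am",
    "Temp3pm", "RainToday", "RISK_MM",
]
categorical = ["Location", "WindGustDir", "WindDir9am", "WindDir3pm", "RainToday"]

# An empty frame is enough to inspect column positions.
X = pd.DataFrame(columns=columns)
before = X.columns.get_indexer(categorical).tolist()

X = X.drop(columns=["RISK_MM", "Date"])
after = X.columns.get_indexer(categorical).tolist()

print(before)  # [1, 7, 9, 10, 21]
print(after)   # [0, 6, 8, 9, 20]
```

Looking indices up by name like this is one way to avoid hard-coding positions that silently go stale when the column set changes.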
					
				
			
		



Label Encoding

Label encoding converts a categorical target variable into numeric form. For a binary classification task such as predicting whether it will rain tomorrow (Yes or No), this is straightforward: each class is mapped to an integer.

from sklearn import preprocessing

# Initialize LabelEncoder
le = preprocessing.LabelEncoder()

# Fit and transform the target variable (classes are sorted, so 'No' -> 0 and 'Yes' -> 1)
Y = le.fit_transform(Y)
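As a quick illustration on made-up labels, the sketch below shows the mapping LabelEncoder learns and how to reverse it:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "No"]  # toy target values

le = LabelEncoder()
encoded = le.fit_transform(labels)

print(le.classes_.tolist())                   # ['No', 'Yes']
print(encoded.tolist())                       # [0, 1, 0, 0]
print(le.inverse_transform([1, 0]).tolist())  # ['Yes', 'No']
```

Keeping the fitted encoder around is useful later, because `inverse_transform` turns numeric predictions back into the original labels.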
					
				
			
		



One-Hot Encoding

Label encoding suits ordinal data, where the integer codes reflect a real order. For nominal data, whose categories have no inherent order, one-hot encoding is the better choice. It creates one binary column per category, so the model does not read a false ranking into the codes.

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Apply OneHotEncoder to the categorical columns; pass all other columns through unchanged
columnTransformer = ColumnTransformer(
    [('encoder', OneHotEncoder(), [0,6,8,9,20])],
    remainder='passthrough'
)

# Fit and transform the feature matrix
X = columnTransformer.fit_transform(X)

Note: columns [0,6,8,9,20] are the categorical features Location, WindGustDir, WindDir9am, WindDir3pm, and RainToday, at their positions after Date and RISK_MM were dropped.
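The effect of the transformer is easier to see on a tiny example. The sketch below uses a made-up two-column frame (illustrative values only) and sets sparse_threshold=0 so that the result is a dense array that can be printed directly:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame: one categorical column and one numerical column.
toy = pd.DataFrame({
    "Location": ["Albury", "Sydney", "Albury"],
    "MinTemp": [13.4, 19.5, 7.4],
})

ct = ColumnTransformer(
    [("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array, for readability
)
encoded = ct.fit_transform(toy)

print(encoded.tolist())
# [[1.0, 0.0, 13.4], [0.0, 1.0, 19.5], [1.0, 0.0, 7.4]]
```

The encoded columns come first (one per category, in sorted order), followed by the passed-through columns, so column positions in the output differ from those in the input.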

Handling Imbalanced Data

In an imbalanced dataset, one class greatly outnumbers the other, which can bias the model toward the majority class. Techniques such as oversampling and undersampling help balance the dataset.

Oversampling

Random oversampling balances the class distribution by duplicating instances of the minority class.

from imblearn.over_sampling import RandomOverSampler
from collections import Counter

# Initialize RandomOverSampler
ros = RandomOverSampler(random_state=42)

# Resample the dataset
X, Y = ros.fit_resample(X, Y)

# Verify the new class distribution
print(Counter(Y))

Output:

Counter({0: 110316, 1: 110316})

Undersampling

Random undersampling removes instances of the majority class, which can discard useful information. This guide uses oversampling so that every data point is kept.
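For comparison, the idea behind random undersampling can be sketched with plain NumPy on made-up labels. This is an illustration of the technique under simple assumptions (two classes, labels 0 and 1), not the implementation used by any particular library:

```python
from collections import Counter

import numpy as np

# Toy labels: 8 majority (0) and 3 minority (1) samples.
y = np.array([0] * 8 + [1] * 3)
X = np.arange(len(y)).reshape(-1, 1)

rng = np.random.default_rng(42)
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep a random subset of the majority class, as large as the minority class.
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([kept_majority, minority_idx]))

X_under, y_under = X[keep], y[keep]
print(Counter(y_under.tolist()))  # Counter({0: 3, 1: 3})
```

Five of the eight majority rows are thrown away here, which is exactly the information loss mentioned above.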

Train-Test Split

Splitting the dataset into training and test sets is essential for evaluating how the model performs on unseen data.

from sklearn.model_selection import train_test_split

# Split the data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, random_state=1
)

print(X_train.shape)  # Output: (176505, 115)
print(X_test.shape)   # Output: (44127, 115)
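One caveat about the order of steps: when oversampling is done before the split, duplicated minority rows can land in both the training and the test set, which tends to make test scores look better than they are. A common alternative is to split first and oversample only the training portion. The sketch below shows that order on made-up data; it uses resample from sklearn.utils as a stand-in for RandomOverSampler, purely to keep the example within scikit-learn.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data: 16 majority (0) and 4 minority (1) rows, each with a unique value.
X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)

# 1) Split first, keeping the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

# 2) Oversample the minority class inside the training set only.
minority = y_train == 1
n_majority = int((~minority).sum())
X_extra = resample(
    X_train[minority], replace=True, n_samples=n_majority, random_state=42
)
X_train_bal = np.vstack([X_train[~minority], X_extra])
y_train_bal = np.array([0] * n_majority + [1] * n_majority)

print(Counter(y_train_bal.tolist()))  # Counter({0: 12, 1: 12})
print(len(X_test))                    # 5, untouched by resampling
```

With this order, no row of the test set is ever copied into the training set, so the test score remains an honest estimate.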
					
				
			
		



Feature Scaling

Feature scaling puts all numerical features on a comparable scale, so that features with large ranges do not dominate the model.

Standardization

Standardization rescales each feature to a mean of 0 and a standard deviation of 1. The one-hot encoded feature matrix produced earlier is typically a sparse matrix, and StandardScaler cannot center sparse input, so we pass with_mean=False and scale to unit variance only.

from sklearn import preprocessing

# Initialize StandardScaler; with_mean=False is required when X_train is sparse
sc = preprocessing.StandardScaler(with_mean=False)

# Fit on the training data only
sc.fit(X_train)

# Transform both training and test data with the training statistics
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

print(X_train.shape)  # Output: (176505, 115)
print(X_test.shape)   # Output: (44127, 115)
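Two details of StandardScaler are worth seeing on toy data (made-up values): the statistics come from the training set only, and sparse input can be scaled but not centered. A minimal sketch:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Dense toy data: fit on the training rows, reuse the statistics on the test row.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])
sc = StandardScaler().fit(X_train)
train_scaled = sc.transform(X_train)
test_scaled = sc.transform(X_test)
print(train_scaled.ravel())  # training column now has mean 0 and std 1
print(test_scaled[0, 0])     # (4 - 2) / std of the training column

# Sparse toy data: centering would destroy sparsity, so it is refused.
X_sparse = sparse.csr_matrix([[1.0, 0.0], [0.0, 2.0], [1.0, 4.0]])
try:
    StandardScaler().fit(X_sparse)
    centering_refused = False
except ValueError:
    centering_refused = True

scaled_sparse = StandardScaler(with_mean=False).fit_transform(X_sparse)
print(centering_refused)               # True
print(sparse.issparse(scaled_sparse))  # True
```

Fitting on the training rows alone keeps information about the test set out of the preprocessing step.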
					
				
			
		



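Normalization

Normalization rescales each feature to the range 0 to 1. The pipeline in this guide relies on standardization, but normalization is another useful scaling technique, depending on the dataset and the model. It would be applied to the split data in place of standardization, not on top of it.

from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler does not accept sparse input, so convert to a dense array if needed
# (assumes X_train / X_test come straight from the train-test split)
X_train_dense = X_train.toarray() if hasattr(X_train, "toarray") else X_train
X_test_dense = X_test.toarray() if hasattr(X_test, "toarray") else X_test

# Initialize MinMaxScaler
mm_scaler = MinMaxScaler()

# Fit on the training data, then transform both sets
X_train_mm = mm_scaler.fit_transform(X_train_dense)
X_test_mm = mm_scaler.transform(X_test_dense)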
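The rescaling itself is the formula (x - min) / (max - min), with min and max taken from the training data. The sketch below checks this on made-up values and shows that test values outside the training range fall outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [3.0], [5.0]])  # toy training column
X_test = np.array([[7.0]])                 # toy test value above the training max

mm = MinMaxScaler().fit(X_train)
train_mm = mm.transform(X_train)
test_mm = mm.transform(X_test)

# Same result by hand, using the training min and max.
manual = (X_train - X_train.min()) / (X_train.max() - X_train.min())

print(train_mm.ravel().tolist())  # [0.0, 0.5, 1.0]
print(test_mm.ravel().tolist())   # [1.5]
```

A scaled test value of 1.5 is not an error; it simply means the test set contains a value larger than anything seen during training.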
					
				
			
		



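Conclusion

Effective data preprocessing is essential for building high-performing machine learning models. By carefully handling missing data, encoding categorical variables, balancing the dataset, and scaling features, you lay a solid foundation for the prediction task. This guide showed a practical approach using Python's libraries and how these techniques fit into a data science workflow.

Remember that data quality directly affects model success. Invest time in preprocessing to get the most out of your dataset.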



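Keywords

    Data preprocessing
    One-hot encoding
    Handling missing data
    Python pandas
    scikit-learn
    Machine learning
    Feature scaling
    Imbalanced data
    Label encoding
    Categorical variables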