Mastering Classification Models: A Comprehensive Python Template for Data Science
Table of Contents
- Introduction to Classification Models
- Setting Up the Environment
- Importing and Exploring the Data
- Handling Missing Data
- Encoding Categorical Variables
- Feature Selection
- Train-Test Split
- Feature Scaling
- Building and Evaluating Models
- Conclusion
1. Introduction to Classification Models
Classification models are a cornerstone of supervised machine learning, predicting discrete labels from input features. They power a wide range of applications, from email spam detection to medical diagnosis. Mastering them requires an understanding of data preprocessing, feature engineering, model selection, and evaluation metrics.
2. Setting Up the Environment
Before diving into model building, make sure your Python environment has the necessary libraries installed. Here's how to set it up:
```python
# Install the necessary libraries (the ! prefix runs a shell command in Jupyter)
!pip install pandas seaborn scikit-learn xgboost
```
Import the required libraries:
```python
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
import xgboost as xgb
```
3. Importing and Exploring the Data
For this tutorial we'll use the Australian weather dataset from Kaggle. This comprehensive dataset offers a diverse set of weather-related features, making it well suited for building classification models.
```python
# Import the data (make sure weatherAUS.csv is in your working directory)
data = pd.read_csv('weatherAUS.csv')
print(data.tail())
```
Sample output:
```
              Date Location  MinTemp  MaxTemp  Rainfall  Evaporation  Sunshine WindGustDir  WindGustSpeed WindDir9am  ...  Humidity3pm  Pressure9am  Pressure3pm  Cloud9am  Cloud3pm  Temp9am  Temp3pm RainToday  RISK_MM RainTomorrow
142188  2017-06-20    Uluru      3.5     21.8       0.0          NaN       NaN           E           31.0        ESE  ...         27.0       1024.7       1021.2       NaN       NaN      9.4     20.9        No      0.0           No
142189  2017-06-21    Uluru      2.8     23.4       0.0          NaN       NaN           E           31.0         SE  ...         24.0       1024.6       1020.3       NaN       NaN     10.1     22.4        No      0.0           No
142190  2017-06-22    Uluru      3.6     25.3       0.0          NaN       NaN         NNW           22.0         SE  ...         21.0       1023.5       1019.1       NaN       NaN     10.9     24.5        No      0.0           No
142191  2017-06-23    Uluru      5.4     26.9       0.0          NaN       NaN           N           37.0         SE  ...         24.0       1021.0       1016.8       NaN       NaN     12.5     26.1        No      0.0           No
142192  2017-06-24    Uluru      7.8     27.0       0.0          NaN       NaN          SE           28.0        SSE  ...         24.0       1019.4       1016.5       3.0       2.0     15.1     26.0        No      0.0           No
```
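Before deciding how to handle missing values in the next section, it helps to quantify them first. A quick check, using the data frame loaded above:

```python
# Count missing values per column, largest first
missing = data.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])
```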
4. Handling Missing Data
Data completeness is critical for building reliable models. Let's handle missing values in both the numeric and categorical features.
Handling Missing Numeric Data
Use SimpleImputer from scikit-learn to fill numeric missing values with each column's mean.
```python
from sklearn.impute import SimpleImputer

# Separate features and target
X = data.iloc[:, :-1]  # All columns except the last one
y = data.iloc[:, -1]   # Target column

# Identify numeric columns
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

# Impute missing numeric values with the column mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X[numerical_cols] = imp_mean.fit_transform(X[numerical_cols])
```
Handling Missing Categorical Data
For categorical variables, fill missing values with the most frequent (modal) value.
```python
# Identify categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns

# Impute missing categorical values with the most frequent value
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X[categorical_cols] = imp_freq.fit_transform(X[categorical_cols])
```
5. Encoding Categorical Variables
Machine learning models require numeric input, so categorical variables need to be encoded. We'll use label encoding for binary categories and one-hot encoding for multi-class ones.
Label Encoding
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)  # Encode the target variable
```
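If you later need to map the encoded labels back to their original values, the fitted encoder keeps the mapping. A quick check, assuming the RainTomorrow target encoded above:

```python
print(le.classes_)                   # e.g. ['No' 'Yes'], encoded as 0 and 1
print(le.inverse_transform([0, 1]))  # Recover the original string labels
```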
One-Hot Encoding
Define a helper that one-hot encodes the given columns, so the encoding strategy can be chosen per column based on its number of unique categories.
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode(columns, data):
    # sparse_output=False (sparse=False on scikit-learn < 1.2) keeps the result
    # dense so it can be wrapped back into a DataFrame
    ct = ColumnTransformer(
        transformers=[('encoder', OneHotEncoder(sparse_output=False), columns)],
        remainder='passthrough')
    transformed = ct.fit_transform(data)
    # Return a DataFrame so later steps can still reference column names
    return pd.DataFrame(transformed, columns=ct.get_feature_names_out(), index=data.index)

# Example usage:
# X = one_hot_encode(['WindGustDir', 'WindDir9am'], X)
```
Alternatively, automate the encoding decision with a threshold on the number of unique categories:
```python
def encoding_selection(X, threshold=10):
    # Identify string columns
    string_cols = X.select_dtypes(include=['object']).columns
    one_hot_encoding_cols = []
    for col in string_cols:
        unique_count = X[col].nunique()
        # Binary and high-cardinality columns get label encoding;
        # everything in between gets one-hot encoding
        if unique_count == 2 or unique_count > threshold:
            X[col] = le.fit_transform(X[col])
        else:
            one_hot_encoding_cols.append(col)
    if one_hot_encoding_cols:
        X = one_hot_encode(one_hot_encoding_cols, X)
    return X

X = encoding_selection(X)
```
6. Feature Selection
Reducing the number of features can improve model performance and lower computational cost. We'll use SelectKBest with the chi-squared test to pick the strongest features; since the chi-squared test requires non-negative inputs, the features are first scaled to [0, 1] with MinMaxScaler.
```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Scale features to [0, 1] (chi2 requires non-negative values)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Select the top K features
k = 10  # Adjust this based on your requirements
selector = SelectKBest(score_func=chi2, k=k)
X_selected = selector.fit_transform(X_scaled, y)

# Get the selected feature indices and names
selected_indices = selector.get_support(indices=True)
selected_features = X.columns[selected_indices]
print("Selected Features:", selected_features)
```
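To see how strongly each feature scored, rather than just which ones made the cut, you can inspect the fitted selector. A short sketch using the selector fitted above:

```python
# Rank all features by their chi-squared score, highest first
scores = pd.Series(selector.scores_, index=X.columns)
print(scores.sort_values(ascending=False).head(k))
```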
7. Train-Test Split
Splitting the dataset into training and test sets is essential for evaluating model performance on unseen data.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.20, random_state=1)
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
```
Output:
```
Training set shape: (113754, 10)
Test set shape: (28439, 10)
```
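The RainTomorrow target is noticeably imbalanced (far more "No" days than "Yes" days), so it can be worth preserving the class ratio in both splits. A minimal variant of the split above using scikit-learn's stratify parameter:

```python
# Keep the class proportions identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.20, random_state=1, stratify=y)
```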
8. Feature Scaling
Standardizing the features ensures that each one contributes equally to the distance computations in algorithms such as KNN and SVM.
```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply it to both sets,
# so that no information from the test set leaks into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print("Scaled Training set shape:", X_train.shape)
print("Scaled Test set shape:", X_test.shape)
```
Output:
```
Scaled Training set shape: (113754, 10)
Scaled Test set shape: (28439, 10)
```
9. Building and Evaluating Models
With preprocessing complete, we can now build and evaluate a range of classification models, comparing them on accuracy.
K-Nearest Neighbors (KNN)
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print("KNN Accuracy:", accuracy_score(y_test, y_pred))
```
Logistic Regression
```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=0, max_iter=200)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
Logistic Regression Accuracy: 0.99996
```
Gaussian Naive Bayes
```python
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("GaussianNB Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
GaussianNB Accuracy: 0.97437
```
Support Vector Machine (SVM)
```python
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print("SVM Accuracy:", accuracy_score(y_test, y_pred))
```
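Note that kernel SVC training scales poorly with sample size (roughly quadratic or worse), so fitting it on ~114k rows can take a long time. If training time becomes an issue, a linear SVM is a much faster alternative; a sketch using scikit-learn's LinearSVC, which optimizes a slightly different objective than SVC and so may give somewhat different results:

```python
from sklearn.svm import LinearSVC

# A faster linear alternative for large datasets
lin_svm = LinearSVC(max_iter=5000)  # raise max_iter if convergence warnings appear
lin_svm.fit(X_train, y_train)
print("LinearSVC Accuracy:", accuracy_score(y_test, lin_svm.predict(X_test)))
```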
Decision Tree Classifier
```python
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
Decision Tree Accuracy: 1.0
```
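A perfect score like this is usually a red flag rather than a triumph. In this dataset, the RISK_MM column records the amount of rain that fell the next day, which essentially encodes the RainTomorrow target, and the dataset's documentation recommends excluding it when training a binary classifier. A minimal sketch of the fix, applied right after loading the data and before any of the preprocessing above:

```python
# RISK_MM is tomorrow's rainfall amount and leaks the RainTomorrow target;
# drop it immediately after loading, before any preprocessing
data = data.drop(columns=['RISK_MM'])
```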
Random Forest Classifier
```python
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=500, max_depth=5)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
```
Output:
```
Random Forest Accuracy: 1.0
```
AdaBoost Classifier
```python
from sklearn.ensemble import AdaBoostClassifier

abc = AdaBoostClassifier()
abc.fit(X_train, y_train)
y_pred = abc.predict(X_test)
print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred))
```
XGBoost Classifier
```python
import xgboost as xgb

# use_label_encoder is deprecated in recent XGBoost releases and can be omitted there
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred))
```
Note: explicitly setting the eval_metric parameter, as shown above, suppresses XGBoost's warning about the default evaluation metric.
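Since this article is meant as a reusable template, it can also help to run all of the models in a single loop instead of one block per model. A minimal sketch, assuming the preprocessed X_train/X_test/y_train/y_test from above (SVC is omitted because of its long training time on a dataset this size, but it can be added the same way):

```python
# Compare all classifiers on the same split in one pass
models = {
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'Logistic Regression': LogisticRegression(random_state=0, max_iter=200),
    'GaussianNB': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=500, max_depth=5),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': xgb.XGBClassifier(eval_metric='logloss'),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.5f}")
```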
10. Conclusion
Building classification models doesn't have to be daunting. With a systematic approach to data preprocessing, encoding, feature selection, and model evaluation, you can efficiently develop robust models tailored to your specific needs. The master template presented in this article serves as a comprehensive guide, streamlining the entire workflow from data acquisition to model evaluation. Whether you're a beginner or a seasoned data scientist, a template like this can boost both productivity and model performance.
Key Takeaways:
- Data preprocessing: Clean and prepare your data carefully to ensure model accuracy.
- Encoding techniques: Encode categorical variables appropriately for the algorithms you use.
- Feature selection: Use feature selection methods to improve model efficiency and performance.
- Model diversity: Experiment with a variety of models to find the best performer for your dataset.
- Evaluation metrics: Don't rely on accuracy alone; consider other metrics such as precision, recall, and F1-score for a well-rounded assessment, as in the sketch below.
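A minimal sketch of such an assessment, using scikit-learn's built-in report on any of the models fitted above (rfc is assumed to be the fitted random forest from Section 9, and le the label encoder from Section 5):

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = rfc.predict(X_test)
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted class
print(classification_report(y_test, y_pred, target_names=le.classes_))  # precision, recall, F1 per class
```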
Embrace these practices to bring clarity and precision to your data science projects!