Implementing Logistic Regression for Multiclass Classification in Python: A Comprehensive Guide
In the ever-evolving field of machine learning, multiclass classification is a key task: it makes it possible to distinguish among multiple categories in a dataset. Among the many available algorithms, Logistic Regression stands out as a robust and easily interpretable choice for tackling such problems. In this guide, we take a deep dive into implementing Logistic Regression for multiclass classification in Python, using tools such as Scikit-learn and the Bangla Music Dataset from Kaggle.
Introduction to Multiclass Classification
Multiclass classification is a classification task in which each instance is assigned to one of three or more classes. Unlike binary classification, which deals with two-class problems, multiclass classification poses unique challenges and requires algorithms that can discriminate effectively among multiple categories.
Logistic Regression is traditionally used for binary classification, but it can be extended to multiclass settings through strategies such as one-vs-rest (OvR) or the multinomial approach. Its simplicity, interpretability, and efficiency make it a popular choice for a wide range of classification tasks.
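As a quick illustration of the two strategies, here is a minimal sketch using scikit-learn's bundled Iris data rather than the Bangla dataset; the dataset choice and accuracy numbers are illustrative only, not part of the original analysis:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Toy three-class dataset, split for a quick comparison
X_iris, y_iris = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_iris, y_iris, random_state=0)

# One-vs-rest: one binary logistic model per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(Xtr, ytr)

# Multinomial: a single model with a softmax over all classes
multi = LogisticRegression(max_iter=1000, multi_class='multinomial').fit(Xtr, ytr)

print("OvR accuracy:        ", ovr.score(Xte, yte))
print("Multinomial accuracy:", multi.score(Xte, yte))

Both strategies are available through the same estimator API, which is why switching between them later in a pipeline is usually a one-line change.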
Understanding the Dataset
For this guide, we use the Bangla Music Dataset, which contains features extracted from Bangla songs. The main goal is to classify songs into genres based on these features. The dataset includes a variety of audio features, such as spectral centroid, spectral bandwidth, chroma frequencies, and Mel-frequency cepstral coefficients (MFCCs).
Dataset source: Kaggle - Bangla Music Dataset
Sample Data Overview
import pandas as pd

# Load the dataset
data = pd.read_csv('bangla.csv')
print(data.tail())
                                              file_name  zero_crossing  \
1737  Tumi Robe Nirobe, Artist - DWIJEN MUKHOPADHYA...          78516
1738  TUMI SANDHYAR MEGHMALA Srikanta Acharya Rabi...          176887
1739  Utal Haowa Laglo Amar Gaaner Taranite Sagar S...         133326
1740  venge mor ghorer chabi by anima roy.. album ro...        179932
1741  vora thak vora thak by anima roy ( 160kbps ).mp3         175244

      spectral_centroid  spectral_rolloff  spectral_bandwidth  \
1737         800.797115       1436.990088         1090.389766
1738        1734.844686       3464.133429         1954.831684
1739        1380.139172       2745.410904         1775.717428
1740        1961.435018       4141.554401         2324.507425
1741        1878.657768       3877.461439         2228.147952

      chroma_frequency      rmse         delta  melspectogram       tempo  \
1737          0.227325  0.108344  2.078194e-08       3.020211  117.453835
1738          0.271189  0.124934  5.785562e-08       4.098559  129.199219
1739          0.263462  0.111411  4.204189e-08       3.147722  143.554688
1740          0.261823  0.168673  3.245319e-07       7.674615  143.554688
1741          0.232985  0.311113  1.531590e-07      26.447679  129.199219

      ...    mfcc11     mfcc12     mfcc13    mfcc14    mfcc15    mfcc16  \
1737  ... -2.615630   2.119485 -12.506942 -1.148996  0.090582 -8.694072
1738  ...  1.693247  -4.076407  -2.017894 -7.419591 -0.488603 -8.690254
1739  ...  2.487961  -3.434017  -6.099467 -6.008315 -7.483330 -2.908477
1740  ...  1.192605 -13.142963   0.281834 -5.981567 -1.066383  0.677886
1741  ... -5.636770 -12.078487   1.692546 -6.005674  1.502304 -0.415201

        mfcc17    mfcc18    mfcc19     label
1737 -6.597594  2.925687 -6.154576  rabindra
1738 -7.090489 -6.530357 -5.593533  rabindra
1739  0.783345 -3.394053 -3.157621  rabindra
1740  0.803132 -3.304548  4.309490  rabindra
1741  2.389623 -3.135799  0.225479  rabindra

[5 rows x 31 columns]
Data Preprocessing
Effective data preprocessing is essential for building reliable machine learning models. This section outlines the steps taken to prepare the data for modeling.
Handling Missing Data
Missing data can adversely affect the performance of machine learning models. It is essential to identify missing values and handle them appropriately.
Numerical Data
For numerical features, missing values are imputed using the mean strategy.
import numpy as np
from sklearn.impute import SimpleImputer

# Identify numerical columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Initialize SimpleImputer for mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform the data
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
Categorical Data
For categorical features, missing values are imputed using the most frequent strategy.
# Identify string columns
string_cols = list(np.where((X.dtypes == object))[0])

# Initialize SimpleImputer for most frequent strategy
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit and transform the data
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
Encoding Categorical Variables
Machine learning algorithms require numerical input, so categorical variables must be encoded appropriately.
One-Hot Encoding
For nominal categorical features with a manageable number of unique categories, one-hot encoding is used to avoid introducing an artificial ordinal relationship.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)],
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)
Label Encoding
For binary features, or for features with so many categories that one-hot encoding would inflate the dimensionality, label encoding is used.
from sklearn import preprocessing

def LabelEncoderMethod(series):
    le = preprocessing.LabelEncoder()
    le.fit(series)
    return le.transform(series)
Encoding Selection for X
A combination of encoding strategies is applied, depending on the number of unique categories in each feature.
def EncodingSelection(X, threshold=10):
    # Step 01: Select the string columns
    string_cols = list(np.where((X.dtypes == object))[0])
    one_hot_encoding_indices = []

    # Step 02: Label encode columns with exactly 2, or more than 'threshold', categories
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    # Step 03: One-hot encode the remaining columns
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Apply encoding selection
X = EncodingSelection(X)
print(f"Encoded feature shape: {X.shape}")
Output:
Encoded feature shape: (1742, 30)
Feature Selection
Selecting the most relevant features can improve model performance and reduce computational complexity.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

# Initialize Min-Max scaler (chi2 requires non-negative features)
MMS = preprocessing.MinMaxScaler()

# Define the number of best features to keep
K_features = 12

# Scale the features
x_temp = MMS.fit_transform(X)

# Apply SelectKBest with chi-squared scoring
kbest = SelectKBest(score_func=chi2, k=K_features)
kbest.fit(x_temp, y)

# Identify the top features
best_features = np.argsort(kbest.scores_)[-K_features:]

# Determine the features to delete
features_to_delete = np.argsort(kbest.scores_)[:-K_features]

# Reduce X to the selected features
X = np.delete(X, features_to_delete, axis=1)
print(f"Reduced feature shape: {X.shape}")
Output:
Reduced feature shape: (1742, 12)
Model Training and Evaluation
After preprocessing and feature selection, we move on to training and evaluating our models. The data is first split into training and test sets and standardized, as shown below.
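The snippet below mirrors the corresponding split and scaling steps from the complete implementation at the end of this guide:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)

# Standardize features; with_mean=False also works for sparse matrices
sc = StandardScaler(with_mean=False)
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)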
K-Nearest Neighbors (KNN) Classifier
KNN is a simple instance-based learning algorithm that serves as a useful baseline for classification tasks.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize KNN with 8 neighbors
knnClassifier = KNeighborsClassifier(n_neighbors=8)

# Train the model
knnClassifier.fit(X_train, y_train)

# Make predictions
y_pred_knn = knnClassifier.predict(X_test)

# Evaluate accuracy
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {knn_accuracy:.2f}")
Output:
KNN Accuracy: 0.68
Logistic Regression Model
Here, Logistic Regression is extended to handle multiclass classification via the multinomial approach.
from sklearn.linear_model import LogisticRegression

# Initialize multinomial Logistic Regression with increased iterations
LRM = LogisticRegression(random_state=0, max_iter=1000, multi_class='multinomial', solver='lbfgs')

# Train the model
LRM.fit(X_train, y_train)

# Make predictions
y_pred_lr = LRM.predict(X_test)

# Evaluate accuracy
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy: {lr_accuracy:.2f}")
Output:
Logistic Regression Accuracy: 0.65
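One practical advantage of the multinomial formulation is that the fitted model exposes a full probability distribution over genres. A small illustrative snippet (assuming the fitted LRM and scaled X_test from above):

import numpy as np

# Probability distribution over genres for the first five test songs
probs = LRM.predict_proba(X_test[:5])
print(LRM.classes_)        # genre labels, in column order
print(np.round(probs, 3))  # each row sums to 1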
Comparative Analysis
After evaluating both models, the K-Nearest Neighbors classifier outperformed Logistic Regression in this particular case:
- KNN accuracy: 67.9%
- Logistic Regression accuracy: 65.0%
However, the following observations are worth noting:
- Iteration limit warning: Logistic Regression initially ran into convergence issues, which were resolved by increasing the max_iter parameter from 300 to 1000.
- Model performance: although KNN achieved higher accuracy, Logistic Regression offers better interpretability and may scale better to larger datasets.
Future improvements:
- Hyperparameter tuning: adjusting Logistic Regression parameters such as C and penalty can improve performance (see the sketch after this list).
- Cross-validation: applying cross-validation techniques would give a more robust estimate of model performance.
- Feature engineering: creating or selecting more informative features can raise classification accuracy.
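As a starting point for the first two suggestions, here is a minimal sketch combining a grid search over C and penalty with 5-fold cross-validation. The parameter ranges are assumptions rather than tuned values, and the sketch assumes the X_train/y_train arrays from above:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Illustrative search space (assumed ranges, not tuned values)
param_grid = {
    'C': [0.01, 0.1, 1.0, 10.0],
    'penalty': ['l2'],  # the lbfgs solver supports only the l2 (or no) penalty
}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000, solver='lbfgs'),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring='accuracy',
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")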
Conclusion
This comprehensive guide has walked through implementing Logistic Regression for multiclass classification in Python, covering everything from data preprocessing to model evaluation. Although KNN achieved better accuracy in this example, Logistic Regression remains a powerful tool, especially when interpretability is a priority. With structured preprocessing, careful feature selection, and deliberate model training, multiclass classification problems across many domains can be tackled effectively.
Complete Python Implementation
Below is the complete Python code covering all the steps discussed:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset
data = pd.read_csv('bangla.csv')

# Separate features and target
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Handle missing numerical data with the mean strategy
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])

# Handle missing string data with the most frequent strategy
string_cols = list(np.where((X.dtypes == object))[0])
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])

# Encoding methods
def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)],
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)

def LabelEncoderMethod(series):
    le = LabelEncoder()
    le.fit(series)
    return le.transform(series)

def EncodingSelection(X, threshold=10):
    string_cols = list(np.where((X.dtypes == object))[0])
    one_hot_encoding_indices = []

    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Apply encoding selection
X = EncodingSelection(X)
print(f"Encoded feature shape: {X.shape}")

# Feature selection with chi-squared scores
MMS = MinMaxScaler()
K_features = 12
x_temp = MMS.fit_transform(X)
kbest = SelectKBest(score_func=chi2, k=K_features)
kbest.fit(x_temp, y)
best_features = np.argsort(kbest.scores_)[-K_features:]
features_to_delete = np.argsort(kbest.scores_)[:-K_features]
X = np.delete(X, features_to_delete, axis=1)
print(f"Reduced feature shape: {X.shape}")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)
print(f"Training set shape: {X_train.shape}")

# Feature scaling
sc = StandardScaler(with_mean=False)
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
print(f"Scaled Training set shape: {X_train.shape}")
print(f"Scaled Test set shape: {X_test.shape}")

# Build the KNN model
knnClassifier = KNeighborsClassifier(n_neighbors=8)
knnClassifier.fit(X_train, y_train)
y_pred_knn = knnClassifier.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {knn_accuracy:.2f}")

# Build the Logistic Regression model
LRM = LogisticRegression(random_state=0, max_iter=1000, multi_class='multinomial', solver='lbfgs')
LRM.fit(X_train, y_train)
y_pred_lr = LRM.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy: {lr_accuracy:.2f}")
Note: before running the code, make sure the dataset bangla.csv is placed in your working directory.
Keywords
- Logistic Regression
- Multiclass Classification
- Python Tutorial
- Machine Learning
- Data Preprocessing
- Feature Selection
- K-Nearest Neighbors (KNN)
- Scikit-learn
- Data Science
- Machine Learning in Python