Implementing Logistic Regression for Multiclass Classification in Python: A Comprehensive Guide
In the ever-evolving field of machine learning, multiclass classification stands as a pivotal task, enabling the differentiation between multiple categories within a dataset. Among the myriad of algorithms available, Logistic Regression emerges as a robust and interpretable choice for tackling such problems. In this guide, we delve deep into implementing logistic regression for multiclass classification using Python, leveraging tools like Scikit-learn and a Bangla music dataset sourced from Kaggle.
Table of Contents
- Introduction to Multiclass Classification
- Understanding the Dataset
- Data Preprocessing
- Feature Selection
- Model Training and Evaluation
- Comparative Analysis
- Conclusion
- Full Python Implementation
Introduction to Multiclass Classification
Multiclass classification is a type of classification task where each instance is categorized into one of three or more classes. Unlike binary classification, which deals with two classes, multiclass classification presents unique challenges and requires algorithms that can effectively distinguish between multiple categories.
Logistic Regression is traditionally known for binary classification but can be extended to handle multiclass scenarios using strategies like One-vs-Rest (OvR) or multinomial approaches. Its simplicity, interpretability, and efficiency make it a popular choice for various classification tasks.
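To make the two strategies concrete, here is a minimal, self-contained sketch (using scikit-learn's built-in Iris data purely for illustration, not the Bangla music dataset analyzed below) that fits the same estimator once with the multinomial formulation and once wrapped in an explicit One-vs-Rest scheme:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Toy three-class dataset, used here only to illustrate the two strategies
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Multinomial (softmax) formulation: one joint model over all classes
multinomial_clf = LogisticRegression(multi_class='multinomial', max_iter=1000)
multinomial_clf.fit(X_train, y_train)

# One-vs-Rest: one binary logistic regression per class
ovr_clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr_clf.fit(X_train, y_train)

print("Multinomial test accuracy:", multinomial_clf.score(X_test, y_test))
print("One-vs-Rest test accuracy:", ovr_clf.score(X_test, y_test))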
Understanding the Dataset
For this guide, we utilize the Bangla Music Dataset, which contains features extracted from Bangla songs. The primary objective is to classify songs into genres based on these features. The dataset includes various audio features such as spectral centroid, spectral bandwidth, chroma frequency, and Mel-frequency cepstral coefficients (MFCCs).
Dataset Source: Kaggle – Bangla Music Dataset
Sample Data Overview
import pandas as pd

# Load the dataset
data = pd.read_csv('bangla.csv')
print(data.tail())

Output:

                                              file_name  zero_crossing  \
1737  Tumi Robe Nirobe, Artist - DWIJEN MUKHOPADHYA...          78516
1738  TUMI SANDHYAR MEGHMALA  Srikanta Acharya Rabi...         176887
1739  Utal Haowa Laglo Amar Gaaner Taranite Sagar S...         133326
1740  venge mor ghorer chabi by anima roy.. album ro...        179932
1741   vora thak vora thak by anima roy ( 160kbps ).mp3        175244

      spectral_centroid  spectral_rolloff  spectral_bandwidth  \
1737         800.797115       1436.990088         1090.389766
1738        1734.844686       3464.133429         1954.831684
1739        1380.139172       2745.410904         1775.717428
1740        1961.435018       4141.554401         2324.507425
1741        1878.657768       3877.461439         2228.147952

      chroma_frequency      rmse         delta  melspectogram       tempo  \
1737          0.227325  0.108344  2.078194e-08       3.020211  117.453835
1738          0.271189  0.124934  5.785562e-08       4.098559  129.199219
1739          0.263462  0.111411  4.204189e-08       3.147722  143.554688
1740          0.261823  0.168673  3.245319e-07       7.674615  143.554688
1741          0.232985  0.311113  1.531590e-07      26.447679  129.199219

      ...     mfcc11     mfcc12     mfcc13    mfcc14    mfcc15    mfcc16  \
1737  ...  -2.615630   2.119485 -12.506942 -1.148996  0.090582 -8.694072
1738  ...   1.693247  -4.076407  -2.017894 -7.419591 -0.488603 -8.690254
1739  ...   2.487961  -3.434017  -6.099467 -6.008315 -7.483330 -2.908477
1740  ...   1.192605 -13.142963   0.281834 -5.981567 -1.066383  0.677886
1741  ...  -5.636770 -12.078487   1.692546 -6.005674  1.502304 -0.415201

        mfcc17    mfcc18    mfcc19     label
1737 -6.597594  2.925687 -6.154576  rabindra
1738 -7.090489 -6.530357 -5.593533  rabindra
1739  0.783345 -3.394053 -3.157621  rabindra
1740  0.803132 -3.304548  4.309490  rabindra
1741  2.389623 -3.135799  0.225479  rabindra

[5 rows x 31 columns]
Data Preprocessing
Effective data preprocessing is paramount to building a reliable machine learning model. This section outlines the steps undertaken to prepare the data for modeling.
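The preprocessing snippets that follow operate on a feature matrix X and a target vector y. As in the full implementation at the end of this guide, the genre label is taken to be the last column of the dataset:

# Separate features (all columns except the last) and target (the genre label)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]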
Handling Missing Data
Missing data can adversely affect the performance of machine learning models. It’s crucial to identify and appropriately handle missing values.
Numeric Data
For numerical features, missing values are imputed using the mean strategy.
import numpy as np
from sklearn.impute import SimpleImputer

# Identify numerical columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])

# Initialize SimpleImputer with the mean strategy
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform the data
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])
Categorical Data
For categorical features, missing values are imputed using the most frequent strategy.
# Identify string columns
string_cols = list(np.where(X.dtypes == object)[0])

# Initialize SimpleImputer with the most frequent strategy
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Fit and transform the data
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
Encoding Categorical Variables
Machine learning algorithms require numerical input. Thus, categorical variables need to be encoded appropriately.
One-Hot Encoding
One-Hot Encoding is applied to categorical features with a moderate number of unique categories (more than two but below the chosen threshold), preventing the introduction of artificial ordinal relationships.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def OneHotEncoderMethod(indices, data):
    # One-hot encode the columns at the given indices, pass the rest through unchanged
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)],
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)
Label Encoding
Label Encoding is used for binary categorical features and for high-cardinality features whose category count exceeds the threshold, keeping the feature space compact.
from sklearn import preprocessing

def LabelEncoderMethod(series):
    # Map each category to an integer label
    le = preprocessing.LabelEncoder()
    le.fit(series)
    return le.transform(series)
Encoding Selection for X
A combination of encoding strategies is applied based on the number of unique categories in each feature.
def EncodingSelection(X, threshold=10):
    # Step 01: Select the string columns
    string_cols = list(np.where(X.dtypes == object)[0])
    one_hot_encoding_indices = []

    # Step 02: Label encode columns with exactly 2 or more than 'threshold' categories
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    # Step 03: One-hot encode the remaining columns
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Apply encoding selection
X = EncodingSelection(X)
print(f"Encoded feature shape: {X.shape}")
Output:
Encoded feature shape: (1742, 30)
Feature Selection
Selecting the most relevant features enhances model performance and reduces computational complexity.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn import preprocessing

# Initialize Min-Max Scaler (chi2 requires non-negative features)
MMS = preprocessing.MinMaxScaler()

# Define number of best features to keep
K_features = 12

# Scale the features
x_temp = MMS.fit_transform(X)

# Score every feature with the chi-squared statistic
# (only the scores are used below; the selector's transform is not applied)
kbest = SelectKBest(score_func=chi2, k='all')
kbest.fit(x_temp, y)

# Identify the top-scoring features
best_features = np.argsort(kbest.scores_)[-K_features:]

# Determine the features to drop
features_to_delete = np.argsort(kbest.scores_)[:-K_features]

# Reduce X to the selected features
X = np.delete(X, features_to_delete, axis=1)
print(f"Reduced feature shape: {X.shape}")
Output:
Reduced feature shape: (1742, 12)
Model Training and Evaluation
With the data preprocessed and features selected, we proceed to train and evaluate our models.
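Before any model is fit, the data is split into training and test sets and scaled. These steps appear in the full implementation at the end of this guide and are reproduced here so the model snippets below run in context: an 80/20 train-test split followed by standardization (with_mean=False is used because the encoded feature matrix may be sparse).

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)

# Standardize features; with_mean=False keeps sparse matrices sparse
sc = StandardScaler(with_mean=False)
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)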
K-Nearest Neighbors (KNN) Classifier
KNN is a simple, instance-based learning algorithm that can serve as a baseline for classification tasks.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize KNN with 8 neighbors
knnClassifier = KNeighborsClassifier(n_neighbors=8)

# Train the model
knnClassifier.fit(X_train, y_train)

# Make predictions
y_pred_knn = knnClassifier.predict(X_test)

# Evaluate accuracy (true labels first, predictions second)
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {knn_accuracy:.2f}")
Output:
KNN Accuracy: 0.68
Logistic Regression Model
Logistic Regression is extended here to handle multiclass classification using the multinomial approach.
from sklearn.linear_model import LogisticRegression

# Initialize Logistic Regression with an increased iteration limit
LRM = LogisticRegression(random_state=0, max_iter=1000,
                         multi_class='multinomial', solver='lbfgs')

# Train the model
LRM.fit(X_train, y_train)

# Make predictions
y_pred_lr = LRM.predict(X_test)

# Evaluate accuracy (true labels first, predictions second)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy: {lr_accuracy:.2f}")
Output:
Logistic Regression Accuracy: 0.65
Comparative Analysis
Upon evaluating both models, the K-Nearest Neighbors classifier outperforms Logistic Regression in this particular scenario.
- KNN Accuracy: 67.9%
- Logistic Regression Accuracy: 65.0%
However, it’s essential to note the following observations:
- Iteration Limit Warning: Initially, logistic regression faced convergence issues, which were resolved by increasing the max_iter parameter from 300 to 1000.
- Model Performance: While KNN showed higher accuracy, Logistic Regression offers better interpretability and can be more scalable with larger datasets.
Future Enhancements:
- Hyperparameter Tuning: Adjusting parameters like C, penalty, and others in Logistic Regression can lead to improved performance (a sketch follows this list).
- Cross-Validation: Implementing cross-validation techniques can provide a more robust evaluation of model performance.
- Feature Engineering: Creating or selecting more informative features can enhance the classification accuracy.
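As a starting point for the first two items, the sketch below combines GridSearchCV with 5-fold cross-validation to tune the multinomial logistic regression. It is an illustration rather than part of the original pipeline, and the candidate parameter values are assumptions chosen for demonstration.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate values for the regularization strength and penalty type (illustrative choices)
param_grid = {
    'C': [0.01, 0.1, 1.0, 10.0],
    'penalty': ['l2'],  # the lbfgs solver supports only l2 (or no) regularization
}

grid_search = GridSearchCV(
    LogisticRegression(max_iter=1000, multi_class='multinomial', solver='lbfgs'),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validated accuracy:", grid_search.best_score_)
print("Test accuracy with best model:", grid_search.score(X_test, y_test))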
Conclusion
This comprehensive guide demonstrates the implementation of Logistic Regression for multiclass classification in Python, highlighting the entire process from data preprocessing to model evaluation. While KNN showcased better accuracy in this case, Logistic Regression remains a powerful tool, especially when interpretability is a priority. By following structured preprocessing, feature selection, and thoughtful model training, one can effectively tackle multiclass classification problems in various domains.
Full Python Implementation
Below is the complete Python code encapsulating all the steps discussed:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the dataset
data = pd.read_csv('bangla.csv')

# Separate features and target
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Handling missing data - numeric columns
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])

# Handling missing data - string columns
string_cols = list(np.where(X.dtypes == object)[0])
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])

# Encoding methods
def OneHotEncoderMethod(indices, data):
    # One-hot encode the columns at the given indices, pass the rest through unchanged
    columnTransformer = ColumnTransformer(
        [('encoder', OneHotEncoder(), indices)],
        remainder='passthrough'
    )
    return columnTransformer.fit_transform(data)

def LabelEncoderMethod(series):
    # Map each category to an integer label
    le = LabelEncoder()
    le.fit(series)
    return le.transform(series)

def EncodingSelection(X, threshold=10):
    # Label encode binary and high-cardinality columns, one-hot encode the rest
    string_cols = list(np.where(X.dtypes == object)[0])
    one_hot_encoding_indices = []
    for col in string_cols:
        length = len(pd.unique(X[X.columns[col]]))
        if length == 2 or length > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)
    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Apply encoding selection
X = EncodingSelection(X)
print(f"Encoded feature shape: {X.shape}")

# Feature selection with the chi-squared statistic
MMS = MinMaxScaler()
K_features = 12
x_temp = MMS.fit_transform(X)
kbest = SelectKBest(score_func=chi2, k='all')
kbest.fit(x_temp, y)
best_features = np.argsort(kbest.scores_)[-K_features:]
features_to_delete = np.argsort(kbest.scores_)[:-K_features]
X = np.delete(X, features_to_delete, axis=1)
print(f"Reduced feature shape: {X.shape}")

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1
)
print(f"Training set shape: {X_train.shape}")

# Feature scaling
sc = StandardScaler(with_mean=False)
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
print(f"Scaled Training set shape: {X_train.shape}")
print(f"Scaled Test set shape: {X_test.shape}")

# Building the KNN model
knnClassifier = KNeighborsClassifier(n_neighbors=8)
knnClassifier.fit(X_train, y_train)
y_pred_knn = knnClassifier.predict(X_test)
knn_accuracy = accuracy_score(y_test, y_pred_knn)
print(f"KNN Accuracy: {knn_accuracy:.2f}")

# Building the Logistic Regression model
LRM = LogisticRegression(random_state=0, max_iter=1000,
                         multi_class='multinomial', solver='lbfgs')
LRM.fit(X_train, y_train)
y_pred_lr = LRM.predict(X_test)
lr_accuracy = accuracy_score(y_test, y_pred_lr)
print(f"Logistic Regression Accuracy: {lr_accuracy:.2f}")
Note: Ensure that the dataset bangla.csv is correctly placed in your working directory before executing the code.
Keywords
- Logistic Regression
- Multiclass Classification
- Python Tutorial
- Machine Learning
- Data Preprocessing
- Feature Selection
- K-Nearest Neighbors (KNN)
- Scikit-learn
- Data Science
- Python Machine Learning