Mastering Multiclass Classification with K-Nearest Neighbors (KNN): A Comprehensive Guide
Table of Contents
- Introduction to Classification
- Binary vs. Multiclass Classification
- Understanding K-Nearest Neighbors (KNN)
- Implementing KNN for Multiclass Classification
- Case Study: Classifying Bangla Music Genres
- Building and Evaluating the KNN Model
- Conclusion
- FAQs
Introduction to Classification
Classification is a supervised learning technique where the goal is to predict categorical labels for given input data. It’s widely used in various applications, such as spam detection in emails, image recognition, medical diagnosis, and more. Classification tasks can be broadly categorized into two types: binary classification and multiclass classification.
Binary vs. Multiclass Classification
- Binary Classification: This involves categorizing data into two distinct classes. For example, determining whether an email is spam or not spam.
- Multiclass Classification: This extends binary classification to scenarios where there are more than two classes. For instance, classifying different genres of music or types of vehicles.
Understanding the difference is crucial as it influences the choice of algorithms and evaluation metrics.
Understanding K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a simple, yet powerful machine learning algorithm used for both classification and regression tasks. Here’s a breakdown of how KNN works:
- Instance-Based Learning: KNN doesn’t build an explicit model. Instead, it memorizes the training dataset.
- Distance Measurement: To make a prediction, KNN calculates the distance between the new data point and all points in the training set.
- Voting Mechanism: For classification, KNN selects the ‘k’ closest neighbors and assigns the most common class among them to the new data point.
- Choice of ‘k’: The number of neighbors, ‘k’, is a crucial hyperparameter. A small ‘k’ can make the model sensitive to noise, while a large ‘k’ can smooth out the decision boundaries.
KNN is particularly effective for multiclass classification due to its inherent ability to handle multiple classes through voting.
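To make these steps concrete, here is a minimal from-scratch sketch of the KNN prediction step (not the scikit-learn implementation used later in this guide); the toy feature vectors and genre labels are invented purely for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Compute Euclidean distances from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: three classes in a 2-D feature space
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0], [8.8, 1.2]])
y_train = np.array(['rock', 'folk-rock', 'pop', 'pop', 'classical', 'classical'])
x_new = np.array([5.1, 5.1])

print(knn_predict(X_train, y_train, x_new, k=3))  # prints 'pop' (2 of the 3 nearest neighbors)
```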
Implementing KNN for Multiclass Classification
Implementing KNN for multiclass classification involves several steps, including data preprocessing, feature selection, scaling, and model evaluation. Let’s explore these steps through a practical case study.
Case Study: Classifying Bangla Music Genres
In this section, we’ll walk through a practical implementation of multiclass classification using KNN on a Bangla music dataset. The objective is to categorize songs into different genres based on various audio features.
Dataset Overview
The Bangla Music Dataset comprises data from 1,742 songs categorized into six distinct genres. Each song is described using 31 features, including audio attributes like zero crossing rate, spectral centroid, chroma frequency, and MFCCs (Mel Frequency Cepstral Coefficients).
Key Features:
- Numerical Features: Such as zero crossing, spectral centroid, spectral rolloff, etc.
- Categorical Features: File names and labels indicating the genre.
Target Variable: The genre label (label) indicating the music category.
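The preprocessing code in the following sections operates on a feature matrix X and a target vector y. A minimal loading sketch is shown below; the file name bangla_music.csv is a hypothetical placeholder, while the label column corresponds to the target variable described above:

```python
import pandas as pd

# Load the dataset (the file name here is a placeholder -- use the actual CSV path)
df = pd.read_csv('bangla_music.csv')

# Separate features and target: 'label' holds the genre, everything else is a feature
y = df['label']
X = df.drop(columns=['label'])
```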
Data Preprocessing Steps
Data preprocessing is a critical step in machine learning workflows. Proper preprocessing ensures that the data is clean, consistent, and suitable for model training.
Handling Missing Data
Why It Matters: Missing data can skew results and reduce the model’s effectiveness. It’s essential to address missing values to maintain data integrity.
Steps:
- Numeric Data:
- Use the Mean Imputation strategy to fill missing values.
- Implemented using SimpleImputer with strategy='mean'.
- Categorical Data:
- Use the Most Frequent Imputation strategy to fill missing values.
- Implemented using SimpleImputer with strategy='most_frequent'.
Python Implementation:
```python
import numpy as np
from sklearn.impute import SimpleImputer

# Handling numeric data
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
numerical_cols = list(np.where((X.dtypes == np.int64) | (X.dtypes == np.float64))[0])
imp_mean.fit(X.iloc[:, numerical_cols])
X.iloc[:, numerical_cols] = imp_mean.transform(X.iloc[:, numerical_cols])

# Handling categorical data
imp_freq = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
string_cols = list(np.where((X.dtypes == object))[0])
imp_freq.fit(X.iloc[:, string_cols])
X.iloc[:, string_cols] = imp_freq.transform(X.iloc[:, string_cols])
```
Encoding Categorical Variables
Why It Matters: Machine learning models require numerical input. Categorical variables need to be converted into numerical format.
Two Primary Encoding Methods:
- Label Encoding:
- Assigns a unique integer to each category.
- Suitable for binary or ordinal categorical variables.
- One-Hot Encoding:
- Creates binary columns for each category.
- Suitable for nominal categorical variables with more than two categories.
Encoding Strategy:
- Categories with Two Classes or More Than a Threshold: Apply label encoding.
- Other Categories: Apply one-hot encoding.
Python Implementation:
```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Label Encoding Function
def LabelEncoderMethod(series):
    le = LabelEncoder()
    return le.fit_transform(series)

# One-Hot Encoding Function
def OneHotEncoderMethod(indices, data):
    columnTransformer = ColumnTransformer([('encoder', OneHotEncoder(), indices)],
                                          remainder='passthrough')
    return columnTransformer.fit_transform(data)

# Encoding Selection Function
def EncodingSelection(X, threshold=10):
    string_cols = list(np.where((X.dtypes == object))[0])
    one_hot_encoding_indices = []

    for col in string_cols:
        unique_values = len(pd.unique(X[X.columns[col]]))
        if unique_values == 2 or unique_values > threshold:
            X[X.columns[col]] = LabelEncoderMethod(X[X.columns[col]])
        else:
            one_hot_encoding_indices.append(col)

    X = OneHotEncoderMethod(one_hot_encoding_indices, X)
    return X

# Apply Encoding Selection
X = EncodingSelection(X)
```
Feature Selection
Why It Matters: Selecting the right features enhances model performance by eliminating irrelevant or redundant data, reducing overfitting, and improving computational efficiency.
Feature Selection Method Used:
- SelectKBest with Chi-Squared Test:
- Evaluates the relationship between each feature and the target variable.
- Selects the top ‘k’ features with the highest scores.
Python Implementation:
```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Initialize SelectKBest
kbest = SelectKBest(score_func=chi2, k=12)
scaler = MinMaxScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X)
kbest.fit(X_scaled, y)

# Get top features
best_features = np.argsort(kbest.scores_)[-12:]
features_to_delete = np.argsort(kbest.scores_)[:-12]
X = np.delete(X, features_to_delete, axis=1)
```
Feature Scaling
Why It Matters: Scaling ensures that all features contribute equally to the distance calculations in KNN, preventing features with larger scales from dominating.
Scaling Method Used:
- Standardization:
- Transforms data to have a mean of zero and a standard deviation of one.
- Implemented using StandardScaler (the code below passes with_mean=False, which skips mean-centering; this is typically required when the feature matrix is sparse, for example after one-hot encoding).
Python Implementation:
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# Initialize and fit the scaler
scaler = StandardScaler(with_mean=False)
scaler.fit(X_train)

# Transform the data
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```
Building and Evaluating the KNN Model
With the data preprocessed and prepared, the next step is to build the KNN model and evaluate its performance.
Model Training
Steps:
- Initialize KNN Classifier:
- Set the number of neighbors (k=8 in this case).
- Train the Model:
- Fit the KNN classifier on the training data.
- Predict:
- Use the trained model to make predictions on the test set.
- Evaluate:
- Calculate the accuracy score to assess the model’s performance.
Python Implementation:
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Initialize KNN with k=8
knnClassifier = KNeighborsClassifier(n_neighbors=8)

# Train the model
knnClassifier.fit(X_train, y_train)

# Make predictions
y_pred = knnClassifier.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
```
Output:
```
Model Accuracy: 0.68
```
Interpretation: The KNN model achieved an accuracy of approximately 68%, indicating that it correctly classified 68% of the songs in the test set.
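Because this is a multiclass problem, overall accuracy can hide large differences between genres. A quick way to see where the model struggles is to inspect per-class metrics; the sketch below simply reuses y_test and y_pred from the previous step:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1-score for each genre
print(classification_report(y_test, y_pred))

# Rows are true genres, columns are predicted genres
print(confusion_matrix(y_test, y_pred))
```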
Hyperparameter Tuning
Adjusting the number of neighbors (‘k’) can significantly impact the model’s performance. It’s advisable to experiment with different ‘k’ values to find the optimal balance between bias and variance.
```python
# Experiment with different k values
for k in range(3, 21, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"k={k}, Accuracy={accuracy:.2f}")
```
Sample Output:
```
k=3, Accuracy=0.65
k=5, Accuracy=0.66
k=7, Accuracy=0.67
k=9, Accuracy=0.68
...
k=19, Accuracy=0.65
```
Best Performance: In this scenario, a k-value of 9 yielded the highest accuracy.
Conclusion
Multiclass classification is a fundamental task in machine learning, enabling the categorization of data points into multiple classes. The K-Nearest Neighbors (KNN) algorithm, known for its simplicity and effectiveness, proves to be a strong contender for such tasks. Through this comprehensive guide, we’ve explored the intricacies of implementing KNN for multiclass classification, emphasizing the importance of data preprocessing, feature selection, and model evaluation.
By following the systematic approach outlined—from handling missing data and encoding categorical variables to selecting relevant features and scaling—you can harness the full potential of KNN for your multiclass classification problems. Remember, the key to a successful model lies not just in the algorithm but also in the quality and preparation of the data.
FAQs
1. What is the main difference between binary and multiclass classification?
Binary classification involves categorizing data into two distinct classes, whereas multiclass classification extends this to scenarios with more than two classes.
2. Why is feature scaling important for KNN?
KNN relies on distance calculations to determine the nearest neighbors. Without scaling, features with larger scales can disproportionately influence the distance metrics, leading to biased predictions.
3. How do I choose the optimal number of neighbors (k) in KNN?
The optimal ‘k’ balances bias and variance. It’s typically determined through experimentation, such as cross-validation, to identify the ‘k’ value that yields the highest accuracy.
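For illustration, here is a minimal sketch that uses scikit-learn's cross_val_score to compare candidate 'k' values (assuming the X_train and y_train variables from the case study above; the range of candidate values is arbitrary):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Average 5-fold cross-validation accuracy for each candidate k
for k in range(3, 21, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
    print(f"k={k}, mean CV accuracy={scores.mean():.2f}")
```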
4. Can KNN handle both numerical and categorical data?
KNN primarily works with numerical data. Categorical variables need to be encoded into numerical formats before applying KNN.
5. What are some alternatives to KNN for multiclass classification?
Alternatives include algorithms like Support Vector Machines (SVM), Decision Trees, Random Forests, and Neural Networks, each with its own strengths and suitable use-cases.