
Building a K-Nearest Neighbors (KNN) Model in Python: A Comprehensive Guide


Welcome to this comprehensive guide on building a K-Nearest Neighbors (KNN) model in Python. Whether you’re a data science enthusiast or a seasoned professional, this article will walk you through each step of developing a KNN classifier, from data preprocessing to model evaluation. By the end of this guide, you’ll have a solid understanding of how to implement KNN using Python’s powerful libraries.

Table of Contents

  1. Introduction to K-Nearest Neighbors (KNN)
  2. Understanding the Dataset
  3. Data Preprocessing
    1. Handling Missing Data
    2. Encoding Categorical Variables
    3. Feature Selection
    4. Train-Test Split
    5. Feature Scaling
  4. Building the KNN Model
  5. Model Evaluation
  6. Conclusion
  7. Additional Resources

Introduction to K-Nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a simple yet effective supervised machine learning algorithm used for classification and regression tasks. The KNN algorithm classifies a data point based on the classes of its nearest neighbors in the training data. It's intuitive, easy to implement, and doesn't require an explicit training phase; the trade-off is that prediction can be slow on large datasets, since every query must be compared against the stored training examples.

Key Features of KNN:

  • Lazy Learning: KNN doesn’t build an internal model; it memorizes the training dataset.
  • Instance-Based: Predictions are based on instances (neighbors) from the training data.
  • Non-Parametric: KNN makes no assumptions about the underlying data distribution.

Understanding the Dataset

For this tutorial, we’ll use the WeatherAUS dataset from Kaggle. This dataset contains weather attributes recorded over multiple years across various Australian locations.

Dataset Overview:

  • Features: Date, Location, MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustDir, WindGustSpeed, WindDir9am, WindDir3pm, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm, Temp9am, Temp3pm, RainToday, RISK_MM
  • Target Variable: RainTomorrow (Yes/No)

Data Preprocessing

Data preprocessing is a crucial step in machine learning: it transforms raw data into a clean, numeric form that a model can consume. Proper preprocessing can significantly improve the performance of machine learning algorithms.

Handling Missing Data

Missing data can adversely affect the performance of machine learning models. We’ll handle missing values for both numerical and categorical features.

Numeric Data

  1. Identify Numerical Columns:
  2. Impute Missing Values with Mean:
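
A minimal sketch of both steps, assuming the dataset has been loaded into a pandas DataFrame named df (for example, via pd.read_csv('weatherAUS.csv')):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# 1. Identify numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns
print(list(numerical_cols))

# 2. Impute missing values with the column mean
mean_imputer = SimpleImputer(strategy='mean')
df[numerical_cols] = mean_imputer.fit_transform(df[numerical_cols])
```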

Categorical Data

  1. Identify Categorical Columns:
  2. Impute Missing Values with Mode (Most Frequent):
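
And the categorical counterpart, again assuming the DataFrame is named df:

```python
from sklearn.impute import SimpleImputer

# 1. Identify categorical (object-typed) columns
categorical_cols = df.select_dtypes(include=['object']).columns
print(list(categorical_cols))

# 2. Impute missing values with the most frequent value (the mode)
mode_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = mode_imputer.fit_transform(df[categorical_cols])
```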

Encoding Categorical Variables

Machine learning algorithms require numerical input. Therefore, we need to convert categorical variables into numerical formats.

Label Encoding

Label Encoding assigns each category a unique integer based on alphabetical ordering.
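
For example, scikit-learn's LabelEncoder applied to the binary RainToday column (the column choice here is illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# 'No'/'Yes' -> 0/1 (classes are assigned in alphabetical order)
le = LabelEncoder()
df['RainToday'] = le.fit_transform(df['RainToday'])
```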

One-Hot Encoding

One-Hot Encoding creates binary columns for each category.
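
A sketch using pandas' get_dummies on the WindGustDir column (scikit-learn's OneHotEncoder would work equally well):

```python
import pandas as pd

# One binary column per wind direction: WindGustDir_N, WindGustDir_SE, ...
df = pd.get_dummies(df, columns=['WindGustDir'])
```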

Encoding Selection Function

This function decides whether to apply Label Encoding or One-Hot Encoding based on the number of unique categories.
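
A possible implementation, where the cutoff of five categories is an illustrative assumption rather than a fixed rule:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_column(df, col, max_onehot_categories=5):
    """Label-encode binary and high-cardinality columns;
    one-hot encode low-cardinality ones."""
    n_unique = df[col].nunique()
    if n_unique == 2 or n_unique > max_onehot_categories:
        df[col] = LabelEncoder().fit_transform(df[col])
    else:
        df = pd.get_dummies(df, columns=[col])
    return df
```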

Apply Encoding:
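
Looping the function over every categorical column. Dropping the Date column first is one common choice, since it has (nearly) one unique value per row:

```python
# Drop Date, or engineer month/season features from it, before encoding
df = df.drop(columns=['Date'])

# Apply the selection function to every remaining categorical column
for col in df.select_dtypes(include=['object']).columns:
    df = encode_column(df, col)
```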

Feature Selection

Selecting relevant features can enhance model performance.

  1. Apply SelectKBest with Chi-Squared Test:
  2. Resulting Shape:
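
A sketch of the selection step, with two assumptions worth flagging: k=10 is an illustrative choice, and RISK_MM is dropped because it records the next day's rainfall and would leak the target. The chi-squared test also requires non-negative inputs, hence the MinMax rescaling:

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Separate features and target; drop RISK_MM to avoid target leakage
X = df.drop(columns=['RainTomorrow', 'RISK_MM'])
y = df['RainTomorrow']

# 1. Apply SelectKBest with the chi-squared test
#    (chi2 needs non-negative features, so rescale to [0, 1] first)
X_scaled = MinMaxScaler().fit_transform(X)
selector = SelectKBest(score_func=chi2, k=10)
X_selected = selector.fit_transform(X_scaled, y)

# 2. Resulting shape: (n_samples, 10)
print(X_selected.shape)
```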

Train-Test Split

Splitting the dataset into training and testing sets ensures that the model is evaluated on unseen data.
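
A typical split, where the 80/20 ratio and the random seed are illustrative choices:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.20, random_state=42
)
```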

Feature Scaling

Feature scaling standardizes the range of independent variables, ensuring that each feature contributes equally to the result.

  1. Standardization:
  2. Check Shapes:
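
A sketch with scikit-learn's StandardScaler. The scaler is fitted on the training set only, and those statistics are reused on the test set so no information leaks from test to train:

```python
from sklearn.preprocessing import StandardScaler

# 1. Standardization: zero mean, unit variance per feature
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit on training data only
X_test = scaler.transform(X_test)        # reuse the training statistics

# 2. Check shapes
print(X_train.shape, X_test.shape)
```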

Building the KNN Model

With the data preprocessed, we’re now ready to build the KNN classifier.

  1. Import KNeighborsClassifier:
  2. Initialize the Classifier:
  3. Train the Model:
  4. Make Predictions:
  5. Single Prediction Example:
  6. Prediction Probabilities:
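
All six steps, sketched with scikit-learn. Here n_neighbors=5 is simply the library's default; treat k as a hyperparameter worth tuning:

```python
# 1. Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

# 2. Initialize the classifier
knn = KNeighborsClassifier(n_neighbors=5)

# 3. Train the model -- for KNN, "training" just stores the data
knn.fit(X_train, y_train)

# 4. Make predictions on the test set
y_pred = knn.predict(X_test)

# 5. Single prediction example: classify the first test instance
print(knn.predict(X_test[[0]]))

# 6. Prediction probabilities: the fraction of the k neighbors in each class
print(knn.predict_proba(X_test[:5]))
```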

Model Evaluation

Evaluating the model’s performance is essential to understand its accuracy and reliability.

  1. Import Accuracy Score:
  2. Calculate Accuracy:
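
A sketch of the two steps:

```python
# 1. Import the accuracy metric
from sklearn.metrics import accuracy_score

# 2. Calculate accuracy on the held-out test set
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')
```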

Interpretation:

  • The KNN model achieved an accuracy of 90.28%, meaning it correctly predicts the next day's rain status in just over 90% of test cases. Since RainTomorrow is imbalanced (dry days dominate the dataset), it's worth complementing accuracy with a confusion matrix or precision and recall before concluding the model is well-suited to the task.

Conclusion

In this guide, we’ve walked through the entire process of building a K-Nearest Neighbors (KNN) model in Python:

  1. Data Importation: Utilizing the WeatherAUS dataset.
  2. Data Preprocessing: Handling missing values, encoding categorical variables, and selecting relevant features.
  3. Train-Test Split & Feature Scaling: Preparing the data for training and ensuring uniformity across features.
  4. Model Building: Training the KNN classifier and making predictions.
  5. Model Evaluation: Assessing the model’s accuracy.

The KNN algorithm proves to be a robust choice for classification tasks, especially with well-preprocessed data. However, it’s essential to experiment with different hyperparameters (like the number of neighbors) and cross-validation techniques to further enhance model performance.


Additional Resources
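
  • scikit-learn Nearest Neighbors user guide: https://scikit-learn.org/stable/modules/neighbors.html
  • KNeighborsClassifier API reference: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
  • WeatherAUS ("Rain in Australia") dataset on Kaggle: https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package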


Happy Modeling! 🚀


Disclaimer: This article is based on a transcription of a video tutorial and supplemented with code examples from a Jupyter Notebook and Python scripts. Be sure to adapt and modify the code for your specific dataset and requirements.
