Understanding K-Nearest Neighbors (KNN) for Classification and Regression
Table of Contents
- Introduction to K-Nearest Neighbors
- How KNN Works
- Choosing the Right ‘K’ Value
- KNN for Classification
- KNN for Regression
- Advantages and Disadvantages of KNN
- Implementing KNN in Python
- Practical Example
- Conclusion
1. Introduction to K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a simple, yet powerful, supervised machine learning algorithm used for both classification and regression tasks. The core idea is to predict the label of a new data point based on the labels of its ‘K’ closest neighbors in the feature space.
Why KNN?
- Simplicity: Easy to understand and implement.
- No Training Phase: KNN is a lazy learner; it builds no explicit model up front, instead storing the training data and deferring all computation to prediction time.
- Versatility: Applicable to various types of problems, including classification, regression, and even anomaly detection.
2. How KNN Works
KNN operates on the principle that similar data points are likely to have similar outcomes. Here’s a step-by-step breakdown of how the algorithm functions:
Data Representation
Imagine a two-dimensional space where each data point represents a car based on two features:
- Manufacturing Time (X-axis)
- Manufacturing Cost (Y-axis)
Data points are color-coded:
- Red Dots: Petrol cars
- Blue Dots: Electric cars
Distance Metrics
To determine the “closeness” of data points, KNN relies on a distance metric. The most commonly used metrics are listed below; a short NumPy sketch after the list shows each one in code.
- Euclidean Distance
\[ d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \]
- Used When: Data is in a continuous space.
- Pro Tip: Euclidean distance is the default metric in many KNN implementations, including scikit-learn.
- Manhattan Distance
\[ d(p, q) = \sum_{i=1}^{n} |q_i - p_i| \]
- Used When: Data is grid-like and movement is restricted to horizontal and vertical paths.
- Minkowski Distance
A generalization of both Euclidean and Manhattan distances.
\[ d(p, q) = \left( \sum_{i=1}^{n} |q_i - p_i|^r \right)^{1/r} \]
- When \( r = 1 \): Equivalent to Manhattan distance.
- When \( r = 2 \): Equivalent to Euclidean distance.
- Note: scikit-learn exposes this order as the p parameter of its KNN estimators, which is why p=1 in the code later selects Manhattan distance.
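To make these metrics concrete, here is a minimal NumPy sketch (the two points are made-up values, not taken from the car example) computing each distance between a pair of feature vectors:

```python
import numpy as np

# Two hypothetical data points (e.g., manufacturing time, manufacturing cost)
p = np.array([2.0, 4.5])
q = np.array([5.0, 1.5])

euclidean = np.sqrt(np.sum((q - p) ** 2))          # straight-line distance
manhattan = np.sum(np.abs(q - p))                   # sum of absolute differences
minkowski = np.sum(np.abs(q - p) ** 3) ** (1 / 3)   # Minkowski with order r = 3

print(euclidean, manhattan, minkowski)
```

SciPy’s scipy.spatial.distance module also provides ready-made euclidean, cityblock (Manhattan), and minkowski functions if you prefer not to write these by hand.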
3. Choosing the Right ‘K’ Value
The parameter ‘K’ determines the number of neighbors to consider when making a prediction. Selecting the optimal ‘K’ value is crucial for the performance of the KNN algorithm.
Impact of ‘K’
- Small ‘K’ (e.g., K=1):
- More sensitive to noise.
- Can lead to overfitting.
- Large ‘K’ (e.g., K=20):
- Smoother decision boundary.
- May underfit by oversimplifying the data.
Best Practices
- Cross-Validation: Use cross-validation to find the ‘K’ value that generalizes best (see the sketch after this list).
- Odd Numbers: In binary classification, odd ‘K’ values help avoid tied votes.
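As a rough illustration of the cross-validation approach, the sketch below assumes a preprocessed feature matrix X and label vector y already exist; it scores several odd ‘K’ values with scikit-learn’s cross_val_score and keeps the best one:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# X, y: preprocessed features and labels (assumed to already exist)
best_k, best_score = None, -np.inf
for k in range(1, 22, 2):  # odd values only, to avoid tied votes
    clf = KNeighborsClassifier(n_neighbors=k)
    cv_score = cross_val_score(clf, X, y, cv=5).mean()  # mean 5-fold CV accuracy
    if cv_score > best_score:
        best_k, best_score = k, cv_score

print(f"Best K = {best_k} with mean CV accuracy = {best_score:.3f}")
```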
4. KNN for Classification
In classification, KNN assigns the class most common among its ‘K’ nearest neighbors to the new data point.
Example Scenario
Consider a new car data point with specific manufacturing time and cost. The KNN algorithm will:
- Calculate the distance from this point to all other points in the dataset.
- Identify the ‘K’ closest neighbors.
- Assign the class (Electric or Petrol) based on the majority vote among these neighbors.
Sensitivity to ‘K’
Varying ‘K’ can change the classification outcome. For instance:
- K=1: The new point takes the class of its single nearest neighbor.
- K=5: The majority vote among five neighbors determines the class (see the sketch below).
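To tie this to code, here is a small hypothetical sketch using scikit-learn’s KNeighborsClassifier on made-up (manufacturing time, manufacturing cost) points, with 0 standing for Petrol and 1 for Electric:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical (manufacturing time, manufacturing cost) points
X = np.array([[2, 9], [3, 8], [4, 9],    # Petrol cars (label 0)
              [8, 3], [9, 2], [7, 4]])   # Electric cars (label 1)
y = np.array([0, 0, 0, 1, 1, 1])

new_car = np.array([[6, 5]])

for k in (1, 5):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X, y)
    print(f"K={k}: predicted class = {clf.predict(new_car)[0]}")
```

With toy data like this, both values of ‘K’ may well agree; the point is that the prediction is recomputed from the neighborhood each time, so changing ‘K’ can flip the outcome for points near the class boundary.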
5. KNN for Regression
While KNN is predominantly used for classification, it can also perform regression tasks by predicting the average value of the ‘K’ nearest neighbors.
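As a minimal sketch of that idea (with made-up one-dimensional data, not the diamond dataset used later), the prediction for a query point is simply the mean target value of its ‘K’ nearest training points:

```python
import numpy as np

# Hypothetical 1-D training data: feature value -> target value
X_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

def knn_regress(x_query, k=3):
    # Indices of the k training points closest to the query
    nearest = np.argsort(np.abs(X_train - x_query))[:k]
    # The prediction is the average of their target values
    return y_train[nearest].mean()

print(knn_regress(2.5))
```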
Challenges in Regression
- Overfitting: Lower ‘K’ values can lead to overfitting.
- Underfitting: Higher ‘K’ values may oversimplify the model.
Implementation Insights
In the provided Jupyter Notebook, KNN regression was applied to predict diamond prices. Here’s a brief overview:
- Data Preprocessing:
- Mapped categorical variables to numerical values.
- Scaled features using standardization.
- Model Training:
- Trained KNN regressor with varying ‘K’ values to determine optimal performance.
- Evaluation:
- Achieved a maximum R² score of approximately 0.98 (reported as 98.05%) at K=4.
- Visualized actual vs. predicted prices using Plotly for better interpretability.
6. Advantages and Disadvantages of KNN
Advantages
- Simple and Intuitive: Easy to understand and implement.
- No Training Phase: Reduces computational cost during training.
- Adaptable: Suitable for both classification and regression.
Disadvantages
- Computationally Intensive: Predictions require comparing the query against the entire training set, which can be slow for large datasets (one mitigation is sketched after this list).
- Sensitive to Irrelevant Features: Irrelevant or redundant features can degrade performance.
- Choosing ‘K’: Selecting the optimal ‘K’ value can be challenging.
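One common mitigation for the prediction cost, sketched below with illustrative (untuned) parameters, is to let scikit-learn build a tree-based index instead of using brute-force search:

```python
from sklearn.neighbors import KNeighborsClassifier

# 'kd_tree' (or 'ball_tree') builds a spatial index at fit time, which can
# speed up neighbor lookups on low-dimensional data; 'auto' lets scikit-learn
# decide. n_jobs=-1 parallelizes the neighbor search across CPU cores.
clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', n_jobs=-1)
# clf.fit(X_train, y_train)  # X_train, y_train assumed to exist
```

Tree-based indexes tend to help most when the number of features is small; in high dimensions, brute-force search is often just as fast.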
7. Implementing KNN in Python
Leveraging Python’s scikit-learn library simplifies the implementation of KNN. Below, we outline the key steps from data preprocessing to model evaluation.
Data Preprocessing
Before applying KNN, it’s essential to prepare the data:
- Handling Categorical Variables:
- Convert categorical text data into numerical values using mapping dictionaries.
```python
# Ordinal mappings: higher numbers indicate better cut, clarity, and color
cut_dict = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
clarity_dict = {'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4, 'VS1': 5, 'VVS2': 6, 'VVS1': 7, 'IF': 8}
color_dict = {'D': 7, 'E': 6, 'F': 5, 'G': 4, 'H': 3, 'I': 2, 'J': 1}

df['cut'] = df['cut'].map(cut_dict)
df['clarity'] = df['clarity'].map(clarity_dict)
df['color'] = df['color'].map(color_dict)

# Drop the leftover CSV index column
df = df.drop('Unnamed: 0', axis=1)
```
- Scaling Features:
- Normalize the feature set to ensure all features contribute equally to the distance calculations.
```python
from sklearn import preprocessing

# Standardize the features (and the target) to zero mean and unit variance
X = df.drop(['price'], axis=1).values
X = preprocessing.scale(X)
y = df['price'].values
y = preprocessing.scale(y)
```
Model Training and Evaluation
- Splitting the Dataset:
```python
from sklearn.utils import shuffle

# Shuffle features and targets together so the hold-out split is random
X, y = shuffle(X, y, random_state=42)

test_size = 200
X_train = X[:-test_size]
y_train = y[:-test_size]
X_test = X[-test_size:]
y_test = y[-test_size:]
```
- Training KNN Regressor:
```python
from sklearn.neighbors import KNeighborsRegressor

# weights='distance' gives closer neighbors more influence; p=1 uses Manhattan distance
score = []
for k in range(1, 20):
    clf = KNeighborsRegressor(n_neighbors=k, weights='distance', p=1)
    clf.fit(X_train, y_train)
    score.append(clf.score(X_test, y_test))
```
- Visualizing Performance:
```python
import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot

trace0 = go.Scatter(
    y=score,
    x=np.arange(1, len(score) + 1),
    mode='lines+markers',
    marker=dict(color='rgb(100, 200, 150)')
)
layout = go.Layout(
    title='K Value vs. Accuracy Score',
    xaxis=dict(title='K Value', tickmode='linear'),
    yaxis=dict(title='Score')
)
fig = go.Figure(data=[trace0], layout=layout)
iplot(fig, filename='basic-line')
```
- Determining Optimal ‘K’:
```python
k_max = score.index(max(score)) + 1
print(f"At K = {k_max}, Max Accuracy = {max(score) * 100:.2f}%")
```
Output:
```
At K = 4, Max Accuracy = 98.05%
```
- Final Model Evaluation:
```python
clf = KNeighborsRegressor(n_neighbors=50)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
y_pred = clf.predict(X_test)
```
Output:
```
0.9543611406331687
```
- Comparing Actual vs. Predicted Prices:
```python
import plotly.graph_objs as go
from plotly.offline import iplot

trace0 = go.Scatter(
    y=y_test,
    x=np.arange(200),
    mode='lines+markers',
    name='Actual Price',
    marker=dict(color='rgb(110, 10, 150)')
)
trace1 = go.Scatter(
    y=y_pred,
    x=np.arange(200),
    mode='lines+markers',
    name='Predicted Price',
    line=dict(color='rgb(200, 50, 10)', dash='dot')
)
layout = go.Layout(
    xaxis=dict(title='Index'),
    yaxis=dict(title='Normalized Price')
)
figure = go.Figure(data=[trace0, trace1], layout=layout)
iplot(figure)
```
This visualization helps in assessing the model’s prediction accuracy by overlaying actual and predicted price values.
8. Practical Example
Let’s walk through a practical implementation using Python’s scikit-learn library, as outlined in the provided Jupyter Notebook.
Step 1: Importing Necessary Libraries
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing, utils
from sklearn.neighbors import KNeighborsRegressor
import plotly.graph_objs as go
from plotly.offline import iplot
```
Step 2: Loading and Exploring the Dataset
```python
df = pd.read_csv('diamonds.csv')
print(df.head())
sns.FacetGrid(df, hue='cut', height=6).map(sns.distplot, 'price').add_legend()
plt.show()
```
Step 3: Data Preprocessing
Convert categorical variables to numerical and scale the features.
```python
cut_dict = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
clarity_dict = {'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4, 'VS1': 5, 'VVS2': 6, 'VVS1': 7, 'IF': 8}
color_dict = {'D': 7, 'E': 6, 'F': 5, 'G': 4, 'H': 3, 'I': 2, 'J': 1}

df['cut'] = df['cut'].map(cut_dict)
df['clarity'] = df['clarity'].map(clarity_dict)
df['color'] = df['color'].map(color_dict)
df = df.drop('Unnamed: 0', axis=1)
```
Step 4: Feature Scaling and Data Shuffling
```python
df = utils.shuffle(df, random_state=42)
X = df.drop(['price'], axis=1).values
X = preprocessing.scale(X)
y = df['price'].values
y = preprocessing.scale(y)
```
Step 5: Splitting the Dataset
```python
test_size = 200
X_train = X[:-test_size]
y_train = y[:-test_size]
X_test = X[-test_size:]
y_test = y[-test_size:]
```
Step 6: Training the KNN Regressor and Evaluating Performance
```python
score = []
for k in range(1, 20):
    clf = KNeighborsRegressor(n_neighbors=k, weights='distance', p=1)
    clf.fit(X_train, y_train)
    score.append(clf.score(X_test, y_test))
```
Step 7: Visualizing the Accuracy Scores
```python
trace0 = go.Scatter(
    y=score,
    x=np.arange(1, len(score) + 1),
    mode='lines+markers',
    marker=dict(color='rgb(100, 200, 150)')
)
layout = go.Layout(
    title='K Value vs. Accuracy Score',
    xaxis=dict(title='K Value', tickmode='linear'),
    yaxis=dict(title='Score')
)
fig = go.Figure(data=[trace0], layout=layout)
iplot(fig, filename='basic-line')
```
Step 8: Determining the Optimal ‘K’ Value
```python
k_max = score.index(max(score)) + 1
print(f"At K = {k_max}, Max Accuracy = {max(score) * 100:.2f}%")
```
Step 9: Final Model Training and Prediction
```python
clf = KNeighborsRegressor(n_neighbors=50)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
y_pred = clf.predict(X_test)
```
Step 10: Comparing Actual vs. Predicted Values
```python
trace0 = go.Scatter(
    y=y_test,
    x=np.arange(200),
    mode='lines+markers',
    name='Actual Price',
    marker=dict(color='rgb(110, 10, 150)')
)
trace1 = go.Scatter(
    y=y_pred,
    x=np.arange(200),
    mode='lines+markers',
    name='Predicted Price',
    line=dict(color='rgb(200, 50, 10)', dash='dot')
)
layout = go.Layout(
    xaxis=dict(title='Index'),
    yaxis=dict(title='Normalized Price')
)
figure = go.Figure(data=[trace0, trace1], layout=layout)
iplot(figure)
```
The plots generated provide a visual representation of how well the KNN model predicts diamond prices based on the selected ‘K’ value.
9. Conclusion
The K-Nearest Neighbors algorithm is a versatile and straightforward machine learning tool suitable for various applications in classification and regression. Its effectiveness largely depends on the choice of ‘K’ and the distance metric used. Proper data preprocessing and feature scaling are crucial steps to enhance model performance. While KNN is computationally intensive for large datasets, its simplicity makes it an excellent starting point for machine learning practitioners.
We hope this guide has provided you with a clear understanding of the K-Nearest Neighbors algorithm. Stay tuned for more in-depth tutorials and insights into machine learning techniques.