Understanding K-Nearest Neighbors (KNN) for Classification and Regression
Table of Contents
- Introduction to K-Nearest Neighbors
- How KNN Works
- Choosing the Right ‘K’ Value
- KNN for Classification
- KNN for Regression
- Advantages and Disadvantages of KNN
- Implementing KNN in Python
- Practical Example
- Conclusion
1. Introduction to K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a simple, yet powerful, supervised machine learning algorithm used for both classification and regression tasks. The core idea is to predict the label of a new data point based on the labels of its ‘K’ closest neighbors in the feature space.
Why KNN?
- Simplicity: Easy to understand and implement.
- No Training Phase: KNN is a lazy learner; it builds no explicit model up front, instead storing the training data and deferring all computation to prediction time.
- Versatility: Applicable to various types of problems, including classification, regression, and even anomaly detection.
2. How KNN Works
KNN operates on the principle that similar data points are likely to have similar outcomes. Here’s a step-by-step breakdown of how the algorithm functions:
Data Representation
Imagine a two-dimensional space where each data point represents a car based on two features:
- Manufacturing Time (X-axis)
- Manufacturing Cost (Y-axis)
Data points are color-coded:
- Red Dots: Petrol cars
- Blue Dots: Electric cars
Distance Metrics
To determine the “closeness” of data points, KNN relies on a distance metric. The most commonly used metrics are listed below; a short NumPy sketch after the list shows each one in code.
- Euclidean Distance
\[ d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \]
- Used When: Data is in a continuous space.
- Pro Tip: Euclidean distance is the default metric in many KNN implementations, including scikit-learn.
- Manhattan Distance
\[ d(p, q) = \sum_{i=1}^{n} |q_i - p_i| \]
- Used When: Data is grid-like and movement is restricted to horizontal and vertical paths.
- Minkowski Distance
A generalization of both Euclidean and Manhattan distances.
\[ d(p, q) = \left( \sum_{i=1}^{n} |q_i - p_i|^r \right)^{1/r} \]
- When \( r = 1 \): Equivalent to Manhattan distance.
- When \( r = 2 \): Equivalent to Euclidean distance.
- Note: scikit-learn exposes this order as the p parameter of its KNN estimators, which is why p=1 in the code later selects Manhattan distance.
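To make these metrics concrete, here is a minimal NumPy sketch (the two points are made-up values, not taken from the car example) computing each distance between a pair of feature vectors:

```python
import numpy as np

# Two hypothetical data points (e.g., manufacturing time, manufacturing cost)
p = np.array([2.0, 4.5])
q = np.array([5.0, 1.5])

euclidean = np.sqrt(np.sum((q - p) ** 2))          # straight-line distance
manhattan = np.sum(np.abs(q - p))                   # sum of absolute differences
minkowski = np.sum(np.abs(q - p) ** 3) ** (1 / 3)   # Minkowski with order r = 3

print(euclidean, manhattan, minkowski)
```

SciPy’s scipy.spatial.distance module also provides ready-made euclidean, cityblock (Manhattan), and minkowski functions if you prefer not to write these by hand.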
3. Choosing the Right ‘K’ Value
The parameter ‘K’ determines the number of neighbors to consider when making a prediction. Selecting the optimal ‘K’ value is crucial for the performance of the KNN algorithm.
Impact of ‘K’
- Small ‘K’ (e.g., K=1):
- More sensitive to noise.
- Can lead to overfitting.
- Large ‘K’ (e.g., K=20):
- Smoother decision boundary.
- May underfit by oversimplifying the data.
Best Practices
- Cross-Validation: Use cross-validation to find the ‘K’ value that generalizes best (see the sketch after this list).
- Odd Numbers: In binary classification, odd ‘K’ values help avoid tied votes.
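As a rough illustration of the cross-validation approach, the sketch below assumes a preprocessed feature matrix X and label vector y already exist; it scores several odd ‘K’ values with scikit-learn’s cross_val_score and keeps the best one:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# X, y: preprocessed features and labels (assumed to already exist)
best_k, best_score = None, -np.inf
for k in range(1, 22, 2):  # odd values only, to avoid tied votes
    clf = KNeighborsClassifier(n_neighbors=k)
    cv_score = cross_val_score(clf, X, y, cv=5).mean()  # mean 5-fold CV accuracy
    if cv_score > best_score:
        best_k, best_score = k, cv_score

print(f"Best K = {best_k} with mean CV accuracy = {best_score:.3f}")
```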
4. KNN for Classification
In classification, KNN assigns the class most common among its ‘K’ nearest neighbors to the new data point.
Example Scenario
Consider a new car data point with specific manufacturing time and cost. The KNN algorithm will:
- Calculate the distance from this point to all other points in the dataset.
- Identify the ‘K’ closest neighbors.
- Assign the class (Electric or Petrol) based on the majority vote among these neighbors.
Sensitivity to ‘K’
Varying ‘K’ can change the classification outcome. For instance:
- K=1: The new point takes the class of its single nearest neighbor.
- K=5: The majority vote among five neighbors determines the class (see the sketch below).
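To tie this to code, here is a small hypothetical sketch using scikit-learn’s KNeighborsClassifier on made-up (manufacturing time, manufacturing cost) points, with 0 standing for Petrol and 1 for Electric:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical (manufacturing time, manufacturing cost) points
X = np.array([[2, 9], [3, 8], [4, 9],    # Petrol cars (label 0)
              [8, 3], [9, 2], [7, 4]])   # Electric cars (label 1)
y = np.array([0, 0, 0, 1, 1, 1])

new_car = np.array([[6, 5]])

for k in (1, 5):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X, y)
    print(f"K={k}: predicted class = {clf.predict(new_car)[0]}")
```

With toy data like this, both values of ‘K’ may well agree; the point is that the prediction is recomputed from the neighborhood each time, so changing ‘K’ can flip the outcome for points near the class boundary.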
5. KNN for Regression
While KNN is predominantly used for classification, it can also perform regression tasks by predicting the average value of the ‘K’ nearest neighbors.
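As a minimal sketch of that idea (with made-up one-dimensional data, not the diamond dataset used later), the prediction for a query point is simply the mean target value of its ‘K’ nearest training points:

```python
import numpy as np

# Hypothetical 1-D training data: feature value -> target value
X_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

def knn_regress(x_query, k=3):
    # Indices of the k training points closest to the query
    nearest = np.argsort(np.abs(X_train - x_query))[:k]
    # The prediction is the average of their target values
    return y_train[nearest].mean()

print(knn_regress(2.5))
```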
Challenges in Regression
- Overfitting: Lower ‘K’ values can lead to overfitting.
- Underfitting: Higher ‘K’ values may oversimplify the model.
Implementation Insights
In the provided Jupyter Notebook, KNN regression was applied to predict diamond prices. Here’s a brief overview:
- Data Preprocessing:
- Mapped categorical variables to numerical values.
- Scaled features using standardization.
- Model Training:
- Trained KNN regressor with varying ‘K’ values to determine optimal performance.
- Evaluation:
- Achieved a maximum R² score of approximately 0.98 (reported as 98.05%) at K=4.
- Visualized actual vs. predicted prices using Plotly for better interpretability.
6. Advantages and Disadvantages of KNN
Advantages
- Simple and Intuitive: Easy to understand and implement.
- No Training Phase: Reduces computational cost during training.
- Adaptable: Suitable for both classification and regression.
Disadvantages
- Computationally Intensive: Predictions require comparing the query against the entire training set, which can be slow for large datasets (one mitigation is sketched after this list).
- Sensitive to Irrelevant Features: Irrelevant or redundant features can degrade performance.
- Choosing ‘K’: Selecting the optimal ‘K’ value can be challenging.
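One common mitigation for the prediction cost, sketched below with illustrative (untuned) parameters, is to let scikit-learn build a tree-based index instead of using brute-force search:

```python
from sklearn.neighbors import KNeighborsClassifier

# 'kd_tree' (or 'ball_tree') builds a spatial index at fit time, which can
# speed up neighbor lookups on low-dimensional data; 'auto' lets scikit-learn
# decide. n_jobs=-1 parallelizes the neighbor search across CPU cores.
clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', n_jobs=-1)
# clf.fit(X_train, y_train)  # X_train, y_train assumed to exist
```

Tree-based indexes tend to help most when the number of features is small; in high dimensions, brute-force search is often just as fast.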
7. Implementing KNN in Python
Leveraging Python’s scikit-learn library simplifies the implementation of KNN. Below, we outline the key steps from data preprocessing to model evaluation.
Data Preprocessing
Before applying KNN, it’s essential to prepare the data:
- Handling Categorical Variables:
- Convert categorical text data into numerical values using mapping dictionaries.
```python
# Ordinal mappings: higher numbers indicate better cut, clarity, and color
cut_dict = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
clarity_dict = {'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4, 'VS1': 5, 'VVS2': 6, 'VVS1': 7, 'IF': 8}
color_dict = {'D': 7, 'E': 6, 'F': 5, 'G': 4, 'H': 3, 'I': 2, 'J': 1}

df['cut'] = df['cut'].map(cut_dict)
df['clarity'] = df['clarity'].map(clarity_dict)
df['color'] = df['color'].map(color_dict)

# Drop the leftover CSV index column
df = df.drop('Unnamed: 0', axis=1)
```
- Scaling Features:
- Normalize the feature set to ensure all features contribute equally to the distance calculations.
```python
from sklearn import preprocessing

# Standardize the features (and the target) to zero mean and unit variance
X = df.drop(['price'], axis=1).values
X = preprocessing.scale(X)
y = df['price'].values
y = preprocessing.scale(y)
```
Model Training and Evaluation
- Splitting the Dataset:
```python
from sklearn.utils import shuffle

# Shuffle features and targets together so the hold-out split is random
X, y = shuffle(X, y, random_state=42)

test_size = 200
X_train = X[:-test_size]
y_train = y[:-test_size]
X_test = X[-test_size:]
y_test = y[-test_size:]
```
- Training KNN Regressor:
```python
from sklearn.neighbors import KNeighborsRegressor

# weights='distance' gives closer neighbors more influence; p=1 uses Manhattan distance
score = []
for k in range(1, 20):
    clf = KNeighborsRegressor(n_neighbors=k, weights='distance', p=1)
    clf.fit(X_train, y_train)
    score.append(clf.score(X_test, y_test))
```
- Visualizing Performance:
```python
import numpy as np
import plotly.graph_objs as go
from plotly.offline import iplot

trace0 = go.Scatter(
    y=score,
    x=np.arange(1, len(score) + 1),
    mode='lines+markers',
    marker=dict(color='rgb(100, 200, 150)')
)
layout = go.Layout(
    title='K Value vs. Accuracy Score',
    xaxis=dict(title='K Value', tickmode='linear'),
    yaxis=dict(title='Score')
)
fig = go.Figure(data=[trace0], layout=layout)
iplot(fig, filename='basic-line')
```
- Determining Optimal ‘K’:
```python
k_max = score.index(max(score)) + 1
print(f"At K = {k_max}, Max Accuracy = {max(score) * 100:.2f}%")
```
Output:
```
At K = 4, Max Accuracy = 98.05%
```
- Final Model Evaluation:
```python
clf = KNeighborsRegressor(n_neighbors=50)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
y_pred = clf.predict(X_test)
```
Output:
```
0.9543611406331687
```
- Comparing Actual vs. Predicted Prices:
```python
import plotly.graph_objs as go
from plotly.offline import iplot

trace0 = go.Scatter(
    y=y_test,
    x=np.arange(200),
    mode='lines+markers',
    name='Actual Price',
    marker=dict(color='rgb(110, 10, 150)')
)
trace1 = go.Scatter(
    y=y_pred,
    x=np.arange(200),
    mode='lines+markers',
    name='Predicted Price',
    line=dict(color='rgb(200, 50, 10)', dash='dot')
)
layout = go.Layout(
    xaxis=dict(title='Index'),
    yaxis=dict(title='Normalized Price')
)
figure = go.Figure(data=[trace0, trace1], layout=layout)
iplot(figure)
```
This visualization helps in assessing the model’s prediction accuracy by overlaying actual and predicted price values.
8. Practical Example
Let’s walk through a practical implementation using Python’s scikit-learn library, as outlined in the provided Jupyter Notebook.
Step 1: Importing Necessary Libraries
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing, utils
from sklearn.neighbors import KNeighborsRegressor
import plotly.graph_objs as go
from plotly.offline import iplot
```
Step 2: Loading and Exploring the Dataset
```python
df = pd.read_csv('diamonds.csv')
print(df.head())
sns.FacetGrid(df, hue='cut', height=6).map(sns.distplot, 'price').add_legend()
plt.show()
```
Step 3: Data Preprocessing
Convert categorical variables to numerical and scale the features.
```python
cut_dict = {'Fair': 1, 'Good': 2, 'Very Good': 3, 'Premium': 4, 'Ideal': 5}
clarity_dict = {'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4, 'VS1': 5, 'VVS2': 6, 'VVS1': 7, 'IF': 8}
color_dict = {'D': 7, 'E': 6, 'F': 5, 'G': 4, 'H': 3, 'I': 2, 'J': 1}

df['cut'] = df['cut'].map(cut_dict)
df['clarity'] = df['clarity'].map(clarity_dict)
df['color'] = df['color'].map(color_dict)
df = df.drop('Unnamed: 0', axis=1)
```
Step 4: Feature Scaling and Data Shuffling
```python
df = utils.shuffle(df, random_state=42)
X = df.drop(['price'], axis=1).values
X = preprocessing.scale(X)
y = df['price'].values
y = preprocessing.scale(y)
```
Step 5: Splitting the Dataset
```python
test_size = 200
X_train = X[:-test_size]
y_train = y[:-test_size]
X_test = X[-test_size:]
y_test = y[-test_size:]
```
Step 6: Training the KNN Regressor and Evaluating Performance
```python
score = []
for k in range(1, 20):
    clf = KNeighborsRegressor(n_neighbors=k, weights='distance', p=1)
    clf.fit(X_train, y_train)
    score.append(clf.score(X_test, y_test))
```
Step 7: Visualizing the Accuracy Scores
```python
trace0 = go.Scatter(
    y=score,
    x=np.arange(1, len(score) + 1),
    mode='lines+markers',
    marker=dict(color='rgb(100, 200, 150)')
)
layout = go.Layout(
    title='K Value vs. Accuracy Score',
    xaxis=dict(title='K Value', tickmode='linear'),
    yaxis=dict(title='Score')
)
fig = go.Figure(data=[trace0], layout=layout)
iplot(fig, filename='basic-line')
```
Step 8: Determining the Optimal ‘K’ Value
```python
k_max = score.index(max(score)) + 1
print(f"At K = {k_max}, Max Accuracy = {max(score) * 100:.2f}%")
```
Step 9: Final Model Training and Prediction
```python
clf = KNeighborsRegressor(n_neighbors=50)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
y_pred = clf.predict(X_test)
```
Step 10: Comparing Actual vs. Predicted Values
```python
trace0 = go.Scatter(
    y=y_test,
    x=np.arange(200),
    mode='lines+markers',
    name='Actual Price',
    marker=dict(color='rgb(110, 10, 150)')
)
trace1 = go.Scatter(
    y=y_pred,
    x=np.arange(200),
    mode='lines+markers',
    name='Predicted Price',
    line=dict(color='rgb(200, 50, 10)', dash='dot')
)
layout = go.Layout(
    xaxis=dict(title='Index'),
    yaxis=dict(title='Normalized Price')
)
figure = go.Figure(data=[trace0, trace1], layout=layout)
iplot(figure)
```
The plots generated provide a visual representation of how well the KNN model predicts diamond prices based on the selected ‘K’ value.
9. Conclusion
The K-Nearest Neighbors algorithm is a versatile and straightforward machine learning tool suitable for various applications in classification and regression. Its effectiveness largely depends on the choice of ‘K’ and the distance metric used. Proper data preprocessing and feature scaling are crucial steps to enhance model performance. While KNN is computationally intensive for large datasets, its simplicity makes it an excellent starting point for machine learning practitioners.
We hope this guide has provided you with a clear understanding of the K-Nearest Neighbors algorithm. Stay tuned for more in-depth tutorials and insights into machine learning techniques.