🧠 AI with Python – 📊 Feature Scaling Impact on KNN Accuracy


Description:

Feature scaling is one of the most overlooked yet crucial steps in machine learning.

Many algorithms, especially those based on distance metrics, can produce biased or inaccurate results when input features have very different numeric ranges.

In this post, we’ll see how scaling your data can dramatically improve the accuracy of a K-Nearest Neighbors (KNN) classifier.


Why Scaling Matters

KNN works by calculating distances between data points, usually Euclidean distance.

If one feature spans values like 0–1 and another spans 0–1000, the latter will dominate the distance calculation, making the smaller-scaled feature nearly irrelevant.

To ensure each feature contributes equally to the model, we normalize or standardize them using scaling techniques such as StandardScaler or MinMaxScaler.
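
To make this concrete, here's a minimal sketch with made-up values (illustrative numbers, not taken from the Wine dataset): two points differ by 0.5 on a 0–1 feature and by 100 on a 0–1000 feature, and the raw Euclidean distance is driven almost entirely by the large-range feature. Dividing each feature by its range (a simple form of min–max scaling when the minimum is 0) restores balance.

import numpy as np

# Two points: feature 1 spans 0–1, feature 2 spans 0–1000 (made-up values)
a = np.array([0.2, 300.0])
b = np.array([0.7, 400.0])

# Raw Euclidean distance: dominated by the large-range feature
print(f"Raw distance: {np.linalg.norm(a - b):.2f}")  # ~100.00; the 0.5 gap barely registers

# Rescale each feature by its range (min–max scaling with a minimum of 0)
ranges = np.array([1.0, 1000.0])
print(f"Scaled distance: {np.linalg.norm(a / ranges - b / ranges):.2f}")  # ~0.51; both features count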


Dataset and Setup

We’ll use the Wine dataset from scikit-learn, which contains 13 chemical analysis features of wines.

Some features (like alcohol percentage and color intensity) vary widely in scale, which makes the dataset ideal for this demonstration.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_wine()
X, y = data.data, data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
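
It's worth confirming that the scales really do differ before training. This quick check (continuing from the setup above, so it reuses data and X) prints each feature's range; you should see some features spanning less than 1 and others spanning more than 1,000.

import numpy as np

# Print each feature's min and max (reuses `data` and `X` from the setup above)
for name, column in zip(data.feature_names, X.T):
    print(f"{name:30s} min = {column.min():8.2f}   max = {column.max():8.2f}")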

Training Without Scaling

Let’s first train a KNN classifier on the raw dataset without scaling.

# Train KNN without feature scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)

# Evaluate model
y_pred_unscaled = knn_unscaled.predict(X_test)
acc_unscaled = accuracy_score(y_test, y_pred_unscaled)
print(f"Accuracy without scaling: {acc_unscaled:.2f}")

📊 Result: Accuracy without scaling: 0.74

Without scaling, the model accuracy suffers because certain high-range features dominate distance calculations.


Training With Feature Scaling

Next, we apply StandardScaler to ensure all features have a mean of 0 and a standard deviation of 1.

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN with scaled data
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)

# Evaluate model
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with scaling: {acc_scaled:.2f}")

📊 Result: Accuracy with scaling: 0.96

After scaling, each feature contributes proportionally, allowing the KNN model to compute distances accurately and yielding a significant improvement.
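
You can verify what StandardScaler did with a quick check: each column of the transformed training data should have a mean of roughly 0 and a standard deviation of roughly 1 (the test set will only be approximately standardized, since it was transformed with the training set's statistics).

import numpy as np

# Sanity check: per-feature mean ~0 and std ~1 on the scaled training data
print("Train means:", np.round(X_train_scaled.mean(axis=0), 2))
print("Train stds: ", np.round(X_train_scaled.std(axis=0), 2))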


Comparing Results

Model Type        Accuracy
---------------   --------
Without Scaling   0.74
With Scaling      0.96

The difference speaks for itself: feature scaling dramatically improved performance.
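
StandardScaler is not the only option. MinMaxScaler, mentioned earlier, often works just as well for KNN; the short sketch below compares both on the same split (exact numbers will vary with the dataset and random seed, so treat them as illustrative).

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Compare two common scalers on the same train/test split
for scaler in (StandardScaler(), MinMaxScaler()):
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(scaler.fit_transform(X_train), y_train)
    acc = accuracy_score(y_test, knn.predict(scaler.transform(X_test)))
    print(f"{type(scaler).__name__}: accuracy = {acc:.2f}")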


Key Insights

  • KNN, SVM, PCA, and clustering algorithms are highly sensitive to feature scaling.
  • Features with larger numeric ranges can overshadow others in distance-based models.
  • Always scale your input features unless you're using tree-based models (like RandomForest or XGBoost), which are scale-invariant; the pipeline sketch after this list shows a safe way to do it.
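
A clean way to follow that last rule is scikit-learn's Pipeline, which bundles the scaler and the classifier into one estimator. During cross-validation the scaler is then re-fit on each training fold, so no test-fold statistics leak into preprocessing. A minimal sketch, reusing X and y from above:

from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Scaler + KNN as a single estimator; the scaler is re-fit inside every CV fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")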

Conclusion

Feature scaling isn't just a preprocessing step; it's a necessity for fair and accurate model training.

As shown, applying StandardScaler boosted KNN’s accuracy from 74% to 96%, transforming it from a poorly performing model into a highly reliable one.

Before feeding your data into a model, always check your feature scales; your model's performance might just thank you for it.


Code Snippet:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


# Load dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


# Train KNN without scaling
knn_unscaled = KNeighborsClassifier(n_neighbors=5)
knn_unscaled.fit(X_train, y_train)

# Predict and evaluate
y_pred_unscaled = knn_unscaled.predict(X_test)
acc_unscaled = accuracy_score(y_test, y_pred_unscaled)
print(f"Accuracy without scaling: {acc_unscaled:.2f}")


# Apply StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Train KNN with scaled features
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy with scaling: {acc_scaled:.2f}")
