AW Dev Rethought

⚖️ There are two ways of constructing a software design: one way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. - C.A.R. Hoare

⚡️ Saturday ML Sparks – Dimensionality Reduction with PCA 📉🧠


Description:

High-dimensional datasets are common in real-world machine learning problems — especially in domains like finance, healthcare, genomics, and computer vision.

However, working directly with many features can make models harder to train, visualize, and interpret.

In this Saturday ML Spark, we explore Principal Component Analysis (PCA) — a foundational unsupervised technique used to reduce dimensionality while preserving as much information as possible.


Understanding the Problem

As the number of features grows, datasets become:

  • harder to visualize
  • more computationally expensive
  • prone to noise and multicollinearity
  • susceptible to the curse of dimensionality

Dimensionality reduction techniques like PCA address these issues by transforming the data into a lower-dimensional space that still captures the core structure and variance of the original dataset.
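
To make the idea concrete, here is a minimal sketch of what PCA does under the hood: center the data, compute the covariance matrix, and project onto its top eigenvectors. The tiny 3-feature array is made up purely for illustration, and scikit-learn's own implementation is SVD-based rather than an explicit eigendecomposition, though the result is equivalent.

import numpy as np

# Toy data: 5 samples, 3 made-up correlated features (illustrative only)
X_toy = np.array([
    [ 1.2,  1.0,  0.9],
    [-0.8, -1.1, -0.7],
    [ 0.3,  0.4,  0.1],
    [-1.0, -0.7, -1.2],
    [ 0.3,  0.4,  0.9],
])
X_toy = X_toy - X_toy.mean(axis=0)          # center each feature

cov = np.cov(X_toy, rowvar=False)           # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric matrix -> real eigenpairs

order = np.argsort(eigvals)[::-1]           # sort directions by decreasing variance
top2 = eigvecs[:, order[:2]]                # keep the two leading principal axes

X_toy_2d = X_toy @ top2                     # project: shape (5, 3) -> (5, 2)
print(X_toy_2d.shape)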


1. Load a High-Dimensional Dataset

We use the Wine dataset, which contains 13 numerical features describing chemical properties of wines from three cultivars, making it well suited to dimensionality reduction.

from sklearn.datasets import load_wine

data = load_wine()
X = data.data
y = data.target

This dataset is commonly used to demonstrate PCA and visualization techniques.
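
A quick sanity check of the dimensionality before reducing it:

print(X.shape)              # (178, 13): 178 samples, 13 numerical features
print(data.feature_names)   # chemical measurements such as alcohol, malic_acid, flavanoids, ...
print(data.target_names)    # three wine cultivars: class_0, class_1, class_2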


2. Standardize the Features

PCA is sensitive to feature scale, so standardization is a required preprocessing step.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Without scaling, features with larger magnitudes would dominate the principal components.
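
As a quick illustration (an optional aside, not part of the main pipeline), the sketch below fits PCA on the raw and the standardized data and compares how much variance the first component claims; on the raw Wine data, large-magnitude features such as proline tend to dominate.

from sklearn.decomposition import PCA

# Compare the first component's share of variance with and without scaling
pca_raw = PCA(n_components=2).fit(X)
pca_std = PCA(n_components=2).fit(X_scaled)

print("First component, raw data:   ", pca_raw.explained_variance_ratio_[0])
print("First component, scaled data:", pca_std.explained_variance_ratio_[0])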


3. Apply PCA

We reduce the dataset to two principal components for visualization.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

Each principal component is a linear combination of the original features.
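
The weights of those linear combinations live in the fitted model's components_ attribute, so you can inspect which original features drive each component; pandas is used here only for readable printing.

import pandas as pd

# Rows are principal components, columns are original features, values are loadings
loadings = pd.DataFrame(
    pca.components_,
    columns=data.feature_names,
    index=["PC1", "PC2"],
)
print(loadings.round(2))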


4. Visualize the Reduced Data

Plotting the data in two dimensions reveals structure that is otherwise hidden.

import matplotlib.pyplot as plt

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="viridis")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA – 2D Projection of Wine Dataset")
plt.show()

This visualization helps assess how well classes separate after dimensionality reduction.


5. Explained Variance Ratio

The explained variance ratio shows how much information each component captures.

import numpy as np

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance captured:", np.sum(pca.explained_variance_ratio_))

A higher cumulative variance indicates better information retention.
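
To decide how many components to keep in practice, one common approach is to fit PCA with all components and look at the cumulative ratio; scikit-learn also accepts a target variance fraction directly. A sketch:

# Fit with all components and inspect the cumulative explained variance
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print("Components needed for 95% of the variance:", np.argmax(cumulative >= 0.95) + 1)

# Shortcut: let scikit-learn pick the smallest number of components covering 95%
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print("Chosen automatically:", pca_95.n_components_)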


When to Use PCA

PCA is especially useful when:

  • datasets have many correlated features
  • visualization of high-dimensional data is required
  • noise reduction is beneficial
  • clustering performance needs improvement
  • model training speed is a concern (see the pipeline sketch after this list)
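
As one hedged example of the last point, PCA drops neatly into a scikit-learn Pipeline ahead of a downstream model; the classifier and the choice of five components below are illustrative, not prescriptive.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize -> reduce to a handful of components -> classify
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5)),           # illustrative; tune via explained variance or CV
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean().round(3))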

Key Takeaways

  1. PCA reduces dimensionality while preserving maximum variance.
  2. Feature scaling is essential before PCA whenever features are on different scales.
  3. Principal components are orthogonal and uncorrelated (checked in the snippet after this list).
  4. Explained variance helps decide how many components to keep.
  5. PCA is often used before clustering and visualization tasks.
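
Takeaway 3 can be checked directly on the projected data: the covariance matrix of the PCA scores should be (numerically) diagonal, and the component directions themselves are orthonormal.

# Off-diagonal covariance of the projected data should be ~0
print(np.cov(X_pca, rowvar=False).round(6))

# Component directions are orthonormal: this product is close to the identity
print(pca.components_ @ pca.components_.T)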

Conclusion

Principal Component Analysis is a powerful unsupervised learning technique that simplifies complex datasets while retaining as much of their variance as possible.

By reducing dimensionality, PCA makes data easier to visualize, analyze, and model — forming a critical step in many real-world ML pipelines.

Whether you’re preparing data for clustering, speeding up models, or gaining insight into feature relationships, PCA remains a foundational tool every ML practitioner should understand.


Code Snippet:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


# Load the Wine dataset (178 samples, 13 numerical features, 3 classes)
data = load_wine()
X = data.data
y = data.target


# Standardize features so that no single scale dominates the components
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


# Project the standardized data onto the two leading principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)


# Visualize the 2D projection, colored by wine class
plt.figure(figsize=(8, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="viridis")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA – 2D Projection of Wine Dataset")
plt.grid(True)
plt.show()


# Report how much of the original variance the two components retain
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance captured:", np.sum(pca.explained_variance_ratio_))
