⚡️ Saturday ML Sparks – Dimensionality Reduction with PCA 📉🧠
Posted on: December 27, 2025
Description:
High-dimensional datasets are common in real-world machine learning problems — especially in domains like finance, healthcare, genomics, and computer vision.
However, working directly with many features can make models harder to train, visualize, and interpret.
In this Saturday ML Spark, we explore Principal Component Analysis (PCA) — a foundational unsupervised technique used to reduce dimensionality while preserving as much information as possible.
Understanding the Problem
As the number of features grows, datasets become:
- harder to visualize
- more computationally expensive
- prone to noise and multicollinearity
- susceptible to the curse of dimensionality
Dimensionality reduction techniques like PCA address these issues by transforming the data into a lower-dimensional space that still captures the core structure and variance of the original dataset.
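For intuition about what PCA actually computes, here is a minimal NumPy sketch of the underlying idea (center the data, build the covariance matrix, keep the top eigenvectors, project onto them). The small random dataset and variable names here are purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))          # illustrative 100 x 5 dataset
X_centered = X_demo - X_demo.mean(axis=0)   # 1. center each feature
cov = np.cov(X_centered, rowvar=False)      # 2. 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # 3. eigendecomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]           #    rank directions by variance explained
top2 = eigvecs[:, order[:2]]                #    keep the two strongest directions
X_projected = X_centered @ top2             # 4. project onto the 2D subspace
print(X_projected.shape)                    # (100, 2)

Scikit-learn's PCA performs this same procedure (using a numerically stable SVD) and is what we use in the steps below.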
1. Load a High-Dimensional Dataset
We use the Wine dataset from scikit-learn, which describes 178 wine samples from three cultivars using 13 numerical features, making it a convenient candidate for dimensionality reduction.
from sklearn.datasets import load_wine
data = load_wine()
X = data.data
y = data.target
This dataset is commonly used to demonstrate PCA and visualization techniques.
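If you want to confirm the dimensionality before reducing it, a quick inspection shows 178 samples, 13 features, and three target classes:

print(X.shape)               # (178, 13)
print(data.feature_names)    # names of the 13 chemical measurements
print(data.target_names)     # ['class_0', 'class_1', 'class_2']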
2. Standardize the Features
PCA is sensitive to feature scale, so standardization is an essential preprocessing step whenever features are measured in different units or magnitudes, as they are here.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Without scaling, features with larger magnitudes would dominate the principal components.
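As an optional sanity check, each standardized feature should now have (approximately) zero mean and unit variance:

print(X_scaled.mean(axis=0).round(3))   # all values close to 0
print(X_scaled.std(axis=0).round(3))    # all values close to 1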
3. Apply PCA
We reduce the dataset to two principal components for visualization.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
Each principal component is a linear combination of the original features.
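To see these combinations explicitly, you can inspect pca.components_, whose rows hold the weights that map the 13 standardized features to each component (pandas is used here only for readable output and is not required for PCA itself):

import pandas as pd

loadings = pd.DataFrame(pca.components_, columns=data.feature_names, index=["PC1", "PC2"])
print(loadings.round(2))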
4. Visualize the Reduced Data
Plotting the data in two dimensions reveals structure that is otherwise hidden.
import matplotlib.pyplot as plt
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="viridis")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA – 2D Projection of Wine Dataset")
plt.show()
This visualization helps assess how well classes separate after dimensionality reduction.
5. Explained Variance Ratio
The explained variance ratio shows how much information each component captures.
import numpy as np
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance captured:", np.sum(pca.explained_variance_ratio_))
A higher cumulative variance indicates better information retention.
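One common way to decide how many components to keep is to inspect the cumulative explained variance, or to let scikit-learn choose the smallest number of components that reaches a target fraction of variance. A short sketch (variable names are illustrative):

pca_full = PCA().fit(X_scaled)                      # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative.round(3))                          # running total of retained variance

pca_95 = PCA(n_components=0.95).fit(X_scaled)       # retain at least 95% of the variance
print("Components needed for 95% variance:", pca_95.n_components_)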
When to Use PCA
PCA is especially useful when:
- datasets have many correlated features
- visualization of high-dimensional data is required
- noise reduction is beneficial
- clustering performance needs improvement
- model training speed is a concern (see the pipeline sketch after this list)
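As a concrete example of the last two points, here is a small sketch that uses PCA as a preprocessing step inside a scikit-learn pipeline. The logistic regression classifier and the cross-validation settings are illustrative choices, not part of the walkthrough above:

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Scale, project to 2 components, then classify; X and y are the arrays loaded earlier
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean().round(3))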
Key Takeaways
- PCA reduces dimensionality while preserving maximum variance.
- Feature scaling is essential before applying PCA whenever features are on different scales.
- Principal components are orthogonal and uncorrelated.
- Explained variance helps decide how many components to keep.
- PCA is often used before clustering and visualization tasks.
Conclusion
Principal Component Analysis is a powerful unsupervised learning technique that simplifies complex datasets while retaining as much of the original variance as possible.
By reducing dimensionality, PCA makes data easier to visualize, analyze, and model — forming a critical step in many real-world ML pipelines.
Whether you’re preparing data for clustering, speeding up models, or gaining insight into feature relationships, PCA remains a foundational tool every ML practitioner should understand.
Code Snippet:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1. Load the Wine dataset (178 samples, 13 features, 3 classes)
data = load_wine()
X = data.data
y = data.target

# 2. Standardize the features so each has zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Reduce the 13-dimensional data to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# 4. Visualize the 2D projection, colored by wine class
plt.figure(figsize=(8, 5))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="viridis")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA – 2D Projection of Wine Dataset")
plt.grid(True)
plt.show()

# 5. Report how much variance the two components retain
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance captured:", np.sum(pca.explained_variance_ratio_))