⚡️ Saturday ML Sparks – Elbow & Silhouette Scores 📊🧠
Posted on: December 20, 2025
Description:
Choosing the right number of clusters is one of the most important—and most confusing—parts of unsupervised learning.
Unlike supervised models, clustering algorithms don’t come with labels to validate results automatically.
In this ML Spark, we explore two widely used techniques—the Elbow Method and the Silhouette Score—to evaluate and select the optimal number of clusters when using KMeans.
Understanding the Problem
KMeans requires you to specify the number of clusters (k) before training.
Choosing too small a value oversimplifies the data, while too large a value fragments it into meaningless groupings.
To make this decision more objective, we rely on evaluation metrics that quantify clustering quality instead of guessing.
1. Generate Unlabeled Data
We start by creating a synthetic dataset with natural cluster structure.
from sklearn.datasets import make_blobs

X, _ = make_blobs(
    n_samples=400,
    centers=4,
    cluster_std=0.60,
    random_state=42
)
This simulates a real-world scenario where labels are unavailable.
2. Elbow Method – Measuring Inertia
The Elbow Method evaluates how compact clusters are using inertia, the within-cluster sum of squared distances between each point and its assigned centroid.
from sklearn.cluster import KMeans

inertias = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
As k increases, inertia always decreases—but at a diminishing rate.
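To make the metric concrete, here is a small sketch that recomputes inertia by hand and checks it against what KMeans reports; the dataset and parameters mirror the ones used above.

```python
# Sketch: inertia is the sum of squared distances from each point to
# its assigned centroid. Recomputing it manually matches kmeans.inertia_.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=42)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
manual_inertia = sum(
    np.sum((X[kmeans.labels_ == c] - center) ** 2)
    for c, center in enumerate(kmeans.cluster_centers_)
)
print(np.isclose(manual_inertia, kmeans.inertia_))  # True
```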
3. Visualizing the Elbow Curve
Plotting inertia against the number of clusters reveals the “elbow”.
import matplotlib.pyplot as plt

plt.plot(range(2, 11), inertias, marker="o")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal k")
plt.show()
The elbow point indicates where adding more clusters no longer yields significant improvement.
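If you want a programmatic stand-in for eyeballing the plot, one rough heuristic (an assumption of this sketch, not part of KMeans itself) is to pick the k where the curvature of the inertia curve, its discrete second difference, is largest. It works on clearly elbowed curves; for borderline cases, inspecting the plot is still advisable.

```python
# Heuristic elbow detection: the largest second difference of the
# inertia curve marks the sharpest bend. This is a rough sketch, not
# a substitute for looking at the plot on ambiguous data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=42)

ks = list(range(2, 11))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in ks
]

# Second difference: inertias[i] - 2*inertias[i+1] + inertias[i+2]
curvature = np.diff(inertias, 2)
elbow_k = ks[int(np.argmax(curvature)) + 1]
print(elbow_k)
```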
4. Silhouette Score – Measuring Cluster Separation
The Silhouette Score evaluates how well-separated clusters are by comparing intra-cluster and inter-cluster distances.
from sklearn.metrics import silhouette_score
scores = []
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    scores.append(silhouette_score(X, labels))
Higher silhouette scores indicate better-defined clusters.
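Under the hood, each point gets a value s = (b - a) / max(a, b), where a is its mean distance to points in the same cluster and b is its mean distance to the nearest other cluster; silhouette_score is simply the mean of these per-point values, as this sketch verifies with scikit-learn's silhouette_samples.

```python
# The silhouette score is the mean of per-sample values in [-1, 1],
# where each sample's value is (b - a) / max(a, b).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

per_point = silhouette_samples(X, labels)  # one value per sample
overall = silhouette_score(X, labels)      # mean over all samples
print(np.isclose(per_point.mean(), overall))  # True
```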
5. Visualizing Silhouette Scores
plt.plot(range(2, 11), scores, marker="o")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score vs k")
plt.show()
The value of k with the highest score is often the best choice.
How to Use Both Together
- Use the Elbow Method to narrow down a reasonable range
- Use the Silhouette Score to validate cluster separation
- Prefer values where both methods agree
This combined approach leads to more confident clustering decisions.
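The combined workflow above can be sketched in a few lines: the elbow plot narrows the candidate range, and the silhouette score picks within it. The candidate range here is an assumption for illustration, you would read it off your own elbow plot.

```python
# Minimal sketch of the combined workflow: pick the best k within a
# candidate range (narrowed down by the elbow plot) by silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=42)

candidates = list(range(2, 11))  # range suggested by the elbow plot
scores = [
    silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X))
    for k in candidates
]
best_k = candidates[int(np.argmax(scores))]
print(best_k)
```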
Key Takeaways
- Selecting the correct number of clusters is critical in unsupervised learning.
- The Elbow Method highlights diminishing returns as clusters increase.
- Silhouette Score quantifies how well clusters are separated.
- Both metrics should be used together for reliable evaluation.
- These techniques are essential before finalizing any KMeans model.
Conclusion
Clustering without evaluation can easily lead to misleading insights.
By using Elbow and Silhouette methods, you replace guesswork with measurable evidence—making unsupervised learning more reliable and interpretable.
These techniques are foundational tools for anyone working with clustering algorithms and real-world unlabeled data.
Code Snippet:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
X, _ = make_blobs(
    n_samples=400,
    centers=4,
    cluster_std=0.60,
    random_state=42
)

inertias = []
k_range = range(2, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(k_range, inertias, marker="o")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method for Optimal k")
plt.grid(True)
plt.show()

silhouette_scores = []
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X)
    score = silhouette_score(X, labels)
    silhouette_scores.append(score)

plt.figure(figsize=(8, 5))
plt.plot(k_range, silhouette_scores, marker="o", color="green")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Silhouette Score")
plt.title("Silhouette Score vs Number of Clusters")
plt.grid(True)
plt.show()