⚡️ Saturday ML Spark – ⚖️ Imbalanced Data (SMOTE vs class_weight)
Posted on: April 18, 2026
Description:
In many real-world machine learning problems, data is not evenly distributed across classes. One class often dominates, while the other — usually the more important one — appears rarely.
This is known as class imbalance, and it can severely impact model performance.
In this project, we explore two practical techniques to handle imbalanced data: SMOTE and class_weight.
Understanding the Problem
In an imbalanced dataset:
- Majority class → appears frequently
- Minority class → appears rarely
For example:
- Fraud detection → very few fraudulent transactions
- Medical diagnosis → rare disease cases
- Churn prediction → fewer churned customers
A model trained on such data may achieve high accuracy while completely ignoring the minority class.
Why Imbalance Is a Problem
Consider a dataset with:
- 90% class A
- 10% class B
A model predicting only class A achieves 90% accuracy — but is useless in practice.
This is why we must focus on:
- Precision
- Recall
- F1 Score
instead of just accuracy.
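To see this concretely, here is a minimal sketch (using toy labels, not a trained model) of the 90/10 scenario above, where a "model" that always predicts class A scores 90% accuracy but 0% recall on class B:

```python
from sklearn.metrics import accuracy_score, recall_score

# Ground truth: 90 samples of the majority class (0), 10 of the minority class (1)
y_true = [0] * 90 + [1] * 10

# A "lazy" predictor that always outputs the majority class
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))             # 0.9
print(recall_score(y_true, y_pred, pos_label=1))  # 0.0 — every minority case missed
```

High accuracy, zero minority recall: exactly the failure mode accuracy alone hides.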
Baseline Model (No Handling)
We first train a model without addressing imbalance.
model = LogisticRegression()
model.fit(X_train, y_train)
This usually results in:
- high accuracy
- poor recall for minority class
Approach 1: SMOTE (Synthetic Oversampling)
SMOTE generates synthetic examples for the minority class.
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
This balances the dataset before training.
model.fit(X_train_smote, y_train_smote)
SMOTE helps the model learn the minority class's patterns more effectively.
Approach 2: class_weight
Instead of changing the data, we adjust the model’s learning behaviour.
model = LogisticRegression(class_weight="balanced")
model.fit(X_train, y_train)
This increases the penalty for misclassifying the minority class.
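Under the hood, "balanced" uses the heuristic n_samples / (n_classes * count_per_class), so rarer classes get proportionally larger weights. A small sketch verifying this against sklearn's own helper:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 90 majority samples (class 0) and 10 minority samples (class 1)
y = np.array([0] * 90 + [1] * 10)

# Manual "balanced" heuristic: n_samples / (n_classes * count_per_class)
classes, counts = np.unique(y, return_counts=True)
manual = len(y) / (len(classes) * counts)
print(dict(zip(classes, manual)))  # minority class gets ~9x the weight

# sklearn computes the same weights internally
sk = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(sk)
```

With these weights, each minority mistake costs the loss function about nine times as much as a majority mistake, which is what pushes the model to stop ignoring the rare class.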
SMOTE vs class_weight
Both methods aim to improve minority class performance, but they work differently:
- SMOTE
  - modifies the data
  - increases minority samples
  - may introduce synthetic noise
- class_weight
  - keeps the data unchanged
  - adjusts model training
  - simpler and faster
Choosing between them depends on the dataset and use case.
Why This Matters
Handling imbalance is critical in:
- fraud detection systems
- healthcare models
- anomaly detection
- recommendation systems
Ignoring imbalance can lead to models that look good on paper but fail in real-world scenarios.
Key Takeaways
- Imbalanced data biases models toward the majority class.
- Accuracy alone is not a reliable metric.
- SMOTE creates synthetic minority samples.
- class_weight adjusts model learning without changing data.
- Both techniques improve real-world model performance.
Conclusion
Class imbalance is one of the most common challenges in machine learning. By using techniques like SMOTE and class_weight, we can build models that better handle rare but critical cases, making them more useful in real-world applications.
This makes imbalance handling an essential topic in Saturday ML Spark ⚡️ – Advanced & Practical.
Code Snippet:
# 📦 Import Required Libraries
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
# 🧩 Create an Imbalanced Dataset
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.90, 0.10],
    random_state=42
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
y = pd.Series(y, name="target")
# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)
# =========================================================
# 🚨 Baseline Model (No Imbalance Handling)
# =========================================================
baseline_model = LogisticRegression(max_iter=5000)
baseline_model.fit(X_train, y_train)
baseline_pred = baseline_model.predict(X_test)
print("=== Baseline Model ===")
print(classification_report(y_test, baseline_pred))
print(confusion_matrix(y_test, baseline_pred))
# =========================================================
# 🔁 Approach 1 – SMOTE
# =========================================================
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
smote_model = LogisticRegression(max_iter=5000)
smote_model.fit(X_train_smote, y_train_smote)
smote_pred = smote_model.predict(X_test)
print("\n=== SMOTE Model ===")
print(classification_report(y_test, smote_pred))
print(confusion_matrix(y_test, smote_pred))
# =========================================================
# ⚖️ Approach 2 – class_weight
# =========================================================
weighted_model = LogisticRegression(
    max_iter=5000,
    class_weight="balanced"
)
weighted_model.fit(X_train, y_train)
weighted_pred = weighted_model.predict(X_test)
print("\n=== class_weight Model ===")
print(classification_report(y_test, weighted_pred))
print(confusion_matrix(y_test, weighted_pred))