AW Dev Rethought

"Truth can only be found in one place: the code." - Robert C. Martin

⚡️ Saturday ML Spark – ⚖️ Imbalanced Data (SMOTE vs class_weight)


Description:

In many real-world machine learning problems, data is not evenly distributed across classes. One class often dominates, while the other — usually the more important one — appears rarely.

This is known as class imbalance, and it can severely impact model performance.

In this project, we explore two practical techniques to handle imbalanced data: SMOTE and class_weight.


Understanding the Problem

In an imbalanced dataset:

  • Majority class → appears frequently
  • Minority class → appears rarely

For example:

  • Fraud detection → very few fraudulent transactions
  • Medical diagnosis → rare disease cases
  • Churn prediction → fewer churned customers

A model trained on such data may achieve high accuracy while completely ignoring the minority class.


Why Imbalance Is a Problem

Consider a dataset with:

  • 90% class A
  • 10% class B

A model predicting only class A achieves 90% accuracy — but is useless in practice.

This is why we must focus on:

  • Precision
  • Recall
  • F1 Score

instead of just accuracy.
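The accuracy paradox above is easy to demonstrate directly. Here is a minimal sketch (the 90/10 toy labels are invented for illustration) in which a "model" that always predicts the majority class scores 90% accuracy yet has zero recall for the minority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy test set: 90 samples of class 0 (majority), 10 of class 1 (minority)
y_true = np.array([0] * 90 + [1] * 10)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))                    # 0.9 — looks great
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0 — every minority case missed
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```

Recall, precision, and F1 all collapse to zero for the class we actually care about, which is exactly why accuracy alone is misleading on imbalanced data.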


Baseline Model (No Handling)

We first train a model without addressing imbalance.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

This usually results in:

  • high accuracy
  • poor recall for minority class

Approach 1: SMOTE (Synthetic Oversampling)

SMOTE generates synthetic examples for the minority class.

from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

This balances the dataset before training.

model.fit(X_train_smote, y_train_smote)

SMOTE helps the model learn minority-class patterns more effectively: rather than duplicating existing rows, it interpolates new samples between a minority example and its nearest minority-class neighbours.


Approach 2: class_weight

Instead of changing the data, we adjust the model’s learning behaviour.

model = LogisticRegression(class_weight="balanced")
model.fit(X_train, y_train)

This increases the penalty for misclassifying the minority class.
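Concretely, scikit-learn's "balanced" mode sets each class weight to n_samples / (n_classes * count(class)). A minimal sketch with invented 90/10 toy labels shows the weights it produces:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 90 majority (class 0), 10 minority (class 1)
y = np.array([0] * 90 + [1] * 10)

# "balanced" weight = n_samples / (n_classes * bincount(y))
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)  # [0.5555..., 5.0]
```

With a 9:1 imbalance, each minority-class error is penalised nine times as heavily as a majority-class error, which is what pushes the model to stop ignoring the rare class.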


SMOTE vs class_weight

Both methods aim to improve minority class performance, but they work differently:

  • SMOTE
    • modifies the data
    • increases minority samples
    • may introduce synthetic noise
  • class_weight
    • keeps data unchanged
    • adjusts model training
    • simpler and faster

Choosing between them depends on the dataset and use case.


Why This Matters

Handling imbalance is critical in:

  • fraud detection systems
  • healthcare models
  • anomaly detection
  • recommendation systems

Ignoring imbalance can lead to models that look good on paper but fail in real-world scenarios.


Key Takeaways

  1. Imbalanced data biases models toward the majority class.
  2. Accuracy alone is not a reliable metric.
  3. SMOTE creates synthetic minority samples.
  4. class_weight adjusts model learning without changing data.
  5. Both techniques improve real-world model performance.

Conclusion

Class imbalance is one of the most common challenges in machine learning. By using techniques like SMOTE and class_weight, we can build models that better handle rare but critical cases, making them more useful in real-world applications.

This makes imbalance handling an essential topic in Saturday ML Spark ⚡️ – Advanced & Practical.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

from imblearn.over_sampling import SMOTE


# 🧩 Create an Imbalanced Dataset
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.90, 0.10],
    random_state=42
)

X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
y = pd.Series(y, name="target")


# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


# =========================================================
# 🚨 Baseline Model (No Imbalance Handling)
# =========================================================

baseline_model = LogisticRegression(max_iter=5000)
baseline_model.fit(X_train, y_train)

baseline_pred = baseline_model.predict(X_test)

print("=== Baseline Model ===")
print(classification_report(y_test, baseline_pred))
print(confusion_matrix(y_test, baseline_pred))


# =========================================================
# 🔁 Approach 1 – SMOTE
# =========================================================

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

smote_model = LogisticRegression(max_iter=5000)
smote_model.fit(X_train_smote, y_train_smote)

smote_pred = smote_model.predict(X_test)

print("\n=== SMOTE Model ===")
print(classification_report(y_test, smote_pred))
print(confusion_matrix(y_test, smote_pred))


# =========================================================
# ⚖️ Approach 2 – class_weight
# =========================================================

weighted_model = LogisticRegression(
    max_iter=5000,
    class_weight="balanced"
)

weighted_model.fit(X_train, y_train)

weighted_pred = weighted_model.predict(X_test)

print("\n=== class_weight Model ===")
print(classification_report(y_test, weighted_pred))
print(confusion_matrix(y_test, weighted_pred))
