AW Dev Rethought

"Truth can only be found in one place: the code." - Robert C. Martin

⚡️ Saturday ML Spark – ⚖️ Imbalanced Data (SMOTE vs class_weight)


Description:

In many real-world machine learning problems, data is not evenly distributed across classes. One class often dominates, while the other — usually the more important one — appears rarely.

This is known as class imbalance, and it can severely impact model performance.

In this project, we explore two practical techniques to handle imbalanced data: SMOTE and class_weight.


Understanding the Problem

In an imbalanced dataset:

  • Majority class → appears frequently
  • Minority class → appears rarely

For example:

  • Fraud detection → very few fraudulent transactions
  • Medical diagnosis → rare disease cases
  • Churn prediction → fewer churned customers

A model trained on such data may achieve high accuracy while completely ignoring the minority class.


Why Imbalance Is a Problem

Consider a dataset with:

  • 90% class A
  • 10% class B

A model predicting only class A achieves 90% accuracy — but is useless in practice.

This is why we must focus on:

  • Precision
  • Recall
  • F1 Score

instead of just accuracy.
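The accuracy paradox above is easy to demonstrate directly. Here is a minimal sketch (the 90/10 toy labels are invented for illustration) in which a "model" that always predicts the majority class scores 90% accuracy yet has zero recall for the minority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy test set: 90 samples of class 0 (majority), 10 of class 1 (minority)
y_true = np.array([0] * 90 + [1] * 10)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))                    # 0.9 — looks great
print(recall_score(y_true, y_pred, zero_division=0))     # 0.0 — every minority case missed
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0
print(f1_score(y_true, y_pred, zero_division=0))         # 0.0
```

Recall, precision, and F1 all collapse to zero for the class we actually care about, which is exactly why accuracy alone is misleading on imbalanced data.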


Baseline Model (No Handling)

We first train a model without addressing imbalance.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

This usually results in:

  • high accuracy
  • poor recall for minority class

Approach 1: SMOTE (Synthetic Oversampling)

SMOTE generates synthetic examples for the minority class.

from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

This balances the dataset before training.

model.fit(X_train_smote, y_train_smote)

SMOTE helps the model learn minority-class patterns more effectively: rather than duplicating existing rows, it interpolates new samples between a minority example and its nearest minority-class neighbours.


Approach 2: class_weight

Instead of changing the data, we adjust the model’s learning behaviour.

model = LogisticRegression(class_weight="balanced")
model.fit(X_train, y_train)

This increases the penalty for misclassifying the minority class.
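Concretely, scikit-learn's "balanced" mode sets each class weight to n_samples / (n_classes * count(class)). A minimal sketch with invented 90/10 toy labels shows the weights it produces:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 90 majority (class 0), 10 minority (class 1)
y = np.array([0] * 90 + [1] * 10)

# "balanced" weight = n_samples / (n_classes * bincount(y))
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)  # [0.5555..., 5.0]
```

With a 9:1 imbalance, each minority-class error is penalised nine times as heavily as a majority-class error, which is what pushes the model to stop ignoring the rare class.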


SMOTE vs class_weight

Both methods aim to improve minority class performance, but they work differently:

  • SMOTE
    • modifies the data
    • increases minority samples
    • may introduce synthetic noise
  • class_weight
    • keeps data unchanged
    • adjusts model training
    • simpler and faster

Choosing between them depends on the dataset and use case.


Why This Matters

Handling imbalance is critical in:

  • fraud detection systems
  • healthcare models
  • anomaly detection
  • recommendation systems

Ignoring imbalance can lead to models that look good on paper but fail in real-world scenarios.


Key Takeaways

  1. Imbalanced data biases models toward the majority class.
  2. Accuracy alone is not a reliable metric.
  3. SMOTE creates synthetic minority samples.
  4. class_weight adjusts model learning without changing data.
  5. Both techniques improve real-world model performance.

Conclusion

Class imbalance is one of the most common challenges in machine learning. By using techniques like SMOTE and class_weight, we can build models that better handle rare but critical cases, making them more useful in real-world applications.

This makes imbalance handling an essential topic in Saturday ML Spark ⚡️ – Advanced & Practical.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

from imblearn.over_sampling import SMOTE


# 🧩 Create an Imbalanced Dataset
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=6,
    n_redundant=2,
    weights=[0.90, 0.10],
    random_state=42
)

X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
y = pd.Series(y, name="target")


# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


# =========================================================
# 🚨 Baseline Model (No Imbalance Handling)
# =========================================================

baseline_model = LogisticRegression(max_iter=5000)
baseline_model.fit(X_train, y_train)

baseline_pred = baseline_model.predict(X_test)

print("=== Baseline Model ===")
print(classification_report(y_test, baseline_pred))
print(confusion_matrix(y_test, baseline_pred))


# =========================================================
# 🔁 Approach 1 – SMOTE
# =========================================================

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

smote_model = LogisticRegression(max_iter=5000)
smote_model.fit(X_train_smote, y_train_smote)

smote_pred = smote_model.predict(X_test)

print("\n=== SMOTE Model ===")
print(classification_report(y_test, smote_pred))
print(confusion_matrix(y_test, smote_pred))


# =========================================================
# ⚖️ Approach 2 – class_weight
# =========================================================

weighted_model = LogisticRegression(
    max_iter=5000,
    class_weight="balanced"
)

weighted_model.fit(X_train, y_train)

weighted_pred = weighted_model.predict(X_test)

print("\n=== class_weight Model ===")
print(classification_report(y_test, weighted_pred))
print(confusion_matrix(y_test, weighted_pred))
