AW Dev Rethought

Truth can only be found in one place: the code - Robert C. Martin

⚡️ Saturday ML Spark – 🚨 Data Leakage Detection


Description:

A machine learning model showing high accuracy can feel like a win, but sometimes that performance is an illusion.

One of the most common reasons behind unrealistically good results is data leakage.

In this project, we explore how data leakage happens, why it’s dangerous, and how to detect and prevent it in machine learning pipelines.


Understanding the Problem

Machine learning models are supposed to learn patterns from training data only.

However, in practice:

  • preprocessing is applied to the entire dataset
  • future information leaks into training
  • test data indirectly influences the model

This creates a situation where the model has already “seen” part of the answers.
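
To make the "future information" case concrete, here is a minimal sketch that assumes the rows are ordered by time. The toy array and the choice of scikit-learn's TimeSeriesSplit are illustrative assumptions, not part of the original project. Each fold trains only on rows that come before the rows it is evaluated on.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 observations already sorted oldest to newest.
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for train_idx, test_idx in folds:
    # Every training row precedes every test row, so nothing from the
    # "future" reaches the model during fitting.
    print("train:", train_idx, "test:", test_idx)
```

A shuffled random split on the same data would mix later rows into the training set, which is one way future information can leak.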


What Is Data Leakage?

Data leakage occurs when information that should not be available during training is used to build the model.

This leads to:

  • artificially high accuracy
  • misleading evaluation results
  • poor performance in real-world usage

Common Mistake: Preprocessing Before Splitting

One of the most common causes of leakage is applying transformations before splitting the data.

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # ❌ uses full dataset

Here, the scaler computes its mean and standard deviation from every row, including rows that will later end up in the test set.

This means the model indirectly gains access to information about the test data.
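
One way to see what leaks is to compare the statistics a scaler learns from the full dataset with those it learns from the training rows only. The sketch below assumes the same breast cancer dataset and split settings used in the full code snippet of this post. The variable names full_scaler, train_scaler and mean_shift are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

full_scaler = StandardScaler().fit(X)         # leaky: sees test rows too
train_scaler = StandardScaler().fit(X_train)  # clean: training rows only

# Any difference between the two sets of means comes from the test rows.
mean_shift = np.abs(full_scaler.mean_ - train_scaler.mean_)
print("Largest shift in a feature mean:", mean_shift.max())
```

The shift may be small on a dataset like this one, but it is exactly the information that should not reach the training step.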


Correct Approach: Split First, Then Transform

To prevent leakage, always split the data before applying transformations.

X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = StandardScaler()

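X_train_scaled = scaler.fit_transform(X_train)  # ✅ fit on training data only
X_test_scaled = scaler.transform(X_test)        # ✅ reuses training statistics, no refit

Now the scaler's statistics come from the training data alone, and the test set stays unseen until evaluation.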


Best Practice: Use Pipelines

A reliable way to avoid preprocessing leakage is to bundle the transformations and the model into a single pipeline.

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])

Using:

pipeline.fit(X_train, y_train)
pipeline.predict(X_test)

ensures that the scaler is fitted on the training data only and that the same fitted transformation is reused at prediction time.
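
The sketch below checks that behavior directly. It fits a pipeline on the training split and then inspects the scaler stored inside it. The dataset and split settings are assumed to match the full code snippet of this post, and the names fitted_scaler, matches_train and matches_full are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000)),
])
pipeline.fit(X_train, y_train)

# The scaler inside the pipeline was fitted during pipeline.fit, so its
# stored means should match the training data, not the full dataset.
fitted_scaler = pipeline.named_steps["scaler"]
matches_train = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))
matches_full = np.allclose(fitted_scaler.mean_, X.mean(axis=0))
print("matches training means:", matches_train)
print("matches full-dataset means:", matches_full)
```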


How to Detect Data Leakage

Detecting leakage is not always straightforward, but common signs include:

  • accuracy that is unusually high for the problem
  • a sharp performance drop on new, real-world data
  • a persistent gap between validation and production results

If something looks too good to be true, it probably is.
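
One practical check is to score the same model twice, once with preprocessing done on the full dataset and once with preprocessing inside a pipeline, and then compare the results. The sketch below is an illustration under stated assumptions and is not taken from the original project. It uses synthetic noise data and swaps in feature selection (SelectKBest) as the preprocessing step, because the gap is easier to see there than with scaling. On labels unrelated to the features, a leak-free score should sit near chance, so a much higher score from the first protocol points to leakage.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 5000))   # pure noise features (synthetic)
y_noise = np.array([0, 1] * 50)          # labels unrelated to X_noise

# Leaky protocol: pick the "best" features using every row, then cross-validate.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_score = cross_val_score(
    LogisticRegression(max_iter=5000), X_selected, y_noise, cv=5
).mean()

# Leak-free protocol: selection is refitted inside each training fold.
honest_pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=5000)),
])
honest_score = cross_val_score(honest_pipeline, X_noise, y_noise, cv=5).mean()

print("leaky protocol accuracy:    ", leaky_score)
print("leak-free protocol accuracy:", honest_score)
```

The same comparison can be run on a real dataset: if moving the preprocessing into a pipeline lowers the score noticeably, the earlier number was likely inflated.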


Why This Matters

Data leakage can lead to:

  • incorrect model evaluation
  • bad business decisions
  • unreliable ML systems
  • loss of trust in predictions

Because a leaked evaluation looks like success, it is one of the most critical issues in machine learning.


Key Takeaways

  1. Data leakage occurs when information that should be unavailable during training, such as test data, enters training.
  2. It leads to unrealistic and misleading performance.
  3. Preprocessing before splitting is a common mistake.
  4. Always split data before applying transformations.
  5. Pipelines keep preprocessing inside the training step, which prevents this kind of leakage.

Conclusion

Data leakage is a silent but dangerous problem in machine learning. While it can make models appear highly accurate during development, it ultimately leads to poor real-world performance. By following proper workflows and using pipelines, we can build models that are reliable, trustworthy, and production-ready.

This strengthens your understanding of Saturday ML Spark ⚡️ – Feature Engineering and helps you avoid critical mistakes in real-world ML systems.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# 🧩 Load Dataset
data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target


# =========================================================
# ❌ WRONG APPROACH – DATA LEAKAGE
# =========================================================

# Applying scaling before splitting (leakage)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # ❌ Uses entire dataset

X_train_leak, X_test_leak, y_train_leak, y_test_leak = train_test_split(
    X_scaled,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

model_leak = LogisticRegression(max_iter=5000)
model_leak.fit(X_train_leak, y_train_leak)

y_pred_leak = model_leak.predict(X_test_leak)

print("❌ Accuracy with Data Leakage:", accuracy_score(y_test_leak, y_pred_leak))


# =========================================================
# ✅ CORRECT APPROACH – NO LEAKAGE
# =========================================================

# Split first
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

# Fit scaler only on training data
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=5000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

print("✅ Accuracy without Leakage:", accuracy_score(y_test, y_pred))


# =========================================================
# 🚀 BEST PRACTICE – USING PIPELINE
# =========================================================

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000))
])

# cross_val_score refits the whole pipeline on each training fold,
# so the scaler never sees the rows held out for validation
scores = cross_val_score(pipeline, X, y, cv=5)

print("🚀 Cross-validated Accuracy (Pipeline):", scores.mean())
