AW Dev Rethought

"Truth can only be found in one place: the code." - Robert C. Martin

🧠 AI with Python – 💾 Saving Pipeline vs Model


Description:

Training a machine learning model is only part of the journey. The real challenge begins when we try to deploy and use that model in production.

One of the most common mistakes developers make is saving only the trained model — forgetting that preprocessing is just as important.

In this project, we explore why saving only the model can break your system and why saving the full pipeline is the correct approach.


Understanding the Problem

During training:

  • Data is cleaned
  • Features are scaled or transformed
  • Model learns from processed data

But in production:

  • Raw data is passed to the model
  • Preprocessing steps are missing

This leads to a mismatch:

Model trained on processed data → receives raw data

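A tiny sketch makes the mismatch concrete. The toy values below are illustrative, not from the wine dataset: after `StandardScaler`, the model sees inputs centered near 0, while raw inputs arrive on a completely different scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One toy feature with a large raw scale
X_train = np.array([[120.0], [130.0], [140.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

print(X_scaled.ravel())  # roughly [-1.22, 0.0, 1.22]
print(X_train.ravel())   # raw values: 120, 130, 140

# A model fit on the scaled column expects inputs near 0;
# raw inputs are two orders of magnitude larger.
```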

What Happens When You Save Only the Model?

Consider this workflow:

import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = LogisticRegression(max_iter=5000)
model.fit(X_train_scaled, y_train)
joblib.dump(model, "model.pkl")  # the fitted scaler is NOT saved

Later during inference:

loaded_model = joblib.load("model.pkl")
loaded_model.predict(X_test)  # ❌ raw data

The model expects scaled data but receives unscaled input.

This results in:

  • incorrect predictions
  • degraded performance
  • unreliable outputs

Correct Approach: Save the Full Pipeline

Instead of saving just the model, we combine preprocessing and model into a pipeline.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])

Now train the pipeline:

pipeline.fit(X_train, y_train)

Saving and Loading the Pipeline

joblib.dump(pipeline, "pipeline.pkl")

loaded_pipeline = joblib.load("pipeline.pkl")
loaded_pipeline.predict(X_test)

The pipeline ensures:

  • preprocessing is applied automatically
  • predictions are consistent
  • no manual steps are required
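One way to convince yourself that the pipeline really does apply preprocessing automatically is to replicate its steps by hand via `named_steps` and compare the predictions:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000)),
])
pipeline.fit(X_train, y_train)

# Manually replicate what the pipeline does internally:
# scale first, then predict on the scaled data
scaled = pipeline.named_steps["scaler"].transform(X_test)
manual = pipeline.named_steps["model"].predict(scaled)

# Identical results: the pipeline is doing the scaling for us
print(np.array_equal(pipeline.predict(X_test), manual))
```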

Why This Matters in Production

Saving only the model leads to:

  • missing preprocessing logic
  • inconsistent predictions
  • fragile deployment pipelines

Saving the full pipeline ensures:

  • consistent transformations
  • reliable inference
  • simpler deployment
  • fewer human errors
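A related production concern: pickle files produced by `joblib.dump` are sensitive to the scikit-learn version, so loading a pipeline in a different environment than the one that trained it can fail or behave differently. One minimal sketch is to record the version alongside the artifact and compare it at load time; the `.meta.json` filename and the check itself are illustrative choices here, not part of scikit-learn.

```python
import json
import joblib
import sklearn
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000)),
]).fit(X, y)

# Save the pipeline together with the library version that produced it
joblib.dump(pipeline, "full_pipeline.pkl")
with open("full_pipeline.meta.json", "w") as f:
    json.dump({"sklearn_version": sklearn.__version__}, f)

# At load time, compare versions before trusting the unpickled object
with open("full_pipeline.meta.json") as f:
    meta = json.load(f)

if meta["sklearn_version"] != sklearn.__version__:
    print("Warning: pipeline was saved with scikit-learn",
          meta["sklearn_version"])
```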

Key Takeaways

  1. Models depend on preprocessing, not just training.
  2. Saving only the model leads to incorrect predictions.
  3. Pipelines bundle preprocessing and modeling together.
  4. Saving the full pipeline ensures consistency.
  5. Saving the full pipeline is a critical best practice for production ML systems.

Conclusion

Saving only the trained model is a common but critical mistake in machine learning. By saving the full pipeline, we ensure that all transformations and modeling steps are preserved, making our system reliable and production-ready.

This is a key concept in the Production ML track of the AI with Python series — helping you build systems that work not just in notebooks, but in the real world.


Code Snippet:

# 📦 Import Required Libraries
import joblib
import pandas as pd

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# 🧩 Load Dataset
data = load_wine()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target


# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)


# =========================================================
# ❌ Approach 1 – Saving Only the Model
# =========================================================

# Train on scaled training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

model = LogisticRegression(max_iter=5000)
model.fit(X_train_scaled, y_train)

# Save ONLY the model
joblib.dump(model, "model_only.pkl")


# =========================================================
# ⚠️ Inference Problem – Missing Preprocessing
# =========================================================

loaded_model = joblib.load("model_only.pkl")

# ❌ Wrong: raw test data passed directly to model trained on scaled data
y_pred_wrong = loaded_model.predict(X_test)

print("❌ Accuracy (model only):", accuracy_score(y_test, y_pred_wrong))


# =========================================================
# ✅ Approach 2 – Saving Full Pipeline
# =========================================================

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000))
])

pipeline.fit(X_train, y_train)

# Save full pipeline
joblib.dump(pipeline, "full_pipeline.pkl")


# =========================================================
# 🚀 Load and Use Full Pipeline
# =========================================================

loaded_pipeline = joblib.load("full_pipeline.pkl")

# ✅ Correct: preprocessing + model both applied automatically
y_pred_correct = loaded_pipeline.predict(X_test)

print("✅ Accuracy (full pipeline):", accuracy_score(y_test, y_pred_correct))
