🧠 AI with Python – 💾 Saving Pipeline vs Model
Posted on: April 16, 2026
Description:
Training a machine learning model is only part of the journey. The real challenge begins when we try to deploy and use that model in production.
One of the most common mistakes developers make is saving only the trained model — forgetting that preprocessing is just as important.
In this project, we explore why saving only the model can break your system and why saving the full pipeline is the correct approach.
Understanding the Problem
During training:
- Data is cleaned
- Features are scaled or transformed
- Model learns from processed data
But in production:
- Raw data is passed to the model
- Preprocessing steps are missing
This leads to a mismatch:
Model trained on processed data → receives raw data
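To make the mismatch concrete, here is a minimal sketch (using a synthetic single feature, not the article's dataset) showing how far a raw value sits from what a model trained on scaled data expects:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training feature, e.g. values in the 11-15 range
X_train = np.array([[11.0], [12.5], [13.0], [14.5]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# After scaling, training values are centered near 0 with unit variance
print(X_scaled.ravel())  # → [-1.4 -0.2  0.2  1.4]

# At inference, the model expects the scaled version of the input...
X_raw = np.array([[13.0]])
print("scaled input:", scaler.transform(X_raw)[0, 0])  # → 0.2

# ...but without the scaler it would receive 13.0 instead of 0.2,
# a value far outside the distribution it learned from
```

A model whose weights were fit against values around zero will produce meaningless scores when fed raw magnitudes like 13.0.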
What Happens When You Save Only the Model?
Consider this workflow:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
model.fit(X_train_scaled, y_train)
joblib.dump(model, "model.pkl")
Later during inference:
loaded_model = joblib.load("model.pkl")
loaded_model.predict(X_test) # ❌ raw data
The model expects scaled data but receives unscaled input.
This results in:
- incorrect predictions
- degraded performance
- unreliable outputs
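One tempting workaround is to save the scaler as a second artifact and re-apply it by hand at inference time. This works, but it doubles the files you must version and deploy together, and the manual transform step is easy to forget (a sketch with hypothetical filenames):

```python
import joblib
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
model = LogisticRegression(max_iter=5000)
model.fit(scaler.fit_transform(X_train), y_train)

# Two separate artifacts that must always be shipped and loaded as a pair
joblib.dump(scaler, "scaler.pkl")
joblib.dump(model, "model_only.pkl")

# At inference, the caller must remember to transform before predicting
loaded_scaler = joblib.load("scaler.pkl")
loaded_model = joblib.load("model_only.pkl")
y_pred = loaded_model.predict(loaded_scaler.transform(X_test))
print("accuracy:", accuracy_score(y_test, y_pred))
```

Every call site now carries the burden of applying the right transform in the right order, which is exactly the human error a pipeline eliminates.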
Correct Approach: Save the Full Pipeline
Instead of saving just the model, we combine preprocessing and model into a pipeline.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression())
])
Now train the pipeline:
pipeline.fit(X_train, y_train)
Saving and Loading the Pipeline
joblib.dump(pipeline, "pipeline.pkl")
loaded_pipeline = joblib.load("pipeline.pkl")
loaded_pipeline.predict(X_test)
The pipeline ensures:
- preprocessing is applied automatically
- predictions are consistent
- no manual steps are required
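As a quick sanity check (a sketch, not part of the original listing), the fitted scaler really does travel inside the saved pipeline and can be inspected after loading via `named_steps`:

```python
import joblib
from sklearn.datasets import load_wine
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000)),
])
pipeline.fit(X, y)

joblib.dump(pipeline, "full_pipeline.pkl")
loaded = joblib.load("full_pipeline.pkl")

# The scaler's learned statistics are preserved inside the single artifact
print(loaded.named_steps["scaler"].mean_[:3])

# predict() runs scaler.transform and model.predict in one call
print(loaded.predict(X[:5]))
```

Because the fitted transformer is part of the artifact, there is no way to load the model without also loading the preprocessing it depends on.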
Why This Matters in Production
Saving only the model leads to:
- missing preprocessing logic
- inconsistent predictions
- fragile deployment pipelines
Saving the full pipeline ensures:
- consistent transformations
- reliable inference
- simpler deployment
- fewer human errors
Key Takeaways
- Models depend on preprocessing, not just training.
- Saving only the model leads to incorrect predictions.
- Pipelines bundle preprocessing and modeling together.
- Saving the full pipeline ensures consistency.
- Bundling preprocessing with the model is a critical best practice for production ML systems.
Conclusion
Saving only the trained model is a common but critical mistake in machine learning. By saving the full pipeline, we ensure that all transformations and modeling steps are preserved, making our system reliable and production-ready.
This is a key concept in the Production ML track of the AI with Python series — helping you build systems that work not just in notebooks, but in the real world.
Code Snippet:
# 📦 Import Required Libraries
import joblib
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# 🧩 Load Dataset
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# ✂️ Split Data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42,
    stratify=y
)
# =========================================================
# ❌ Approach 1 – Saving Only the Model
# =========================================================
# Train on scaled training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
model = LogisticRegression(max_iter=5000)
model.fit(X_train_scaled, y_train)
# Save ONLY the model
joblib.dump(model, "model_only.pkl")
# =========================================================
# ⚠️ Inference Problem – Missing Preprocessing
# =========================================================
loaded_model = joblib.load("model_only.pkl")
# ❌ Wrong: raw test data passed directly to model trained on scaled data
y_pred_wrong = loaded_model.predict(X_test)
print("❌ Accuracy (model only):", accuracy_score(y_test, y_pred_wrong))
# =========================================================
# ✅ Approach 2 – Saving Full Pipeline
# =========================================================
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=5000))
])
pipeline.fit(X_train, y_train)
# Save full pipeline
joblib.dump(pipeline, "full_pipeline.pkl")
# =========================================================
# 🚀 Load and Use Full Pipeline
# =========================================================
loaded_pipeline = joblib.load("full_pipeline.pkl")
# ✅ Correct: preprocessing + model both applied automatically
y_pred_correct = loaded_pipeline.predict(X_test)
print("✅ Accuracy (full pipeline):", accuracy_score(y_test, y_pred_correct))