🧠 AI with Python – 🐍⚙️ End-to-End Pipeline with Pipeline()
Posted On: September 23, 2025
Description:
Most ML projects require preprocessing (e.g., scaling, encoding) before model training. If you apply those steps manually, it’s easy to make mistakes: fitting a scaler on the full dataset leaks test-set statistics into training (data leakage), and forgetting to apply the same fitted transforms to the test set gives inconsistent results.
Scikit-learn’s Pipeline solves this by chaining preprocessing and modeling into a single object that you can fit, predict, and evaluate — consistently across datasets and CV folds.
Why use a Pipeline?
- Consistency: The same transforms applied to train data are applied to test data.
- No leakage: Calling fit on the training set fits the preprocessing steps on that data only; predict just applies the already-fitted transforms.
- Clean code: One object for the entire workflow.
- Interop: Works seamlessly with cross-validation and hyperparameter search (e.g., GridSearchCV).
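To make the cross-validation point concrete, here is a minimal sketch. It assumes the Iris dataset and a 5-fold split purely for illustration; the name pipe is a placeholder. Because the whole pipeline is passed to cross_val_score, the scaler is re-fitted on the training portion of each fold rather than on the full dataset.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data only: Iris has 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=200)),
])

# The pipeline is cloned and fitted once per fold, so the scaler
# only ever sees that fold's training rows.
scores = cross_val_score(pipe, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Scaling X up front and then cross-validating only the classifier would let every fold's scaler statistics include its validation rows, which is the leak the pipeline avoids.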
Minimal Implementation
Define a pipeline that scales features and trains a classifier:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
clf = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=200, random_state=42)),
])
Fit and predict with the same object:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
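To see what those two calls do internally, the sketch below spells out the manual equivalent and compares it with the pipeline. The dataset, split settings, and variable names (manual_pred, pipe_pred) are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Manual version: fit the scaler on train only, then reuse it on test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, no re-fit
manual_model = LogisticRegression(max_iter=200, random_state=42)
manual_model.fit(X_train_scaled, y_train)
manual_pred = manual_model.predict(X_test_scaled)

# Pipeline version: the same steps behind fit() and predict()
clf = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=200, random_state=42)),
])
clf.fit(X_train, y_train)
pipe_pred = clf.predict(X_test)

print("Same predictions:", np.array_equal(manual_pred, pipe_pred))
```

The manual version works, but it leaves two fitted objects to keep in sync, and it is easy to call fit_transform on the test set by mistake. The pipeline removes that opportunity.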
Interpreting Results
Evaluate accuracy and per-class metrics:
from sklearn.metrics import accuracy_score, classification_report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
The classification report lists precision, recall, and F1 for each class, which helps you spot class-specific errors.
Key Takeaways
- Pipeline avoids common pitfalls (like fitting the scaler on test data).
- It’s production-friendly and simplifies experimentation.
- You can pass a pipeline to GridSearchCV as the estimator to tune preprocessing and model hyperparameters together, addressing each step’s parameters as <step>__<param> (e.g., logreg__C).
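As a rough sketch of the last takeaway, the example below tunes one preprocessing option and one model hyperparameter in a single search. The grid values, the choice of with_mean and C, and max_iter=1000 are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000, random_state=42)),
])

# Keys follow the "<step name>__<parameter>" convention
param_grid = {
    "scaler__with_mean": [True, False],  # preprocessing option
    "logreg__C": [0.1, 1.0, 10.0],       # regularization strength
}

# Each candidate is cross-validated on the training set only
search = GridSearchCV(clf, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_test, y_test))
```

With the default refit=True, search refits the best pipeline on the full training set, so search.predict and search.score can be used directly afterwards.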
Code Snippet:
# Data & utilities
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Preprocessing & model
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Pipeline & metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
# Load the Iris dataset (4 features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target
# Train/test split (stratified so each split keeps the original class proportions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
# Define a two-step pipeline: scale features, then fit classifier
clf = Pipeline(steps=[
    ("scaler", StandardScaler()),  # step 1: standardize features
    ("logreg", LogisticRegression(max_iter=200, random_state=42)),  # step 2: classifier
])
# Train the full pipeline on the training set
clf.fit(X_train, y_train)
# Predict on test data (transforms + predict happen inside)
y_pred = clf.predict(X_test)
# Evaluate results
acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.4f}\n")
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))
# Access the logistic regression step and inspect learned coefficients
lr = clf.named_steps["logreg"]
print("Model classes:", lr.classes_)
print("Coef shape:", lr.coef_.shape) # (n_classes, n_features)