🧠 AI with Python – 🐍⚙️ End-to-End Pipeline with Pipeline()


Description:

Most ML projects require preprocessing (e.g., scaling, encoding) before model training. If you apply those steps manually, it's easy to fit a scaler on the full dataset or forget to apply the same transforms to the test set, introducing data leakage and inconsistent results.

Scikit-learn's Pipeline solves this by chaining preprocessing and modeling into a single object that you can fit, predict with, and evaluate, applying the same steps consistently across datasets and CV folds.


Why use a Pipeline?

  • Consistency: The same transforms applied to train data are applied to test data.
  • No leakage: Fit preprocessing only on the training set automatically.
  • Clean code: One object for the entire workflow.
  • Interop: Works seamlessly with cross-validation and hyperparameter search (e.g., GridSearchCV); see the cross-validation sketch below.
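
To see the no-leakage guarantee in action, pass the whole pipeline to cross-validation. A minimal sketch, assuming the clf pipeline and the X, y arrays defined in the snippets below:

from sklearn.model_selection import cross_val_score

# Each fold refits the scaler on its own training split, so no
# statistics leak in from the held-out portion.
scores = cross_val_score(clf, X, y, cv=5)
print("CV accuracy per fold:", scores)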

Minimal Implementation

Define a pipeline that scales features and trains a classifier:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=200, random_state=42))
])

Fit and predict with the same object:

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
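
Under the hood, predict runs each fitted transform and then the final estimator. The two lines below are a rough manual equivalent, shown only for illustration:

X_test_scaled = clf.named_steps["scaler"].transform(X_test)       # apply the fitted scaler
y_pred_manual = clf.named_steps["logreg"].predict(X_test_scaled)  # then classify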

Interpreting Results

Evaluate accuracy and per-class metrics:

from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))

A classification report highlights precision/recall/F1 for each class and helps you spot class-specific errors.
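
A minimal sketch of printing one with scikit-learn's built-in helper:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))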


Key Takeaways

  • Pipeline avoids common pitfalls (like fitting the scaler on test data).
  • It’s production-friendly and simplifies experimentation.
  • You can nest pipelines inside GridSearchCV to tune preprocessing and model hyperparameters together; see the sketch after this list.
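
A minimal sketch of that nesting, assuming the clf pipeline from above. Grid keys use scikit-learn's "step__parameter" naming to reach inside the pipeline:

from sklearn.model_selection import GridSearchCV

# Keys follow "<step name>__<parameter>", so one grid can tune
# both the preprocessing step and the model.
param_grid = {
    "scaler__with_mean": [True, False],  # a preprocessing option
    "logreg__C": [0.1, 1.0, 10.0],       # regularization strength
}
search = GridSearchCV(clf, param_grid, cv=5)
search.fit(X_train, y_train)
print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)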

Code Snippet:

# Data & utilities
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Preprocessing & model
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Pipeline & metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report


# Load the Iris dataset (4 features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target

# Train/test split (stratified for balanced classes)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)


# Define a two-step pipeline: scale features, then fit classifier
clf = Pipeline(steps=[
    ("scaler", StandardScaler()),            # step 1: standardize features
    ("logreg", LogisticRegression(max_iter=200, n_jobs=None, random_state=42))  # step 2: classifier
])


# Train the full pipeline on the training set
clf.fit(X_train, y_train)


# Predict on test data (transforms + predict happen inside)
y_pred = clf.predict(X_test)

# Evaluate results
acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.4f}\n")
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))


# Access the logistic regression step and inspect learned coefficients
lr = clf.named_steps["logreg"]
print("Model classes:", lr.classes_)
print("Coef shape:", lr.coef_.shape)  # (n_classes, n_features)
