AW Dev Rethought

🌟 The best way to predict the future is to invent it - Alan Kay

🧠 AI with Python – 🚨 Handling Unseen Categories


Description:

Machine learning models often perform well in controlled environments, but real-world deployment introduces new challenges. One of the most common and critical issues is handling unseen categories in categorical features.

For example, a model trained on cities A, B, and C may suddenly receive a new value like D during inference. If not handled properly, this can break the entire pipeline.

In this project, we learn how to safely handle unseen categories using a production-ready preprocessing approach.


Understanding the Problem

Categorical features must be converted into numeric form before feeding them into a model.

A common approach is One-Hot Encoding, where each category becomes a binary column.
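As a minimal sketch of what one-hot encoding produces, the example below encodes a toy `city` column (the data is made up for illustration). Each category learned during `fit` gets its own binary column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training data (illustrative only): three distinct cities
cities = pd.DataFrame({"city": ["A", "B", "C", "A"]})

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; convert it for display
encoded = encoder.fit_transform(cities).toarray()

print(encoder.categories_)  # categories learned for each input column
print(encoded)              # one row per sample, one column per category
```

The column order follows `encoder.categories_`, so the first row (city "A") has its 1 in the first column.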

However, during production:

  • New categories may appear
  • The encoder has never seen them
  • With its default settings, the encoder raises an error at transform time

This creates a critical failure point in deployed ML systems.


What Happens Without Proper Handling?

If we use a standard encoder:

OneHotEncoder()

and pass unseen categories during inference, it results in:

❌ A ValueError at transform time, reporting the unknown categories it found

This means the model cannot serve predictions — a serious issue in production systems.
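The failure can be reproduced in a few lines. This is a small sketch on toy data (the `train` and `incoming` frames are assumptions made for illustration); it catches the exception so the script keeps running and the message can be inspected.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (illustrative only): "D" never appears during fitting
train = pd.DataFrame({"city": ["A", "B", "C"]})
incoming = pd.DataFrame({"city": ["D"]})

encoder = OneHotEncoder()  # handle_unknown defaults to "error"
encoder.fit(train)

try:
    encoder.transform(incoming)
    error_message = None
except ValueError as exc:
    # The default encoder refuses to transform categories it did not see in fit
    error_message = str(exc)

print(error_message)
```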


The Solution: handle_unknown="ignore"

To fix this, we configure the encoder as follows:

from sklearn.preprocessing import OneHotEncoder

OneHotEncoder(handle_unknown="ignore")

This ensures:

  • Unseen categories are encoded as all zeros across that feature's one-hot columns
  • No new columns are created
  • The pipeline continues working without failure
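To make the points above concrete, here is a small sketch on toy data (the frames are assumptions for illustration). It transforms one known and one unseen city and shows that the unseen value becomes a row of zeros while the column count stays the same.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (illustrative only)
train = pd.DataFrame({"city": ["A", "B", "C"]})
incoming = pd.DataFrame({"city": ["A", "D"]})  # "D" was not seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

encoded = encoder.transform(incoming).toarray()
print(encoded)        # second row is all zeros: "D" matches no known category
print(encoded.shape)  # still three columns, one per category learned in fit
```

One consequence worth keeping in mind: the model sees an unseen category as "none of the known categories", so the prediction is driven by the remaining features.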

1. Building a Safe Preprocessing Pipeline

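We wrap the encoder in a categorical pipeline together with an imputer. The imports for this and the following excerpts appear in the full listing at the end of the post.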

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

The imputer fills missing values with the most frequent category seen during training, and the encoder maps any unseen category to all zeros.
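The following sketch exercises the categorical pipeline on its own, using toy data that is assumed for illustration. One incoming row has a missing city and the other has an unseen city.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data (illustrative only): "A" is the most frequent city
train = pd.DataFrame({"city": ["A", "A", "B", "C"]})

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
categorical_transformer.fit(train)

# First row: missing value. Second row: category never seen during fit.
incoming = pd.DataFrame({"city": [np.nan, "D"]})
encoded = categorical_transformer.transform(incoming).toarray()

print(encoded)  # missing value is imputed as "A"; "D" becomes all zeros
```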


2. Combining with ColumnTransformer

We apply preprocessing selectively to different column types.

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

Numeric columns now go through the numeric transformer, and categorical columns go through the categorical transformer.


3. Integrating with a Model Pipeline

model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier())
])

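This gives a single object that applies the same preprocessing at training and inference time, and that tolerates missing values and unseen categories.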


4. Handling New Categories in Production

When new data arrives:

new_data = pd.DataFrame({
    "age": [40],
    "income": [58000],
    "gender": ["Other"],   # not seen during training
    "city": ["D"]          # not seen during training
})

The pipeline:

  • encodes the unseen values ("Other", "D") as all zeros
  • applies known transformations
  • still produces predictions

No crashes. No failures.
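Because unseen values are ignored silently, it can be useful to record which ones show up. The sketch below is one possible approach, not part of the original pipeline: `find_unseen_categories` is a hypothetical helper that compares incoming values against the `categories_` attribute of a fitted `OneHotEncoder`. The training frame is toy data assumed for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def find_unseen_categories(encoder, frame, columns):
    """Hypothetical helper: map each column to the values the encoder never saw."""
    unseen = {}
    # encoder.categories_ holds one array of known values per fitted column
    for column, known in zip(columns, encoder.categories_):
        extra = set(frame[column].dropna().unique()) - set(known)
        if extra:
            unseen[column] = sorted(extra)
    return unseen


categorical_features = ["gender", "city"]

# Toy training data (illustrative only)
train = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "city": ["A", "B", "C"],
})
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[categorical_features])

new_data = pd.DataFrame({
    "gender": ["Other", "Male"],
    "city": ["D", "A"],
})
unseen = find_unseen_categories(encoder, new_data, categorical_features)
print(unseen)
```

In the full pipeline from this post, the fitted encoder can be reached through `model_pipeline.named_steps["preprocessor"].named_transformers_["cat"].named_steps["encoder"]`.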


Why This Matters in Real Systems

In production, new values appear constantly:

  • New users
  • New locations
  • New product categories
  • New behavioral patterns

If the encoder is left at its default, the first unseen value raises an error and the request cannot be served.

Handling unseen categories ensures:

  • System reliability
  • Continuous inference
  • Better user experience
  • Reduced operational risk

Key Takeaways

  1. Unseen categories are a common production challenge.
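  2. OneHotEncoder with its default handle_unknown="error" raises an error when new categories appear.
  3. handle_unknown="ignore" avoids that error by encoding unseen categories as all zeros.
  4. Pipelines apply the same preprocessing at training and inference time.
  5. Together, these are essential for building reliable ML systems.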

Conclusion

Handling unseen categories is a critical step in making machine learning systems production-ready. By configuring encoders correctly and integrating them into pipelines, we ensure that models continue to function reliably even when new data patterns emerge.

This is a key concept in the Production ML track of the AI with Python series — bridging the gap between experimentation and real-world deployment.


Code Snippet:

# 📦 Import Required Libraries
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


# 🧩 Create a Mixed-Type Dataset
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 36, 44],
    "income": [30000, 50000, 70000, 65000, 90000, 28000, 52000, 61000],
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male", "Female", "Male"],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "purchased": [0, 1, 1, 0, 1, 0, 1, 0]
})


# ✂️ Split Features and Target
X = df.drop("purchased", axis=1)
y = df["purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y
)


# 🔢 Define Numeric and Categorical Columns
numeric_features = ["age", "income"]
categorical_features = ["gender", "city"]


# 🔧 Build Preprocessing Pipelines
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])


# 🧱 Combine with ColumnTransformer
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])


# 🤖 Build Full Pipeline with Model
model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42))
])


# 🚀 Train the Pipeline
model_pipeline.fit(X_train, y_train)


# 📊 Evaluate the Model
y_pred = model_pipeline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))


# 🔍 Test Unseen Categories in Production
new_data = pd.DataFrame({
    "age": [40, 29],
    "income": [58000, 32000],
    "gender": ["Other", "Male"],   # "Other" not seen during training
    "city": ["D", "A"]             # "D" not seen during training
})

predictions = model_pipeline.predict(new_data)
print("Predictions on new data with unseen categories:", predictions)
