🧠 AI with Python – 🚨 Handling Unseen Categories

Posted on: April 3, 2026

Description:

Machine learning models often perform well in controlled environments, but real-world deployment introduces new challenges. One of the most common and critical issues is handling unseen categories in categorical features.

For example, a model trained on cities A, B, and C may suddenly receive a new value like D during inference. If not handled properly, this can break the entire pipeline.

In this project, we learn how to safely handle unseen categories using a production-ready preprocessing approach.

Understanding the Problem

Categorical features must be converted into numeric form before feeding them into a model.

A common approach is One-Hot Encoding, where each category becomes a binary column.
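
As a minimal sketch of what that looks like, the toy example below encodes the cities A, B, and C. The array `cities` is illustrative data, not part of the project dataset.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative training column with three known cities
cities = np.array([["A"], ["B"], ["C"], ["A"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it for display
encoded = encoder.fit_transform(cities).toarray()

print(encoder.categories_[0])  # categories learned during fit
print(encoded)                 # one binary column per learned category
```

Each row has exactly one 1, in the column that matches its city.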

However, during production:

New categories may appear
The encoder has never seen them
With default settings, the encoder raises an error

This creates a critical failure point in deployed ML systems.

What Happens Without Proper Handling?

If we use an encoder with its default setting (handle_unknown="error"):

OneHotEncoder()

and pass unseen categories during inference, it results in:

❌ ValueError: the encoder found unknown categories during transform

This means the model cannot serve predictions — a serious issue in production systems.
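
The failure can be reproduced in a few lines. This is a small sketch with illustrative arrays (`train_cities`, `new_cities`); the exact wording of the error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cities = np.array([["A"], ["B"], ["C"]])
new_cities = np.array([["D"]])  # "D" was never seen during fit

strict_encoder = OneHotEncoder()  # handle_unknown defaults to "error"
strict_encoder.fit(train_cities)

try:
    strict_encoder.transform(new_cities)
    error_message = None
except ValueError as exc:
    # The default encoder refuses to transform unknown categories
    error_message = str(exc)

print(error_message)
```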

The Solution: handle_unknown="ignore"

To fix this, we configure the encoder as follows:

from sklearn.preprocessing import OneHotEncoder

OneHotEncoder(handle_unknown="ignore")

This ensures:

Unseen categories are encoded as all zeros in that feature's one-hot columns
No new columns are created
The pipeline continues working without failure
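
To see what "ignored" means in practice, here is a small sketch using the same illustrative city arrays as assumptions. The unseen value "D" becomes a row of zeros, and the output keeps the three columns learned during fit.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cities = np.array([["A"], ["B"], ["C"]])
new_cities = np.array([["A"], ["D"]])  # "D" is unseen

tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train_cities)

# Known "A" gets its usual column; unseen "D" gets zeros everywhere
encoded_new = tolerant_encoder.transform(new_cities).toarray()
print(encoded_new)
```

Note that the model then sees "D" as "none of the known cities", which is a modelling choice rather than a free fix.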

1. Building a Safe Preprocessing Pipeline

We integrate the encoder into a proper pipeline setup.

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

This ensures both missing values and unseen categories are handled correctly.
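
The sketch below runs this categorical pipeline on a single illustrative column to show both behaviours at once. The arrays `train_city` and `incoming_city` are made-up data, stored as object arrays so that a missing value can sit next to strings.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Illustrative data: "A" is the most frequent city in training
train_city = np.array([["A"], ["B"], ["A"], ["C"]], dtype=object)
# One missing value and one unseen category arrive at inference time
incoming_city = np.array([[np.nan], ["D"]], dtype=object)

categorical_transformer.fit(train_city)
encoded = categorical_transformer.transform(incoming_city).toarray()

# Row 0: missing value imputed as "A"; row 1: unseen "D" encoded as zeros
print(encoded)
```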

2. Combining with ColumnTransformer

We apply preprocessing selectively to different column types.

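from sklearn.compose import ColumnTransformer

# numeric_transformer and the feature lists are defined in the full code snippet
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])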

Now numeric and categorical data are handled separately and correctly.

3. Integrating with a Model Pipeline

from sklearn.ensemble import RandomForestClassifier

model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier())
])

This bundles preprocessing and the model into a single object, so the same transformations run at training and at inference time.

4. Handling New Categories in Production

When new data arrives:

new_data = pd.DataFrame({
    "age": [40],
    "income": [58000],
    "gender": ["Other"],
    "city": ["D"]
})

The pipeline:

ignores unseen categories
applies known transformations
still produces predictions

No crashes. No failures.
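
Because the encoder maps unseen values to zeros silently, it can help to record them before predicting. The helper below, `find_unseen_categories`, is a hypothetical function written for this post, not part of scikit-learn; it compares incoming values with the `categories_` attribute of a fitted encoder.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def find_unseen_categories(encoder, frame, columns):
    """Return {column: sorted unseen values} for a fitted OneHotEncoder.

    Hypothetical helper: `columns` must be in the order used during fit.
    """
    unseen = {}
    for column, known in zip(columns, encoder.categories_):
        extra = set(frame[column].dropna().unique()) - set(known)
        if extra:
            unseen[column] = sorted(extra)
    return unseen


categorical_features = ["gender", "city"]

# Illustrative training and incoming data
train = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "city": ["A", "B", "A", "C"],
})
new_data = pd.DataFrame({
    "gender": ["Other", "Male"],
    "city": ["D", "A"],
})

encoder = OneHotEncoder(handle_unknown="ignore").fit(train[categorical_features])
report = find_unseen_categories(encoder, new_data, categorical_features)
print(report)
```

A report like this could be logged or sent to monitoring, so that a growing share of unseen values becomes a signal to retrain.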

Why This Matters in Real Systems

In production, new values appear constantly:

New users
New locations
New product categories
New behavioral patterns

If the encoder is not configured for them, inference requests that contain these values will fail.

Handling unseen categories ensures:

System reliability
Continuous inference
Better user experience
Reduced operational risk

Key Takeaways

Unseen categories are a common production challenge.
OneHotEncoder with default settings raises an error when new categories appear.
handle_unknown="ignore" prevents runtime errors.
Pipelines ensure safe and consistent preprocessing.
Handling unseen categories is essential for building reliable ML systems.

Conclusion

Handling unseen categories is a critical step in making machine learning systems production-ready. By configuring encoders correctly and integrating them into pipelines, we ensure that models continue to function reliably even when new data patterns emerge.

This is a key concept in the Production ML track of the AI with Python series — bridging the gap between experimentation and real-world deployment.

Code Snippet:

# 📦 Import Required Libraries
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report


# 🧩 Create a Mixed-Type Dataset
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 36, 44],
    "income": [30000, 50000, 70000, 65000, 90000, 28000, 52000, 61000],
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male", "Female", "Male"],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "purchased": [0, 1, 1, 0, 1, 0, 1, 0]
})


# ✂️ Split Features and Target
X = df.drop("purchased", axis=1)
y = df["purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y
)


# 🔢 Define Numeric and Categorical Columns
numeric_features = ["age", "income"]
categorical_features = ["gender", "city"]


# 🔧 Build Preprocessing Pipelines
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])


# 🧱 Combine with ColumnTransformer
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])


# 🤖 Build Full Pipeline with Model
model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42))
])


# 🚀 Train the Pipeline
model_pipeline.fit(X_train, y_train)


# 📊 Evaluate the Model
y_pred = model_pipeline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
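# zero_division=0 avoids warnings when a class is never predicted on this tiny test set
print(classification_report(y_test, y_pred, zero_division=0))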


# 🔍 Test Unseen Categories in Production
new_data = pd.DataFrame({
    "age": [40, 29],
    "income": [58000, 32000],
    "gender": ["Other", "Male"],   # "Other" not seen during training
    "city": ["D", "A"]             # "D" not seen during training
})

predictions = model_pipeline.predict(new_data)
print("Predictions on new data with unseen categories:", predictions)
