🧠 AI with Python – 🚨 Handling Unseen Categories
Posted on: April 3, 2026
Description:
Machine learning models often perform well in controlled environments, but real-world deployment introduces new challenges. One of the most common and critical issues is handling unseen categories in categorical features.
For example, a model trained on cities A, B, and C may suddenly receive a new value like D during inference. If not handled properly, this can break the entire pipeline.
In this project, we learn how to safely handle unseen categories using a production-ready preprocessing approach.
Understanding the Problem
Categorical features must be converted into numeric form before feeding them into a model.
A common approach is One-Hot Encoding, where each category becomes a binary column.
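To make the idea concrete, here is a minimal sketch of one-hot encoding on a toy city column. The data and the variable names (`cities`, `encoded`) are assumptions made up for illustration; only `OneHotEncoder` itself comes from the article.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy training column with three known cities (illustrative data).
cities = np.array([["A"], ["B"], ["C"], ["A"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for printing.
encoded = encoder.fit_transform(cities).toarray()

print(encoder.categories_[0])  # learned categories, sorted: A, B, C
print(encoded)                 # one row per sample, one column per category
```

Each row contains a single 1 in the column of its category. The set of columns is fixed at fit time, which is exactly why a value that shows up later has no column to go to.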
However, during production:
- New categories may appear
- The encoder has never seen them
- With default settings, the encoder raises an error at transform time
This creates a critical failure point in deployed ML systems.
What Happens Without Proper Handling?
If we use a standard encoder:
OneHotEncoder()
and pass unseen categories during inference, transform fails, because handle_unknown defaults to "error":
❌ ValueError: unknown categories found during transform
This means the model cannot serve predictions, which is a serious issue in production systems.
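The failure is easy to reproduce. The sketch below fits a default encoder on cities A, B, and C and then asks it to transform D; the toy data and the `error_message` variable are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cities = np.array([["A"], ["B"], ["C"]])

encoder = OneHotEncoder()  # handle_unknown defaults to "error"
encoder.fit(train_cities)

error_message = None
try:
    # "D" was never seen during fit, so transform raises ValueError.
    encoder.transform(np.array([["D"]]))
except ValueError as exc:
    error_message = str(exc)

print("Transform failed:", error_message)
```

In a serving path this exception would surface as a failed prediction request unless something upstream catches it.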
The Solution: handle_unknown="ignore"
To fix this, we configure the encoder as follows:
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder(handle_unknown="ignore")
This ensures:
- Unseen categories are encoded as all zeros in that feature's one-hot columns instead of raising an error
- No new columns are created
- The pipeline continues working without failure
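A small sketch of what "ignored" means in practice, again on made-up city data: the known value gets its usual one-hot row, while the unseen value becomes a row of zeros.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cities = np.array([["A"], ["B"], ["C"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cities)

# "A" is known; "D" was never seen during fit.
encoded = encoder.transform(np.array([["A"], ["D"]])).toarray()

print(encoded)
# Row 0 has a 1 in the "A" column; row 1 is all zeros.
```

The all-zero row tells the model "none of the known categories". Predictions are still produced, but they rest on the remaining features only, so it is worth keeping an eye on how often this happens.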
1. Building a Safe Preprocessing Pipeline
We integrate the encoder into a proper pipeline setup.
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
This ensures both missing values and unseen categories are handled correctly.
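The sketch below runs this categorical pipeline on its own, with one missing value and one unseen value. The toy arrays (`train`, `incoming`) are assumptions for illustration; they use object dtype so that strings and NaN can share a column.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Training column: "A" is the most frequent city.
train = np.array([["A"], ["B"], ["A"], ["C"]], dtype=object)
categorical_transformer.fit(train)

# Incoming rows: one missing value, one unseen category.
incoming = np.array([[np.nan], ["D"]], dtype=object)
encoded = categorical_transformer.transform(incoming).toarray()

print(encoded)
# Row 0: missing value imputed as "A", then one-hot encoded.
# Row 1: unseen "D" passes the imputer untouched and encodes as all zeros.
```

Note the different treatment: a missing value is replaced with a known category, while an unseen category stays distinct and ends up as a zero row.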
2. Combining with ColumnTransformer
We apply preprocessing selectively to different column types.
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
Now numeric and categorical data are handled separately and correctly.
3. Integrating with a Model Pipeline
model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier())
])
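Because the fitted preprocessing travels with the model, inference applies exactly the transformations learned during training, including the tolerant encoder.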
4. Handling New Categories in Production
When new data arrives:
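new_data = pd.DataFrame({
    "age": [40],          # numeric columns are still required by the pipeline
    "income": [58000],
    "gender": ["Other"],  # not seen during training
    "city": ["D"]         # not seen during training
})
predictions = model_pipeline.predict(new_data)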
The pipeline:
- ignores unseen categories
- applies known transformations
- still produces predictions
No crashes. No failures.
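Since unseen values are now ignored silently, it can help to record which ones arrived. The helper below, `find_unseen_categories`, is a hypothetical function written for this post, not part of scikit-learn; it compares incoming values with the `categories_` attribute of a fitted encoder. Treat it as a sketch under those assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def find_unseen_categories(encoder, frame, columns):
    """Return {column: sorted unseen values} for a fitted OneHotEncoder.

    Hypothetical helper: `columns` must be in the order used at fit time.
    """
    unseen = {}
    for column, known in zip(columns, encoder.categories_):
        extra = set(frame[column].dropna().unique()) - set(known)
        if extra:
            unseen[column] = sorted(extra)
    return unseen

categorical_features = ["gender", "city"]

train = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "city": ["A", "B", "A", "C"],
})
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[categorical_features])

new_data = pd.DataFrame({
    "gender": ["Other", "Male"],
    "city": ["D", "A"],
})

unseen = find_unseen_categories(encoder, new_data, categorical_features)
print(unseen)  # {'gender': ['Other'], 'city': ['D']}
```

Logging this dictionary alongside predictions gives an early signal that the training data no longer covers what production is sending.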
Why This Matters in Real Systems
In production, new values appear constantly:
- New users
- New locations
- New product categories
- New behavioral patterns
If the encoder is left at its default, any one of these can make inference fail.
Handling unseen categories ensures:
- System reliability
- Continuous inference
- Better user experience
- Reduced operational risk
Key Takeaways
- Unseen categories are a common production challenge.
- OneHotEncoder's default setting (handle_unknown="error") raises an error when new categories appear.
- handle_unknown="ignore" prevents that error by encoding unseen values as all zeros.
- Pipelines ensure safe and consistent preprocessing.
- Essential for building reliable ML systems.
Conclusion
Handling unseen categories is a critical step in making machine learning systems production-ready. By configuring encoders correctly and integrating them into pipelines, we ensure that models continue to function reliably even when new data patterns emerge.
This is a key concept in the Production ML track of the AI with Python series — bridging the gap between experimentation and real-world deployment.
Code Snippet:
# 📦 Import Required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# 🧩 Create a Mixed-Type Dataset
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 36, 44],
    "income": [30000, 50000, 70000, 65000, 90000, 28000, 52000, 61000],
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male", "Female", "Male"],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "purchased": [0, 1, 1, 0, 1, 0, 1, 0]
})
# ✂️ Split Features and Target
X = df.drop("purchased", axis=1)
y = df["purchased"]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y
)
# 🔢 Define Numeric and Categorical Columns
numeric_features = ["age", "income"]
categorical_features = ["gender", "city"]
# 🔧 Build Preprocessing Pipelines
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
# 🧱 Combine with ColumnTransformer
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
# 🤖 Build Full Pipeline with Model
model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42))
])
# 🚀 Train the Pipeline
model_pipeline.fit(X_train, y_train)
# 📊 Evaluate the Model
y_pred = model_pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
# 🔍 Test Unseen Categories in Production
new_data = pd.DataFrame({
    "age": [40, 29],
    "income": [58000, 32000],
    "gender": ["Other", "Male"],  # "Other" not seen during training
    "city": ["D", "A"]            # "D" not seen during training
})
predictions = model_pipeline.predict(new_data)
print("Predictions on new data with unseen categories:", predictions)