🧠 AI with Python – 🔧 ColumnTransformer for Mixed Data
Posted on: March 26, 2026
Description:
Real-world datasets are rarely clean or uniform. Most of the time, they contain a mix of numeric and categorical features — such as age, income, gender, location, product type, and more.
Handling these different data types correctly is one of the most important steps in building a production-ready machine learning system.
In this project, we use ColumnTransformer to apply appropriate preprocessing steps to different types of features within a single pipeline.
Understanding the Problem
Machine learning models typically expect numeric input.
However, real-world datasets contain:
- Numeric features → require scaling, normalisation
- Categorical features → require encoding
- Missing values → require different handling strategies
If preprocessing is done manually:
- It becomes error-prone
- It may differ between training and inference
- It leads to inconsistent results
We need a structured way to apply different transformations to different columns.
What Is ColumnTransformer?
ColumnTransformer allows us to:
- Select specific columns
- Apply different preprocessing pipelines to each group
- Combine the results into a single transformed dataset
In simple terms:
“Apply the right transformation to the right column — automatically.”
1. Identifying Column Types
We first separate numeric and categorical columns.
numeric_features = ["age", "income"]
categorical_features = ["gender", "city"]
This step is essential for applying the correct transformations.
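As an alternative to listing the columns by hand, scikit-learn's make_column_selector can pick them by dtype. A minimal sketch (the small DataFrame and its column names are illustrative assumptions, not part of the dataset used later):

```python
import pandas as pd
from sklearn.compose import make_column_selector

df = pd.DataFrame({
    "age": [25, 32],
    "income": [30000.0, 50000.0],
    "gender": ["Male", "Female"],
    "city": ["A", "B"],
})

# Each selector is a callable that returns matching column names
numeric_selector = make_column_selector(dtype_include="number")
categorical_selector = make_column_selector(dtype_include=object)

print(numeric_selector(df))      # ['age', 'income']
print(categorical_selector(df))  # ['gender', 'city']
```

These selectors can be passed directly to ColumnTransformer in place of explicit column lists, which keeps the pipeline correct even if new columns of the same dtype are added.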
2. Building Separate Preprocessing Pipelines
We define transformations for each type of data.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
Each pipeline is specialised for its data type.
3. Combining with ColumnTransformer
We merge both pipelines into one unified preprocessing step.
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
Now preprocessing is structured, reusable, and automated.
4. Integrating with a Model Pipeline
We attach the preprocessing step to a machine learning model.
from sklearn.ensemble import RandomForestClassifier

model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier())
])
This creates a full end-to-end pipeline.
Why This Approach Matters
ColumnTransformer solves several real-world challenges:
- Ensures consistent preprocessing
- Prevents manual errors
- Handles mixed data types correctly
- Supports deployment workflows
- Works seamlessly with hyperparameter tuning
It is a fundamental building block for production ML systems.
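Because the preprocessor and model live in one Pipeline, a grid search can tune parameters of both at once using double-underscore nested names. A sketch of this idea, reusing the same dataset (the parameter values searched are illustrative assumptions):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 36, 44],
    "income": [30000, 50000, 70000, 65000, 90000, 28000, 52000, 61000],
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male", "Female", "Male"],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "purchased": [0, 1, 1, 0, 1, 0, 1, 0],
})
X, y = df.drop("purchased", axis=1), df["purchased"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["age", "income"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("encoder", OneHotEncoder(handle_unknown="ignore"))]), ["gender", "city"]),
])

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(random_state=42))
])

# Nested parameter names follow step__substep__param
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__n_estimators": [50, 100],
}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

Every candidate is cross-validated with the full preprocessing refit inside each fold, so no information leaks from validation data into the imputation or scaling statistics.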
Key Takeaways
- Real-world datasets contain mixed feature types.
- ColumnTransformer applies correct transformations per column group.
- Separate pipelines keep preprocessing clean and modular.
- Combining with Pipeline ensures end-to-end consistency.
- A critical concept for scalable production ML systems.
Conclusion
ColumnTransformer enables us to handle mixed data types in a structured and production-ready way. By applying specialised preprocessing to numeric and categorical features within a unified workflow, we eliminate inconsistencies and build more reliable machine learning systems.
This makes it a key component in the Production ML track of the AI with Python series — bridging the gap between experimentation and real-world deployment.
Code Snippet:
# 📦 Import Required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# 🧩 Create a Mixed-Type Dataset
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 36, 44],
    "income": [30000, 50000, 70000, 65000, 90000, 28000, 52000, 61000],
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male", "Female", "Male"],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "purchased": [0, 1, 1, 0, 1, 0, 1, 0]
})
# ✂️ Split Features and Target
X = df.drop("purchased", axis=1)
y = df["purchased"]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y
)
# 🔢 Define Numeric and Categorical Columns
numeric_features = ["age", "income"]
categorical_features = ["gender", "city"]
# 🔧 Build Preprocessing Pipelines
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
# 🧱 Combine with ColumnTransformer
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
# 🤖 Build Full Pipeline with Model
model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42))
])
# 🚀 Train the Pipeline
model_pipeline.fit(X_train, y_train)
# 📊 Evaluate the Model
y_pred = model_pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
# 🔍 Predict on New Mixed-Type Data
new_data = pd.DataFrame({
    "age": [40, 29],
    "income": [58000, 32000],
    "gender": ["Female", "Male"],
    "city": ["C", "A"]
})
predictions = model_pipeline.predict(new_data)
print("Predictions:", predictions)