🧠 AI with Python – 🔧 ColumnTransformer for Mixed Data
Posted on: March 26, 2026
Description:
Real-world datasets are rarely clean or uniform. Most of the time, they contain a mix of numeric and categorical features — such as age, income, gender, location, product type, and more.
Handling these different data types correctly is one of the most important steps in building a production-ready machine learning system.
In this project, we use ColumnTransformer to apply appropriate preprocessing steps to different types of features within a single pipeline.
Understanding the Problem
Machine learning models typically expect numeric input.
However, real-world datasets contain:
- Numeric features → require scaling, normalisation
- Categorical features → require encoding
- Missing values → require different handling strategies
If preprocessing is done manually:
- It becomes error-prone
- It may differ between training and inference
- It leads to inconsistent results
We need a structured way to apply different transformations to different columns.
What Is ColumnTransformer?
ColumnTransformer allows us to:
- Select specific columns
- Apply different preprocessing pipelines to each group
- Combine the results into a single transformed dataset
In simple terms:
“Apply the right transformation to the right column — automatically.”
1. Identifying Column Types
We first separate numeric and categorical columns.
numeric_features = ["age", "income"]
categorical_features = ["gender", "city"]
This step is essential for applying the correct transformations.
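As an alternative to listing the columns by hand, scikit-learn's make_column_selector can pick them by dtype. A minimal sketch (the small DataFrame and its column names are illustrative assumptions, not part of the dataset used later):

```python
import pandas as pd
from sklearn.compose import make_column_selector

df = pd.DataFrame({
    "age": [25, 32],
    "income": [30000.0, 50000.0],
    "gender": ["Male", "Female"],
    "city": ["A", "B"],
})

# Each selector is a callable that returns matching column names
numeric_selector = make_column_selector(dtype_include="number")
categorical_selector = make_column_selector(dtype_include=object)

print(numeric_selector(df))      # ['age', 'income']
print(categorical_selector(df))  # ['gender', 'city']
```

These selectors can be passed directly to ColumnTransformer in place of explicit column lists, which keeps the pipeline correct even if new columns of the same dtype are added.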
2. Building Separate Preprocessing Pipelines
We define transformations for each type of data.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
Each pipeline is specialised for its data type.
3. Combining with ColumnTransformer
We merge both pipelines into one unified preprocessing step.
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
Now preprocessing is structured, reusable, and automated.
4. Integrating with a Model Pipeline
We attach the preprocessing step to a machine learning model.
from sklearn.ensemble import RandomForestClassifier

model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier())
])
This creates a full end-to-end pipeline.
Why This Approach Matters
ColumnTransformer solves several real-world challenges:
- Ensures consistent preprocessing
- Prevents manual errors
- Handles mixed data types correctly
- Supports deployment workflows
- Works seamlessly with hyperparameter tuning
It is a fundamental building block for production ML systems.
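Because the preprocessor and model live in one Pipeline, a grid search can tune parameters of both at once using double-underscore nested names. A sketch of this idea, reusing the same dataset (the parameter values searched are illustrative assumptions):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 36, 44],
    "income": [30000, 50000, 70000, 65000, 90000, 28000, 52000, 61000],
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male", "Female", "Male"],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "purchased": [0, 1, 1, 0, 1, 0, 1, 0],
})
X, y = df.drop("purchased", axis=1), df["purchased"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["age", "income"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("encoder", OneHotEncoder(handle_unknown="ignore"))]), ["gender", "city"]),
])

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(random_state=42))
])

# Nested parameter names follow step__substep__param
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__n_estimators": [50, 100],
}
search = GridSearchCV(pipe, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

Every candidate is cross-validated with the full preprocessing refit inside each fold, so no information leaks from validation data into the imputation or scaling statistics.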
Key Takeaways
- Real-world datasets contain mixed feature types.
- ColumnTransformer applies correct transformations per column group.
- Separate pipelines keep preprocessing clean and modular.
- Combining with Pipeline ensures end-to-end consistency.
- A critical concept for scalable production ML systems.
Conclusion
ColumnTransformer enables us to handle mixed data types in a structured and production-ready way. By applying specialised preprocessing to numeric and categorical features within a unified workflow, we eliminate inconsistencies and build more reliable machine learning systems.
This makes it a key component in the Production ML track of the AI with Python series — bridging the gap between experimentation and real-world deployment.
Code Snippet:
# 📦 Import Required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# 🧩 Create a Mixed-Type Dataset
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62, 23, 36, 44],
    "income": [30000, 50000, 70000, 65000, 90000, 28000, 52000, 61000],
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male", "Female", "Male"],
    "city": ["A", "B", "A", "C", "B", "A", "C", "B"],
    "purchased": [0, 1, 1, 0, 1, 0, 1, 0]
})
# ✂️ Split Features and Target
X = df.drop("purchased", axis=1)
y = df["purchased"]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,
    random_state=42,
    stratify=y
)
# 🔢 Define Numeric and Categorical Columns
numeric_features = ["age", "income"]
categorical_features = ["gender", "city"]
# 🔧 Build Preprocessing Pipelines
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
# 🧱 Combine with ColumnTransformer
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
# 🤖 Build Full Pipeline with Model
model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42))
])
# 🚀 Train the Pipeline
model_pipeline.fit(X_train, y_train)
# 📊 Evaluate the Model
y_pred = model_pipeline.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
# 🔍 Predict on New Mixed-Type Data
new_data = pd.DataFrame({
    "age": [40, 29],
    "income": [58000, 32000],
    "gender": ["Female", "Male"],
    "city": ["C", "A"]
})
predictions = model_pipeline.predict(new_data)
print("Predictions:", predictions)