AW Dev Rethought

Truth can only be found in one place: the code - Robert C. Martin

⚡️ Saturday ML Spark – 🏷️ Encoding High Cardinality Categories


Description:

Categorical features are everywhere in machine learning:

  • city names
  • product IDs
  • customer IDs
  • transaction categories
  • search keywords

While categorical data is extremely useful, problems arise when a feature contains too many unique values. This is known as high cardinality.

In this project, we explore practical ways to encode high-cardinality categorical features for scalable machine learning systems.


Understanding the Problem

A categorical feature with only a few values is easy to handle.

Example:

Red, Blue, Green

But real-world datasets often look like:

city_1, city_2, city_3 ... city_5000

This creates major challenges during preprocessing.


Why One-Hot Encoding Becomes a Problem

A common encoding approach is one-hot encoding.

pd.get_dummies(data["city"])

For high-cardinality features, this creates:

  • thousands of columns
  • sparse data (almost every entry is zero)
  • high memory consumption
  • slower training pipelines

This becomes inefficient for large-scale machine learning systems.
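
To make the cost concrete, here is a minimal sketch that one-hot encodes a synthetic column and reports the resulting shape and the share of zero cells. The data (5,000 rows drawn from 1,000 made-up city names) is an assumption for illustration, not the dataset used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5,000 rows drawn from 1,000 made-up city names
rng = np.random.default_rng(0)
cities = [f"city_{i}" for i in range(1000)]
data = pd.DataFrame({"city": rng.choice(cities, size=5000)})

# One-hot encoding: one column per distinct city that appears in the data
one_hot = pd.get_dummies(data["city"])

n_unique = data["city"].nunique()
print("Unique cities:", n_unique)
print("One-hot shape:", one_hot.shape)

# Each row has exactly one active column, so nearly every cell is zero
sparsity = 1 - one_hot.to_numpy().sum() / one_hot.size
print(f"Share of zero cells: {sparsity:.4f}")
```

The column count grows with the number of distinct categories, while the amount of information per row stays the same: a single active cell.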


What Is High Cardinality?

High cardinality means:

A feature contains a very large number of unique categories.

Examples include:

  • user IDs
  • zip codes
  • product SKUs
  • URLs
  • search queries

Handling them efficiently is important for scalable ML workflows.
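
A quick way to spot such features is to count distinct values per column. The sketch below uses a small hypothetical DataFrame; the column names and the threshold of 50 are assumptions chosen for illustration, not a standard cutoff.

```python
import pandas as pd

# Hypothetical dataset: column names and values are made up for illustration
df = pd.DataFrame({
    "user_id": [f"user_{i}" for i in range(100)],        # unique per row
    "zip_code": [str(10000 + i % 60) for i in range(100)],  # 60 distinct values
    "color": ["Red", "Blue", "Green", "Blue"] * 25,       # 3 distinct values
})

# Number of distinct values per column
cardinality = df.nunique()

# Distinct values relative to row count (1.0 means every row is unique)
ratio = cardinality / len(df)

# Arbitrary threshold for this sketch; tune it for your own data
THRESHOLD = 50
high_card_cols = cardinality[cardinality > THRESHOLD].index.tolist()

print(cardinality)
print(high_card_cols)
```

The ratio is often more telling than the raw count: a column where almost every row is unique, such as an ID, behaves very differently from one with many repeated values.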


Frequency Encoding

One practical solution is frequency encoding.

Instead of creating many columns, we replace each category with how frequently it appears.

city_freq = data["city"].value_counts()

data["city_freq_encoded"] = data["city"].map(city_freq)

Now the feature is a single numeric column that records how common each category is in the data.
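
In a pipeline with separate training and test data, the counts would usually be learned from the training rows and then reused. The sketch below shows one way to do that on hypothetical data; filling unseen categories with 0 is an assumed convention, not the only reasonable choice.

```python
import pandas as pd

# Hypothetical train/test columns; "city_9" never appears in training
train = pd.DataFrame({"city": ["city_1", "city_1", "city_2", "city_3", "city_1"]})
test = pd.DataFrame({"city": ["city_1", "city_3", "city_9"]})

# Learn the counts from the training data only
city_freq = train["city"].value_counts()

# Apply the same mapping to both sets. Categories missing from the
# training counts map to NaN, so fill them with 0 (assumed convention).
train["city_freq_encoded"] = train["city"].map(city_freq)
test["city_freq_encoded"] = test["city"].map(city_freq).fillna(0).astype(int)

print(test)
```

Without the `fillna` step, an unseen city would leave a missing value in the test features, which many models cannot accept.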


Training the Model

We train the model using encoded features.

model.fit(X_train, y_train)

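Because each city is represented by a single number, the model sees one city column instead of one column per city, which keeps the feature space compact.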


Why Frequency Encoding Works

Frequency encoding helps because:

  • how common a category is can itself carry predictive signal
  • the feature stays a single column, however many categories exist
  • memory usage stays manageable
  • models train faster than with thousands of one-hot columns

It is especially useful in tabular machine learning pipelines.
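
One caveat is worth checking in practice: categories that occur equally often receive the same encoded value, so the model cannot tell them apart. The following sketch, on hypothetical data, uses relative frequencies and counts how many distinct encoded values remain.

```python
import pandas as pd

# Hypothetical column: city_a and city_b both occur twice
data = pd.DataFrame({"city": ["city_a", "city_a", "city_b", "city_b", "city_c"]})

# Relative frequency (share of rows) instead of raw counts
city_share = data["city"].value_counts(normalize=True)
data["city_freq_encoded"] = data["city"].map(city_share)

# Categories that share an encoded value are indistinguishable to the model
n_categories = data["city"].nunique()
n_encoded_values = data["city_freq_encoded"].nunique()
print(n_categories, n_encoded_values)
```

If the second number is much smaller than the first, a lot of category identity has been merged, and one of the alternative encodings may be a better fit.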


Other High-Cardinality Encoding Techniques

Besides frequency encoding, common alternatives include:

  • Target Encoding
  • Hash Encoding
  • Leave-One-Out Encoding
  • Embedding Layers (deep learning)

The best method depends on the dataset and model type.
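
As a rough illustration of two of these alternatives, here is a minimal sketch of hash encoding and of target encoding with additive smoothing. The data, the `hash_bucket` helper, and the `N_BUCKETS` and `SMOOTHING` values are all hypothetical choices for this example, not part of the project code.

```python
import hashlib

import pandas as pd

# Hypothetical data; column names and values are made up for illustration
data = pd.DataFrame({
    "city": ["city_1", "city_1", "city_2", "city_2", "city_2", "city_3"],
    "purchased": [1, 0, 1, 1, 1, 0],
})

# --- Hash encoding: map each category to one of N_BUCKETS integer buckets ---
N_BUCKETS = 8  # assumed bucket count; fewer buckets means more collisions


def hash_bucket(value: str, n_buckets: int = N_BUCKETS) -> int:
    # sha256 gives the same result across runs, unlike Python's built-in hash()
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


data["city_hash"] = data["city"].map(hash_bucket)

# --- Target encoding: per-category target mean, shrunk toward the global mean ---
SMOOTHING = 2.0  # assumed smoothing weight; larger values shrink more
global_mean = data["purchased"].mean()
stats = data.groupby("city")["purchased"].agg(["sum", "count"])
target_enc = (stats["sum"] + SMOOTHING * global_mean) / (stats["count"] + SMOOTHING)
data["city_target"] = data["city"].map(target_enc)

print(data)
```

Hash encoding caps the number of distinct values at the bucket count, at the price of collisions between unrelated categories. Target encoding uses the label, so in a real pipeline the statistics should come from training data only (ideally out-of-fold) to avoid leaking the target into the features.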


Where This Is Used

High-cardinality encoding is common in:

  • recommendation systems
  • e-commerce ML
  • fraud detection
  • ad-tech systems
  • customer analytics

These systems often contain massive categorical spaces.


Key Takeaways

  1. High-cardinality features contain many unique categories.
  2. One-hot encoding becomes inefficient at scale.
  3. Frequency encoding is a compact alternative.
  4. Efficient encoding improves scalability and training speed.
  5. Proper categorical handling is critical in real-world ML systems.

Conclusion

Encoding high-cardinality categorical features is an important challenge in practical machine learning. By using techniques like frequency encoding, we can efficiently represent large categorical spaces without exploding feature dimensions.


Code Snippet:

# 📦 Import Required Libraries
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# 🧩 Create Sample Dataset
np.random.seed(42)

n_samples = 1000

data = pd.DataFrame({
    "user_id": [f"user_{i}" for i in range(n_samples)],

    "city": np.random.choice(
        [f"city_{i}" for i in range(200)],
        size=n_samples
    ),

    "age": np.random.randint(18, 60, size=n_samples),

    # Random labels: there is no real signal here, so expect accuracy near 0.5
    "purchased": np.random.randint(0, 2, size=n_samples)
})


# =========================================================
# 🚨 Problem with One-Hot Encoding
# =========================================================

# Uncomment below to observe dimensional explosion
# city_one_hot = pd.get_dummies(data["city"])
# print(city_one_hot.shape)


# =========================================================
# ✅ Frequency Encoding
# =========================================================

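# Counts come from the full dataset here to keep the demo short; in a real
# pipeline, compute them on the training split only and reuse them for test data
city_freq = data["city"].value_counts()

# Replace each city with the number of rows in which it appears
data["city_freq_encoded"] = data["city"].map(city_freq)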


# =========================================================
# ✂️ Prepare Features
# =========================================================

X = data[["age", "city_freq_encoded"]]
y = data["purchased"]


# =========================================================
# ✂️ Split Data
# =========================================================

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=42
)


# =========================================================
# 🤖 Train Model
# =========================================================

model = LogisticRegression(max_iter=5000)

model.fit(X_train, y_train)


# =========================================================
# 📊 Evaluate Model
# =========================================================

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))


# =========================================================
# 🔍 View Encoded Dataset
# =========================================================

print("\nSample Encoded Data:")
print(data.head())
