How to Handle High-Cardinality Categorical Columns Efficiently

May 19, 2026

Machine learning projects often fail quietly because of one overlooked issue: high-cardinality categorical variables.

Columns such as customer_id, product_name, city, device_id, or merchant_code may contain hundreds, thousands, or even millions of unique values.

If handled poorly, these features can:

Explode memory usage
Slow model training
Create sparse matrices
Cause severe overfitting
Reduce model interpretability

Efficient handling of high-cardinality columns is therefore a core feature engineering skill for any serious data scientist or ML engineer.

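The code examples use the Amazon Sales Dataset, loaded into a pandas DataFrame named df. Columns such as category, product_name, and discounted_price come from that dataset; names such as Product and ProductID are generic placeholders.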

What Is a High-Cardinality Column?

A categorical column has high cardinality when it contains many unique categories relative to the dataset size.

Examples:

Column	Unique Values
Country	54
Department	12
ProductID	250,000
UserEmail	1.2 million

A column like Country is manageable with almost any encoding.
A column like ProductID quickly becomes expensive in memory and compute if it is encoded naively.
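Before picking an encoding, it helps to measure cardinality rather than guess. The sketch below is one way to do that with pandas; the toy DataFrame and its column names are made up for illustration and stand in for a real dataset.

```python
import pandas as pd

# Toy frame standing in for a real dataset (values are invented for illustration)
df = pd.DataFrame({
    "country": ["KE", "KE", "UG", "TZ", "KE", "UG"],
    "product_id": ["p1", "p2", "p3", "p4", "p5", "p1"],
})

# Unique values per column, in absolute terms and as a share of the row count
profile = pd.DataFrame({
    "n_unique": df.nunique(),
    "unique_ratio": df.nunique() / len(df),
})

# Columns whose ratio approaches 1.0 behave almost like identifiers
print(profile.sort_values("unique_ratio", ascending=False))
```

A ratio close to 1.0 means nearly every row has its own category, which is usually a sign that the column needs one of the strategies below or should be dropped.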

Why One-Hot Encoding Fails

Many beginners immediately apply:

pd.get_dummies(df['ProductID'])

This becomes disastrous at scale.

If ProductID has 100,000 unique values, one-hot encoding creates 100,000 new columns.

Problems include:

Massive RAM consumption
Extremely sparse datasets
Longer training times
Weak generalization
Difficulty deploying models

For high-cardinality data, smarter encoding strategies are required.
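To make the memory cost concrete, here is a back-of-the-envelope estimate. It assumes a dense one-hot matrix with one byte per cell and a hypothetical dataset of one million rows; real numbers depend on dtype and on whether a sparse format is used.

```python
def dense_one_hot_bytes(n_rows: int, n_categories: int, bytes_per_cell: int = 1) -> int:
    """Bytes needed for a dense one-hot matrix (one cell per row per category)."""
    return n_rows * n_categories * bytes_per_cell


def single_code_column_bytes(n_rows: int, bytes_per_code: int = 4) -> int:
    """Bytes needed to store one 32-bit integer code per row instead."""
    return n_rows * bytes_per_code


# Hypothetical sizes: 1 million rows, 100,000 distinct ProductID values
n_rows, n_categories = 1_000_000, 100_000

dense = dense_one_hot_bytes(n_rows, n_categories)
codes = single_code_column_bytes(n_rows)

print(f"dense one-hot: {dense / 1e9:.0f} GB")   # dense one-hot: 100 GB
print(f"integer codes: {codes / 1e6:.0f} MB")   # integer codes: 4 MB
```

Sparse storage shrinks the one-hot matrix considerably, but the model still has to deal with 100,000 mostly empty features.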

1. Frequency Encoding

Frequency encoding replaces each category with how often it appears, either as a raw count or as a share of all rows.

Example:

Product	Count
A	1200
B	560
C	34

Implementation

freq = df['Product'].value_counts()

df['Product_freq'] = df['Product'].map(freq)

Advantages

Very memory efficient
Preserves distribution information
Works well with tree-based models

Best For

XGBoost
LightGBM
Random Forests
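In practice the mapping has to be learned on training data and then reused on new data, where some categories may never have been seen. A minimal sketch, assuming relative frequencies and a fallback of 0 for unseen categories (both are choices, not requirements); the toy frames are invented for illustration:

```python
import pandas as pd

train = pd.DataFrame({"product": ["A", "A", "A", "B", "B", "C"]})
new = pd.DataFrame({"product": ["A", "C", "D"]})  # "D" never appeared in training

# Learn relative frequencies from the training data only
freq = train["product"].value_counts(normalize=True)

# Apply the same mapping everywhere; unseen categories fall back to 0
train["product_freq"] = train["product"].map(freq)
new["product_freq"] = new["product"].map(freq).fillna(0.0)

print(new)
```

One known limitation: two different categories with the same frequency receive the same encoded value, so the model cannot tell them apart from this feature alone.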

2. Target Encoding

Target encoding replaces each category with the mean of the target variable observed for that category.

Example:

City	Average Sales
Nairobi	420
Mombasa	315
Kisumu	210

Implementation

# Clean the 'discounted_price' column and convert it to numeric
df['discounted_price_numeric'] = (
    df['discounted_price']
    .str.replace('₹', '', regex=False)   # strip the currency symbol
    .str.replace(',', '', regex=False)   # strip thousands separators
    .astype(float)
)

# Group by 'category' and calculate the mean of the new numeric discounted_price
target_mean = df.groupby('category')['discounted_price_numeric'].mean()

# Map the target mean back to a new column 'category_encoded'
df['category_encoded'] = df['category'].map(target_mean)

Important Warning

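Target encoding leaks the target into the features if the category means are computed on rows that are later used for validation or testing.

Always compute encodings using:

Training data only
Out-of-fold estimates within cross-validation

Never compute the encoding on the full dataset before splitting. The snippet above does so only to show the mechanics.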
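One common way to follow this advice is out-of-fold encoding: each row is encoded using means computed from the other folds only. The sketch below is an illustration under stated assumptions, not a reference implementation. The function name oof_target_encode, the additive smoothing toward the global mean, and the default parameter values are all choices made here for the example.

```python
import numpy as np
import pandas as pd


def oof_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold mean encoding with additive smoothing toward the fold prior."""
    rng = np.random.default_rng(seed)
    fold_id = rng.permutation(len(df)) % n_splits  # random, near-equal folds
    encoded = pd.Series(np.nan, index=df.index, dtype=float)

    for k in range(n_splits):
        held_out = fold_id == k
        fit_part = df.loc[~held_out]
        prior = fit_part[target].mean()
        stats = fit_part.groupby(col)[target].agg(["sum", "count"])
        # Smoothed mean: categories with few rows are pulled toward the prior
        smoothed = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories absent from the fitting folds fall back to the prior
        encoded.loc[held_out] = (
            df.loc[held_out, col].map(smoothed).fillna(prior).to_numpy()
        )
    return encoded


# Toy data, invented for illustration
df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b", "c"] * 5,
    "price": [10.0, 12.0, 11.0, 50.0, 55.0, 99.0] * 5,
})
df["category_te"] = oof_target_encode(df, "category", "price")
print(df.groupby("category")["category_te"].mean())
```

At inference time there is no target to hold out, so the usual approach is to fit one final mapping on the full training set and apply it to new rows.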

3. Hash Encoding

Hash encoding maps categories into a fixed number of bins using a hash function.

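Instead of creating 100,000 columns, you may create only 64 or 128. The trade-off is collisions: different categories can land in the same bin, and the model cannot tell them apart.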

Benefits

Fixed memory size
Very scalable
Excellent for streaming systems

Example

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(
    n_features=32,
    input_type='string'
)

# Split each pipe-delimited category string into a list of tokens, one list per row
tokens = df['category'].astype(str).str.split('|')

# transform returns a SciPy sparse matrix of shape (n_rows, 32)
hashed = hasher.transform(tokens)

Best For

Large-scale ML systems
Real-time pipelines
Recommendation systems
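To show what happens under the hood, here is a minimal hashing-trick sketch using only the standard library. It deliberately swaps in SHA-256 for the hash function; it is not how FeatureHasher is implemented. The function name hash_bucket and the sample category strings are assumptions for illustration.

```python
import hashlib


def hash_bucket(value: str, n_buckets: int = 32) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


categories = ["Electronics", "Home&Kitchen", "Computers&Accessories"]
buckets = {c: hash_bucket(c) for c in categories}
print(buckets)

# With more categories than buckets, collisions are unavoidable
many = {hash_bucket(f"product_{i}", n_buckets=8) for i in range(100)}
print(len(many))  # at most 8 distinct buckets for 100 categories
```

A cryptographic digest is used here because Python's built-in hash() for strings is randomized per process by default, which would give different buckets at training and serving time.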

4. Rare Category Grouping

Many categories appear only a few times.

Instead of preserving all rare labels, group them into "Other".

Example

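counts = df['product_name'].value_counts()

# Categories seen fewer than 10 times
rare = counts[counts < 10].index

# Keep frequent names and replace the rest with a single 'Other' label
df['product_name_clean'] = df['product_name'].where(~df['product_name'].isin(rare), 'Other')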

This:

Reduces noise
Improves generalization
Makes models more stable
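As with other encodings, the set of frequent categories should be learned on training data and reused later, so that categories unseen during training also map to "Other". A small sketch of that pattern; the helper names fit_frequent_categories and group_rare and the toy series are hypothetical:

```python
import pandas as pd


def fit_frequent_categories(values: pd.Series, min_count: int = 10) -> set:
    """Return the categories that appear at least min_count times."""
    counts = values.value_counts()
    return set(counts[counts >= min_count].index)


def group_rare(values: pd.Series, frequent: set, other_label: str = "Other") -> pd.Series:
    """Keep frequent categories and replace everything else with other_label."""
    return values.where(values.isin(frequent), other_label)


train = pd.Series(["x"] * 12 + ["y"] * 10 + ["z"] * 2)
new = pd.Series(["x", "z", "never_seen"])

frequent = fit_frequent_categories(train)
print(group_rare(new, frequent).tolist())  # ['x', 'Other', 'Other']
```

The threshold of 10 mirrors the example above; a sensible value depends on dataset size and is worth validating.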

5. Embedding Layers for Deep Learning

Neural networks can learn dense vector representations of categories.

Instead of one-hot vectors:

Category	Embedding
Product A	[0.12, -0.55, 0.81]
Product B	[0.77, 0.11, -0.44]

Embeddings are learned during training, so categories that behave similarly with respect to the target tend to end up with nearby vectors.

Best For

Recommendation engines
NLP systems
Deep learning pipelines
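Mechanically, an embedding layer is a lookup table: each category gets an integer index, and the index selects a row of a trainable matrix. The NumPy sketch below shows only the lookup, with randomly initialized values; a deep learning framework would also update the table by gradient descent. The category names and dimension are taken from the small table above, and the rest is illustrative.

```python
import numpy as np

categories = ["Product A", "Product B", "Product C"]
index_of = {name: i for i, name in enumerate(categories)}

embedding_dim = 3
rng = np.random.default_rng(0)
# One row per category; in a real model these values are learned, not random
embedding_table = rng.normal(scale=0.1, size=(len(categories), embedding_dim))

batch = ["Product B", "Product A", "Product B"]
ids = np.array([index_of[name] for name in batch])
vectors = embedding_table[ids]  # one dense vector per row in the batch

# The lookup equals a one-hot matrix times the table, without building the one-hot matrix
one_hot = np.eye(len(categories))[ids]
assert np.allclose(one_hot @ embedding_table, vectors)

print(vectors.shape)  # (3, 3)
```

The table has n_categories × embedding_dim parameters, which is why embeddings stay compact even with hundreds of thousands of categories.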

Frameworks with built-in embedding layers include PyTorch and TensorFlow/Keras.

6. Use Native Categorical Support

Some modern ML libraries handle categorical features directly.

Libraries with native categorical support include CatBoost and LightGBM.

These libraries:

Reduce preprocessing complexity
Handle high-cardinality features efficiently
Often outperform manual one-hot encoding
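A typical preparation step is converting text columns to the pandas category dtype. Gradient-boosting libraries such as LightGBM can consume such columns directly, though parameter names differ by library and version, so check the documentation of the one you use. The sketch below shows only the pandas side, on invented data.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Nairobi", "Mombasa", "Kisumu", "Nairobi"] * 1000})

before = df["city"].memory_usage(deep=True)
df["city"] = df["city"].astype("category")
after = df["city"].memory_usage(deep=True)

print(df["city"].dtype)                    # category
print(df["city"].cat.categories.tolist())  # the distinct labels
print(df["city"].cat.codes.head().tolist())  # small integer codes stored per row
print(after < before)                      # True
```

Each row now stores a small integer code instead of a full string, which cuts memory even before any model is involved.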

Choosing the Right Strategy

Method	Memory Efficient	Leakage Risk	Best For
One-Hot Encoding	No	Low	Low-cardinality features
Frequency Encoding	Yes	Low	Tree models
Target Encoding	Yes	High	Predictive power
Hash Encoding	Yes	Low	Massive datasets
Embeddings	Yes	Medium	Deep learning
Rare Grouping	Yes	Low	Data cleanup

Practical Rule of Thumb

Use this simple decision framework:

Fewer than 20 categories → One-hot encoding
20–500 categories → Frequency or target encoding
Thousands of categories → Hashing or embeddings
Extremely sparse labels → Rare category grouping
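The rule of thumb can be written down as a small helper. This is only a sketch of the thresholds listed above, with two assumptions added here: anything above 500 categories is treated like the "thousands" case, and "extremely sparse" is taken to mean that more than half of the categories are rare. The function and parameter names are hypothetical.

```python
def suggest_encoding(n_unique: int, rare_share: float = 0.0, rare_cutoff: float = 0.5):
    """Return (suggested encoding, whether to group rare categories first).

    rare_share is the fraction of categories considered rare; the 0.5 cutoff
    is an assumption for this sketch, not a rule from the article.
    """
    group_rare_first = rare_share > rare_cutoff
    if n_unique < 20:
        encoding = "one-hot"
    elif n_unique <= 500:
        encoding = "frequency or target"
    else:
        encoding = "hashing or embeddings"
    return encoding, group_rare_first


for n in (12, 300, 250_000):
    print(n, suggest_encoding(n))
```

Treat the output as a starting point: the final choice still depends on the model type and on validation results.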

Final Thoughts

High-cardinality categorical features are common in modern datasets:

E-commerce transactions
Telecom logs
Banking systems
Ad-tech platforms
User analytics pipelines

Efficient encoding strategies improve:

Training speed
Memory efficiency
Model accuracy
Production scalability

The best ML engineers do not simply “encode categories.” They choose encoding methods based on:

Dataset scale
Model architecture
Leakage risk
Inference constraints
Production requirements

That distinction separates prototype notebooks from production-grade machine learning systems.
