How to Handle High-Cardinality Categorical Columns Efficiently

Machine learning projects often fail quietly because of one overlooked issue: high-cardinality categorical variables.

 


Columns such as customer_id, product_name, city, device_id, or merchant_code may contain hundreds, thousands, or even millions of unique values.

If handled poorly, these features can:

  • Explode memory usage

  • Slow model training

  • Create sparse matrices

  • Cause severe overfitting

  • Reduce model interpretability

Efficient handling of high-cardinality columns is therefore a core feature engineering skill for any serious data scientist or ML engineer.

The examples below use the Amazon Sales Dataset.


What Is a High-Cardinality Column?

A categorical column has high cardinality when it contains many unique categories relative to the dataset size.

Examples:

Column          Unique Values
Country         54
Department      12
ProductID       250,000
UserEmail       1.2 million

A column like Country is manageable.
A column like ProductID becomes computationally expensive.


Why One-Hot Encoding Fails

Many beginners immediately apply:

pd.get_dummies(df['ProductID'])

This becomes disastrous at scale.

If ProductID has 100,000 unique values, one-hot encoding creates 100,000 new columns.


Problems include:

  • Massive RAM consumption

  • Extremely sparse datasets

  • Longer training times

  • Weak generalization

  • Difficulty deploying models


For high-cardinality data, smarter encoding strategies are required.
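
To make the scale problem concrete, here is a rough sketch on synthetic data. The DataFrame df_demo and its values are made up for illustration; only the ProductID column name comes from the example above. It counts how many cells a dense one-hot matrix would need and how few of them are non-zero.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a high-cardinality column (values are made up).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(
    {"ProductID": rng.integers(0, 1_000, size=5_000).astype(str)}
)

n_rows = len(df_demo)
n_unique = df_demo["ProductID"].nunique()

# A dense one-hot matrix stores one cell per (row, category) pair.
dense_cells = n_rows * n_unique

# Exactly one cell per row is non-zero, so the non-zero share is 1 / n_unique.
nonzero_fraction = n_rows / dense_cells

# sparse=True asks pandas to store only the non-zero entries.
dummies = pd.get_dummies(df_demo["ProductID"], sparse=True)

print(f"{n_unique} columns, {dense_cells:,} dense cells, "
      f"{nonzero_fraction:.4%} non-zero")
```

Sparse storage eases the memory pressure, but the model still has to deal with one feature per category, which is why the encodings below are usually a better fit.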


1. Frequency Encoding

Frequency encoding replaces each category with how often it appears.

Example:

Product    Count
A          1200
B          560
C          34

Implementation

freq = df['Product'].value_counts()

df['Product_freq'] = df['Product'].map(freq)
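
One practical detail is what happens when new data contains a category that never appeared during training. The sketch below is one possible way to handle it, assuming a hypothetical train/new split with made-up values: the frequencies are learned from the training data only, and unseen categories fall back to 0.

```python
import pandas as pd

# Hypothetical train / new-data split; the values are made up.
train = pd.DataFrame({"Product": ["A", "A", "A", "B", "B", "C"]})
new = pd.DataFrame({"Product": ["A", "C", "D"]})  # "D" never appeared in training

# Learn relative frequencies from the training data only.
freq_map = train["Product"].value_counts(normalize=True)

train["Product_freq"] = train["Product"].map(freq_map)

# Categories missing from the map become NaN; treat them as frequency 0.
new["Product_freq"] = new["Product"].map(freq_map).fillna(0.0)
```

Using relative frequencies (normalize=True) instead of raw counts is a design choice here; it keeps the feature on the same scale when the training set grows.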




Advantages

  • Very memory efficient

  • Preserves distribution information

  • Works well with tree-based models

Best For

  • XGBoost

  • LightGBM

  • Random Forests


2. Target Encoding

Target encoding replaces categories with the mean target value.

Example:

City       Average Sales
Nairobi    420
Mombasa    315
Kisumu     210

Implementation

# Clean the 'discounted_price' column (strip '₹' and thousands separators) and convert it to numeric
df['discounted_price_numeric'] = (
    df['discounted_price']
    .str.replace('₹', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)

# Mean target value per category. For illustration this uses the whole DataFrame;
# in a real pipeline, compute it on the training split only (see the warning below).
target_mean = df.groupby('category')['discounted_price_numeric'].mean()

# Map each row's category to its mean target value
df['category_encoded'] = df['category'].map(target_mean)

Important Warning

Target encoding can cause data leakage if done incorrectly.

Always compute encodings using:

  • Cross-validation folds

  • Training data only

Never use the full dataset before splitting.
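
One common way to follow this advice is out-of-fold encoding: each row is encoded with target means computed from the other folds, so a row never sees its own target. The sketch below is a minimal version under stated assumptions. The function name oof_target_encode, the demo DataFrame, and the fallback to the global training mean for unseen categories are all illustrative choices, not a standard API.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(train, col, target, n_splits=5, seed=0):
    """Encode `col` with target means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    global_mean = train[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(train):
        # Means are learned on the fitting folds...
        fold_means = train.iloc[fit_idx].groupby(col)[target].mean()
        # ...and applied to the held-out fold.
        values = train.iloc[enc_idx][col].map(fold_means)
        encoded.iloc[enc_idx] = values.to_numpy()
    # Categories unseen in the fitting folds fall back to the global mean.
    return encoded.fillna(global_mean)

# Made-up training data; "c" appears only once.
demo = pd.DataFrame({
    "category": ["a", "a", "b", "b", "a", "b", "c", "a", "b", "a"],
    "price": [10.0, 12.0, 30.0, 28.0, 11.0, 31.0, 50.0, 9.0, 29.0, 13.0],
})
demo["category_encoded"] = oof_target_encode(demo, "category", "price")
```

Note that the single "c" row cannot learn anything from its own target, so it receives the global mean. For validation or test data, the usual approach is to map means computed on the full training set.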


3. Hash Encoding

Hash encoding maps categories into a fixed number of bins using a hash function.

Instead of creating 100,000 columns, you may create only 64 or 128.

Benefits

  • Fixed memory size

  • Very scalable

  • Excellent for streaming systems

Example

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(
    n_features=32,
    input_type='string'
)

# Split the category string into a list of strings for each row
hashed = hasher.transform(df['category'].astype(str).apply(lambda x: x.split('|')))
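
To see what the hasher produces, here is a small self-contained sketch. The category strings are made up, written in the same pipe-separated style as the dataset's category column.

```python
from sklearn.feature_extraction import FeatureHasher

# Made-up category strings in a pipe-separated style.
categories = [
    "Electronics|Cables|USB",
    "Home|Kitchen|Appliances",
    "Electronics|Cables|USB",
]

hasher = FeatureHasher(n_features=32, input_type="string")

# Each row is a list of tokens; every token is hashed into one of 32 columns.
hashed = hasher.transform([c.split("|") for c in categories])

print(hashed.shape)  # (3, 32): the width is fixed regardless of category count

dense = hashed.toarray()

# Hashing is stateless, so identical inputs always produce identical rows.
same_rows = (dense[0] == dense[2]).all()
```

The output is a SciPy sparse matrix. Because the width is fixed, different tokens can land in the same column (a collision); a larger n_features makes that less likely at the cost of more columns.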



Best For

  • Large-scale ML systems

  • Real-time pipelines

  • Recommendation systems


4. Rare Category Grouping

Many categories appear only a few times.

Instead of preserving all rare labels, group them into "Other".

Example

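counts = df['product_name'].value_counts()

# Categories that appear fewer than 10 times
rare = counts[counts < 10].index

# Replace rare product names with 'Other'; keep everything else unchanged
df['product_name_clean'] = df['product_name'].where(~df['product_name'].isin(rare), 'Other')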

This:

  • Reduces noise

  • Improves generalization

  • Makes models more stable
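
In a train/test setting, the list of frequent categories is best learned once from the training data and then reused, so that new or unseen labels also end up in "Other". A minimal sketch, assuming hypothetical helper names (fit_frequent_categories, group_rare) and made-up data:

```python
import pandas as pd

def fit_frequent_categories(series, min_count=10):
    """Return the set of categories seen at least `min_count` times."""
    counts = series.value_counts()
    return set(counts[counts >= min_count].index)

def group_rare(series, frequent, other_label="Other"):
    """Keep frequent categories; replace everything else with `other_label`."""
    return series.where(series.isin(frequent), other_label)

# Made-up data: "y" is rare in training, "w" never appears in training.
train = pd.Series(["x"] * 12 + ["y"] * 3 + ["z"] * 10)
new = pd.Series(["x", "y", "w"])

frequent = fit_frequent_categories(train, min_count=10)
train_clean = group_rare(train, frequent)
new_clean = group_rare(new, frequent)
```

Storing the frequent set rather than the rare set is a deliberate choice in this sketch: anything not explicitly known to be frequent is grouped, including categories that show up for the first time at inference.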


5. Embedding Layers for Deep Learning

Neural networks can learn dense vector representations of categories.

Instead of one-hot vectors:

Category     Embedding
Product A    [0.12, -0.55, 0.81]
Product B    [0.77, 0.11, -0.44]

Once trained, embeddings tend to place categories that behave similarly close together in the vector space.
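
Mechanically, an embedding layer is a lookup table indexed by an integer category code. The NumPy sketch below shows only that lookup step; the random table is a stand-in for weights that a deep learning framework would learn during training, and the product names are made up.

```python
import numpy as np
import pandas as pd

products = pd.Series(["Product A", "Product B", "Product A", "Product C"])

# Step 1: map each category to an integer index.
codes, uniques = pd.factorize(products)

# Step 2: an embedding table holds one dense vector per category.
# Random values stand in for weights a network would learn during training.
embedding_dim = 3
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(uniques), embedding_dim))

# Step 3: the "embedding layer" is a row lookup by index.
vectors = embedding_table[codes]
```

Each row of vectors is the 3-dimensional representation of the matching product, and rows for the same product are identical. The table has one row per category but only embedding_dim columns, which is what keeps the representation compact.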

Best For

  • Recommendation engines

  • NLP systems

  • Deep learning pipelines

Frameworks:


6. Use Native Categorical Support

Some modern ML libraries handle categorical features directly.

Excellent options include:

These libraries:

  • Reduce preprocessing complexity

  • Handle high-cardinality features efficiently

  • Often outperform manual one-hot encoding
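
Libraries with native support generally need to be told which columns are categorical, and the exact mechanism varies by library. One common convention, shown here as an assumption rather than a universal rule, is to store the column with the pandas category dtype, which keeps integer codes plus a small lookup of labels. The DataFrame below is made up.

```python
import pandas as pd

# Made-up column with only three distinct labels repeated many times.
df_demo = pd.DataFrame(
    {"city": ["Nairobi", "Mombasa", "Kisumu", "Nairobi"] * 1000}
)

object_bytes = df_demo["city"].memory_usage(deep=True)

# Convert to the pandas categorical dtype.
df_demo["city"] = df_demo["city"].astype("category")

category_bytes = df_demo["city"].memory_usage(deep=True)

# Each row now holds a small integer code pointing into the category list.
codes = df_demo["city"].cat.codes
labels = df_demo["city"].cat.categories.tolist()

print(f"object: {object_bytes:,} bytes, category: {category_bytes:,} bytes")
```

Check the documentation of the library you use for how it expects categorical columns to be declared before relying on this conversion.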


Choosing the Right Strategy

Method                Memory Efficient    Leakage Risk    Best For
One-Hot Encoding      No                  Low             Low-cardinality features
Frequency Encoding    Yes                 Low             Tree models
Target Encoding       Yes                 High            Predictive power
Hash Encoding         Yes                 Low             Massive datasets
Embeddings            Yes                 Medium          Deep learning
Rare Grouping         Yes                 Low             Data cleanup


Practical Rule of Thumb

Use this simple decision framework:

  • Fewer than 20 categories → One-hot encoding

  • 20–500 categories → Frequency or target encoding

  • Thousands of categories → Hashing or embeddings

  • Extremely sparse labels → Rare category grouping
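
The rule of thumb above can be written as a small helper. This is a sketch, not a definitive policy: the function name suggest_encoding is hypothetical, everything above 500 categories is treated as the "thousands" tier, and "extremely sparse" is assumed to mean that at least half of the categories appear fewer than 10 times.

```python
import pandas as pd

def suggest_encoding(series, rare_count=10, rare_share=0.5):
    """Return a list of suggested strategies for a categorical column."""
    counts = series.value_counts()
    n_unique = len(counts)
    suggestions = []

    # Assumed definition of "extremely sparse": at least `rare_share` of the
    # categories occur fewer than `rare_count` times.
    if n_unique > 0 and (counts < rare_count).mean() >= rare_share:
        suggestions.append("rare category grouping")

    if n_unique < 20:
        suggestions.append("one-hot encoding")
    elif n_unique <= 500:
        suggestions.append("frequency or target encoding")
    else:
        # Assumption: anything above 500 is handled like "thousands".
        suggestions.append("hashing or embeddings")
    return suggestions

# Made-up columns: three well-populated labels vs. 600 one-off identifiers.
few = pd.Series(["a", "b", "c"] * 20)
many = pd.Series([f"id_{i}" for i in range(600)])

few_result = suggest_encoding(few)
many_result = suggest_encoding(many)
```

Rare grouping is returned alongside another strategy rather than instead of it, since grouping is typically a cleanup step applied before the main encoding.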


Final Thoughts

High-cardinality categorical features are common in modern datasets:

  • E-commerce transactions

  • Telecom logs

  • Banking systems

  • Ad-tech platforms

  • User analytics pipelines

Efficient encoding strategies improve:

  • Training speed

  • Memory efficiency

  • Model accuracy

  • Production scalability


The best ML engineers do not simply “encode categories.” They choose encoding methods based on:

  • Dataset scale

  • Model architecture

  • Leakage risk

  • Inference constraints

  • Production requirements


That distinction separates prototype notebooks from production-grade machine learning systems.

