How to Handle High-Cardinality Categorical Columns Efficiently
Machine learning projects often fail quietly because of one overlooked issue: high-cardinality categorical variables.
Columns such as customer_id, product_name, city, device_id, or merchant_code may contain hundreds, thousands, or even millions of unique values.
If handled poorly, these features can:
Explode memory usage
Slow model training
Create sparse matrices
Cause severe overfitting
Reduce model interpretability
Efficient handling of high-cardinality columns is therefore a core feature engineering skill for any serious data scientist or ML engineer.
The examples below use the Amazon Sales Dataset.
What Is a High-Cardinality Column?
A categorical column has high cardinality when it contains many unique categories relative to the dataset size.
Examples:
| Column | Unique Values |
|---|---|
| Country | 54 |
| Department | 12 |
| ProductID | 250,000 |
| UserEmail | 1.2 million |
A column like Country is manageable.
A column like ProductID becomes computationally expensive.
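One way to put a number on "many unique categories relative to the dataset size" is to compare each column's unique count with the row count. The sketch below assumes a pandas DataFrame; the helper name `cardinality_report` and the toy data are illustrative and do not come from the dataset.

```python
import pandas as pd

def cardinality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return unique counts and unique-to-row ratios for non-numeric columns."""
    cat_cols = df.select_dtypes(exclude="number").columns
    n_unique = df[cat_cols].nunique()
    report = pd.DataFrame({
        "n_unique": n_unique,
        "unique_ratio": n_unique / len(df),
    })
    return report.sort_values("n_unique", ascending=False)

# Toy data: 'user_email' is unique per row, 'country' has only two values.
toy = pd.DataFrame({
    "country": ["KE", "KE", "UG", "KE"],
    "user_email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com"],
    "amount": [10.0, 12.5, 7.0, 3.2],
})
report = cardinality_report(toy)
```

A ratio close to 1.0 (as for `user_email` here) suggests the column behaves more like an identifier than a category.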
Why One-Hot Encoding Fails
Many beginners immediately apply:
pd.get_dummies(df['ProductID'])
This becomes disastrous at scale.
If ProductID has 100,000 unique values, one-hot encoding creates 100,000 new columns.
Problems include:
Massive RAM consumption
Extremely sparse datasets
Longer training times
Weak generalization
Difficulty deploying models
For high-cardinality data, smarter encoding strategies are required.
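A back-of-the-envelope calculation shows why the dense one-hot matrix is the problem. The row count below is an assumption chosen for illustration, as is the 1 byte per cell; sparse representations would need far less memory than this dense estimate.

```python
def dense_one_hot_bytes(n_rows: int, n_categories: int, bytes_per_value: int = 1) -> int:
    """Bytes needed for a dense one-hot matrix (one cell per row per category)."""
    return n_rows * n_categories * bytes_per_value

def single_column_bytes(n_rows: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for one numeric column, e.g. a frequency-encoded feature."""
    return n_rows * bytes_per_value

n_rows = 1_000_000        # assumed row count, for illustration only
n_categories = 100_000    # unique ProductID values, as in the text

one_hot_gb = dense_one_hot_bytes(n_rows, n_categories) / 1e9
encoded_mb = single_column_bytes(n_rows) / 1e6
print(f"dense one-hot: {one_hot_gb:.0f} GB, single encoded column: {encoded_mb:.0f} MB")
# prints: dense one-hot: 100 GB, single encoded column: 8 MB
```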
1. Frequency Encoding
Frequency encoding replaces each category with how often it appears.
Example:
| Product | Count |
|---|---|
| A | 1200 |
| B | 560 |
| C | 34 |
Implementation
freq = df['Product'].value_counts()
df['Product_freq'] = df['Product'].map(freq)
Advantages
Very memory efficient
Preserves distribution information
Works well with tree-based models
Best For
XGBoost
LightGBM
Random Forests
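In practice the frequency table is usually learned on the training data and reused at inference time, where new categories can show up. The sketch below is one possible way to handle that; the toy frames and the choice to encode unseen categories as 0 are assumptions, not part of the original example.

```python
import pandas as pd

train = pd.DataFrame({"Product": ["A", "A", "A", "B", "B", "C"]})
new_data = pd.DataFrame({"Product": ["A", "C", "D"]})  # "D" never appeared in training

# Learn relative frequencies from the training data only.
freq = train["Product"].value_counts(normalize=True)

# Apply the same mapping everywhere; unseen categories get a frequency of 0.
train["Product_freq"] = train["Product"].map(freq)
new_data["Product_freq"] = new_data["Product"].map(freq).fillna(0.0)
```

Using `normalize=True` keeps the feature on a 0 to 1 scale, so the encoding does not shift just because the training set grows.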
2. Target Encoding
Target encoding replaces each category with the mean of the target variable for that category.
Example:
| City | Average Sales |
|---|---|
| Nairobi | 420 |
| Mombasa | 315 |
| Kisumu | 210 |
Implementation
# Clean the 'discounted_price' column and convert it to numeric
df['discounted_price_numeric'] = df['discounted_price'].str.replace('₹', '').str.replace(',', '').astype(float)
# Group by 'category' and calculate the mean of the new numeric discounted_price
target_mean = df.groupby('category')['discounted_price_numeric'].mean()
# Map the target mean back to a new column 'category_encoded'
df['category_encoded'] = df['category'].map(target_mean)
Important Warning
Target encoding can cause data leakage if done incorrectly.
Always compute encodings using:
Cross-validation folds
Training data only
Never compute the encoding on the full dataset before splitting, because each row's own target value would then leak into its feature.
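The fold-based approach can be sketched as follows: each row is encoded with category means computed from the other folds only. This is a minimal illustration, not a definitive implementation; the function name `out_of_fold_target_encode`, the toy data, and the fallback to the global mean for categories unseen outside their fold are all assumptions.

```python
import numpy as np
import pandas as pd

def out_of_fold_target_encode(df, col, target, n_splits=5, seed=0):
    """Encode `col` with target means computed on the other folds only."""
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(df)) % n_splits   # random, near-equal folds
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in range(n_splits):
        in_fold = fold_ids == fold
        # Category means come from rows outside the current fold.
        means = df.loc[~in_fold].groupby(col)[target].mean()
        encoded.loc[in_fold] = df.loc[in_fold, col].map(means).to_numpy()
    # Categories unseen outside their fold fall back to the global mean.
    return encoded.fillna(df[target].mean())

toy = pd.DataFrame({
    "category": ["a", "a", "a", "a", "b", "b", "b", "b", "c"],
    "price": [10.0, 12.0, 11.0, 13.0, 50.0, 52.0, 49.0, 51.0, 30.0],
})
toy["category_encoded"] = out_of_fold_target_encode(toy, "category", "price", n_splits=3)
```

The single "c" row never sees its own price: no other fold contains "c", so it receives the global mean instead. At inference time, the mapping would be computed once on the full training set and applied to new data.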
3. Hash Encoding
Hash encoding maps categories into a fixed number of bins using a hash function.
Instead of creating 100,000 columns, you may create only 64 or 128.
Benefits
Fixed memory size
Very scalable
Excellent for streaming systems
Example
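A minimal sketch of the idea, using only the Python standard library: each category string is hashed to a stable bucket index, and the row gets a fixed-width indicator vector. The function names and the product ID are hypothetical. `hashlib.md5` is used here only because it gives the same result across runs, unlike Python's built-in `hash()` for strings.

```python
import hashlib
from typing import List

def hash_bucket(value: str, n_buckets: int = 64) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hash_encode(value: str, n_buckets: int = 64) -> List[int]:
    """Return a fixed-width indicator vector with a 1 in the value's bucket."""
    vector = [0] * n_buckets
    vector[hash_bucket(value, n_buckets)] = 1
    return vector

row = hash_encode("B07JW9H4J1", n_buckets=64)   # hypothetical product ID
```

The width stays at 64 no matter how many distinct IDs arrive, at the cost of collisions: different categories can share a bucket. Libraries such as scikit-learn provide a `FeatureHasher` for the same purpose.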
Best For
Large-scale ML systems
Real-time pipelines
Recommendation systems
4. Rare Category Grouping
Many categories appear only a few times.
Instead of preserving all rare labels, group them into "Other".
Example
counts = df['product_name'].value_counts()
rare = counts[counts < 10].index
df['product_name_clean'] = df['product_name'].where(~df['product_name'].isin(rare), 'Other')
This:
Reduces noise
Improves generalization
Makes models more stable
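When the grouping has to be reused on new data, it helps to learn the set of frequent categories once and apply it everywhere. The sketch below is one way to do that; the helper names and the toy series are assumptions for illustration.

```python
import pandas as pd

def fit_frequent_categories(series: pd.Series, min_count: int = 10) -> set:
    """Return the categories that appear at least `min_count` times."""
    counts = series.value_counts()
    return set(counts[counts >= min_count].index)

def group_rare(series: pd.Series, frequent: set, other_label: str = "Other") -> pd.Series:
    """Keep frequent categories and replace everything else with `other_label`."""
    return series.where(series.isin(frequent), other_label)

train = pd.Series(["A"] * 12 + ["B"] * 10 + ["C"] * 2)
new_data = pd.Series(["A", "C", "Z"])   # "Z" was never seen in training

frequent = fit_frequent_categories(train, min_count=10)
train_clean = group_rare(train, frequent)
new_clean = group_rare(new_data, frequent)
```

Both the rare "C" and the previously unseen "Z" end up in "Other", so the model never meets a label at inference time that it did not see during training.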
5. Embedding Layers for Deep Learning
Neural networks can learn dense vector representations of categories.
Instead of one-hot vectors:
| Category | Embedding |
|---|---|
| Product A | [0.12, -0.55, 0.81] |
| Product B | [0.77, 0.11, -0.44] |
Because the vectors are learned during training, categories that behave similarly with respect to the target tend to end up close together.
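Mechanically, an embedding layer is a lookup table: each category gets an integer index, and that index selects one row of a weight matrix. The NumPy sketch below shows only the lookup; the weights are random placeholders, whereas a real layer (for example `torch.nn.Embedding` or `tf.keras.layers.Embedding`) would learn them during training.

```python
import numpy as np

categories = ["Product A", "Product B", "Product C"]
# Step 1: map each category to an integer index.
index_of = {name: i for i, name in enumerate(categories)}

# Step 2: one row of weights per category. A real layer learns these values
# during training; here they are random placeholders.
embedding_dim = 3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(categories), embedding_dim))

def embed(names):
    """Look up the embedding row for each category name."""
    ids = np.array([index_of[name] for name in names])
    return embedding_matrix[ids]

batch = embed(["Product B", "Product A", "Product B"])
```

The matrix has one row per category but only `embedding_dim` columns, which is what keeps the representation compact compared with one-hot vectors.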
Best For
Recommendation engines
NLP systems
Deep learning pipelines
Frameworks: TensorFlow (Keras) and PyTorch both provide embedding layers.
6. Use Native Categorical Support
Some modern ML libraries handle categorical features directly.
Options include CatBoost and LightGBM.
These libraries:
Reduce preprocessing complexity
Handle high-cardinality features efficiently
Often outperform manual one-hot encoding
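A common preparation step for native categorical support is converting the column to the pandas `category` dtype, which stores each distinct label once plus compact integer codes. The sketch below shows only that conversion on toy data; whether a given library accepts the dtype directly, and under which parameters, depends on the library and version, so treat that part as something to verify in its documentation.

```python
import pandas as pd

df = pd.DataFrame({
    "product_name": ["cable", "charger", "cable", "case", "charger", "cable"] * 1000,
})

memory_before = df["product_name"].memory_usage(deep=True)

# The 'category' dtype stores each distinct label once plus integer codes.
df["product_name"] = df["product_name"].astype("category")
memory_after = df["product_name"].memory_usage(deep=True)

codes = df["product_name"].cat.codes   # one small integer per row
```

Comparing `memory_before` with `memory_after` on a real column gives a quick sense of how much the conversion saves.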
Choosing the Right Strategy
| Method | Memory Efficient | Leakage Risk | Best For |
|---|---|---|---|
| One-Hot Encoding | No | Low | Low-cardinality features |
| Frequency Encoding | Yes | Low | Tree models |
| Target Encoding | Yes | High | Predictive power |
| Hash Encoding | Yes | Low | Massive datasets |
| Embeddings | Yes | Medium | Deep learning |
| Rare Grouping | Yes | Low | Data cleanup |
Practical Rule of Thumb
Use this simple decision framework:
Fewer than 20 categories → One-hot encoding
20–500 categories → Frequency or target encoding
Thousands of categories → Hashing or embeddings
Extremely sparse labels → Rare category grouping
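The thresholds above can be written down as a small helper. This is only a sketch of the rule of thumb: the function name and the `model_family` argument are assumptions, and treating everything above 500 categories as the "thousands" case is a simplification, since the rule leaves that range open. Rare category grouping is not covered here because it can be applied before any of these encodings.

```python
def suggest_encoding(n_unique: int, model_family: str = "tree") -> str:
    """Suggest an encoding from the cardinality thresholds in the rule of thumb."""
    if n_unique < 20:
        return "one-hot"
    if n_unique <= 500:
        return "frequency or target"
    # Above 500 categories: embeddings need a neural network; otherwise hash.
    return "embeddings" if model_family == "neural" else "hashing"

for column, n_unique in [("Department", 12), ("Country", 54), ("ProductID", 250_000)]:
    print(column, "->", suggest_encoding(n_unique))
```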
Final Thoughts
High-cardinality categorical features are common in modern datasets:
E-commerce transactions
Telecom logs
Banking systems
Ad-tech platforms
User analytics pipelines
Efficient encoding strategies improve:
Training speed
Memory efficiency
Model accuracy
Production scalability
The best ML engineers do not simply “encode categories.” They choose encoding methods based on:
Dataset scale
Model architecture
Leakage risk
Inference constraints
Production requirements
That distinction separates prototype notebooks from production-grade machine learning systems.