How to Detect Class Imbalance in Your Training Data

Class imbalance is one of the most common and most quietly destructive problems in machine learning.



Your model trains, your accuracy looks great, and then it completely fails in production. 

A frequent culprit is a dataset where one class dominates the others, so your model learned to cheat by predicting the majority class almost every time.

Here's how to catch it before it catches you.


What Is Class Imbalance?

Class imbalance occurs when the distribution of labels in your training data is not roughly equal. 


In a binary classification problem, a dataset where 95% of examples are labeled "not fraud" and only 5% are labeled "fraud" is severely imbalanced. 

In this case, the model quickly learns that predicting "not fraud" every single time gives it 95% accuracy — while being completely useless at the one task it was built for.
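To make that arithmetic concrete, here is a minimal sketch in plain Python. The labels are made up (950 "not fraud" encoded as 0, 50 "fraud" encoded as 1), and the "model" is a hypothetical stand-in that always predicts the majority class.

```python
# Hypothetical labels: 950 legitimate transactions (0) and 50 fraudulent ones (1).
y_true = [0] * 950 + [1] * 50

# A stand-in "model" that has learned nothing except the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class: share of actual fraud cases that were caught.
fraud_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fraud_recall = fraud_caught / sum(y_true)

print(f"Accuracy: {accuracy:.2f}")
print(f"Fraud recall: {fraud_recall:.2f}")
```

The accuracy looks excellent while the fraud recall is zero, which is exactly the gap the rest of this post teaches you to spot.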

Imbalance shows up in many domains: fraud detection, medical diagnosis, churn prediction, defect detection in manufacturing, and spam classification. 

The problem is almost universal, which makes it all the more important to detect early.


Step 1: Count Your Class Frequencies

The first and most obvious check is a simple frequency count. Before you do anything else, look at how many samples belong to each class.

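import pandas as pd

# df_raw is assumed to be a DataFrame you have already loaded;
# "high_trust_president" is its label column.
print(df_raw["high_trust_president"].value_counts())                # raw counts per class
print(df_raw["high_trust_president"].value_counts(normalize=True))  # proportions per class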




The normalized output gives you proportions. If any class is below 10–15% of your total dataset, start paying close attention. Below 5%, you almost certainly have a problem.

Rule of thumb:

  • 40/60 split → mild, usually fine
  • 20/80 split → moderate, worth addressing
  • 10/90 or worse → severe, requires intervention
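If you want the rule of thumb as a reusable check, something like the sketch below could work. The function name and the exact cutoffs are assumptions: the rule of thumb only names three reference splits, so the boundaries between them are a judgment call you should adjust for your domain.

```python
from collections import Counter

def imbalance_severity(labels):
    """Return (minority_share, severity) for a sequence of class labels.

    The cutoffs are assumptions placed at the rule-of-thumb splits
    (20/80 and 10/90); they are not a standard.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to assess imbalance")
    minority_share = min(counts.values()) / sum(counts.values())
    if minority_share <= 0.10:
        severity = "severe"
    elif minority_share <= 0.20:
        severity = "moderate"
    else:
        severity = "mild or balanced"
    return minority_share, severity

# Hypothetical labels with a 20/80 split.
labels = ["no"] * 80 + ["yes"] * 20
share, severity = imbalance_severity(labels)
print(f"Minority share: {share:.0%} -> {severity}")
```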


Step 2: Visualize the Distribution

Numbers alone can obscure the severity of imbalance. A bar chart of class frequencies makes the skew immediately obvious — and is far more persuasive when you need to explain the problem to a stakeholder.

import matplotlib.pyplot as plt

df_raw["high_trust_president"].value_counts().plot(kind="bar", color=["#378ADD", "#E24B4A"])
plt.title("Class Distribution")
plt.xlabel("Class")
plt.ylabel("Count")
plt.tight_layout()
plt.show()



For multiclass problems, the same logic applies. Look for any class with a bar that barely registers on the chart — that class will likely be ignored by your model.


Step 3: Check the Imbalance Ratio

The imbalance ratio (IR) formalizes the degree of skew:

IR = (count of majority class) / (count of minority class)


An IR of 1 means perfect balance. An IR of 10 means the majority class is 10 times larger. An IR of 100 is extreme and typically demands aggressive intervention.


This single number is useful for documentation and for choosing the right correction strategy.
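A minimal sketch of the calculation, using only the standard library. The helper name is hypothetical, and for multiclass data it compares the largest class to the smallest, which is one common convention rather than the only one.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())

# Hypothetical labels: 950 majority examples and 50 minority examples.
labels = ["not fraud"] * 950 + ["fraud"] * 50
print(f"IR = {imbalance_ratio(labels):.1f}")
```

With these made-up counts the ratio is 950 / 50 = 19, so the 95/5 split from earlier corresponds to an IR of 19.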


Step 4: Look at Your Model's Confusion Matrix — Not Just Accuracy

If you've already trained a model, the confusion matrix is your most revealing diagnostic tool. 

A model suffering from class imbalance will typically show one of two failure patterns:

  • High false negatives: The model misses almost all minority class examples (common in fraud or disease detection).
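  • Collapsed predictions: The model predicts only one class, virtually never predicting the minority.

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# y_test and y_pred are assumed to come from a model you have already trained
cm = confusion_matrix(y_test, y_pred)  # rows = true labels, columns = predicted labels
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()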



A dead giveaway: if one entire column of your confusion matrix is near-zero (scikit-learn puts true labels on rows and predicted labels on columns), your model has essentially stopped predicting that class.
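The check can also be automated. The sketch below builds a small confusion matrix by hand, with true labels on rows and predicted labels on columns as scikit-learn lays it out, so a class the model never predicts appears as an all-zero column. Both helper names and the 1% cutoff are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, classes):
    """Build a confusion matrix as nested lists: rows = true, columns = predicted."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def rarely_predicted(matrix, classes, min_share=0.01):
    """Return classes whose column total is below min_share of all predictions."""
    total = sum(sum(row) for row in matrix)
    flagged = []
    for j, c in enumerate(classes):
        column_total = sum(row[j] for row in matrix)
        if column_total < min_share * total:
            flagged.append(c)
    return flagged

# Hypothetical test set: 95 negatives, 5 positives, and a model that always says 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
cm = confusion_counts(y_true, y_pred, classes=[0, 1])
print(cm)
print("Rarely predicted:", rarely_predicted(cm, classes=[0, 1]))
```

Here the second column sums to zero, so class 1 is flagged: the model never predicts it even though five true examples exist.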


Step 5: Evaluate with the Right Metrics

Accuracy is blind to class imbalance. Instead, rely on metrics that actually surface minority-class performance:

Precision = TP / (TP + FP)
Answers: of all the times the model predicted positive, how often was it right? (TP and FP count true and false positives, with the minority class treated as positive.)

Recall = TP / (TP + FN)
Answers: of all the actual positives, how many did the model catch? (FN counts false negatives.)

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Balances precision and recall into a single number.

ROC-AUC
Measures the model's ability to rank positives above negatives across all thresholds. A score near 0.5 means the model separates the classes no better than chance.

Matthews Correlation Coefficient (MCC)
Often the most reliable single metric for imbalanced binary classification. Ranges from -1 to +1, with +1 being perfect. Unlike F1, it accounts for all four cells of the confusion matrix.

from sklearn.metrics import classification_report, matthews_corrcoef

print(classification_report(y_test, y_pred))
print(f"MCC: {matthews_corrcoef(y_test, y_pred):.3f}")



If your model has 95% accuracy but an F1 of 0.12 on the minority class, you have a class imbalance problem.
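To see how accuracy and the other metrics can diverge, here is a hand computation from the four confusion-matrix cells. The counts are made up, the function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
from math import sqrt

def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1, and MCC from confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Fall back to 0.0 when a denominator is zero (an assumption of this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical results on 1,000 test examples, 50 of them positive.
metrics = binary_metrics(tp=5, fp=5, fn=45, tn=945)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.95 while recall is only 0.10, and F1 and MCC both stay low. That is the same pattern the classification report surfaces on a real model.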


Step 6: Use imbalanced-learn to Quantify and Compare

The imbalanced-learn library (built on scikit-learn) is a widely used toolkit for this problem. Before applying any corrections, it helps to rehearse your audit on synthetic data with a known skew:

from collections import Counter
from sklearn.datasets import make_classification

# Synthetic binary dataset with a roughly 95/5 class split
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("Class distribution:", Counter(y))



Once you've confirmed imbalance, imbalanced-learn gives you tools to address it: SMOTE for oversampling, RandomUnderSampler for undersampling, and pipeline integrations that keep your cross-validation honest.


Step 7: Check for Imbalance Across Data Splits

A subtle but important mistake: your overall class distribution might look acceptable, but your train/validation/test splits may have drifted. Always use stratified splitting to preserve class proportions across every split.


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Train distribution:", Counter(y_train))
print("Test distribution:", Counter(y_test))



Without stratify=y, a small minority class can vanish entirely from your test set — meaning you have no way to evaluate model performance on the class that matters most.
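If you inherit splits that someone else made, you can compare class shares directly. The sketch below is one way to do it in plain Python; the helper names and the 2-percentage-point tolerance are assumptions, not an established threshold.

```python
from collections import Counter

def class_shares(labels):
    """Map each class to its share of the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def split_drift(y_train, y_test, tolerance=0.02):
    """Return classes whose train and test shares differ by more than tolerance.

    A class missing from one split counts as a share of 0 there.
    """
    train, test = class_shares(y_train), class_shares(y_test)
    drifted = {}
    for cls in set(train) | set(test):
        gap = abs(train.get(cls, 0.0) - test.get(cls, 0.0))
        if gap > tolerance:
            drifted[cls] = gap
    return drifted

# Hypothetical unstratified split where the minority class never reached the test set.
y_train = [0] * 760 + [1] * 40
y_test = [0] * 200
print(split_drift(y_train, y_test))
```

An empty dictionary means the splits agree within the tolerance. Any entry is a class whose proportion shifted between train and test, including classes that disappeared from one side.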


A Detection Checklist

Before training any classification model, run through these checks:

  • [ ] Counted class frequencies and computed proportions
  • [ ] Visualized the class distribution
  • [ ] Calculated the imbalance ratio
  • [ ] Verified that train/validation/test splits are stratified
  • [ ] Confirmed that evaluation metrics go beyond accuracy (F1, AUC, MCC)
  • [ ] Inspected the confusion matrix for collapsed predictions


What Comes Next

Detecting imbalance is the first battle. Fixing it is the second. The main strategies — roughly ordered from simplest to most complex — are:

  • Class weighting: pass class_weight='balanced' to scikit-learn estimators that support it, to penalize errors on the minority class more heavily
  • Oversampling: generate synthetic minority examples with SMOTE
  • Undersampling: reduce the majority class to restore balance
  • Threshold tuning: adjust the decision threshold to trade precision for recall
  • Ensemble methods: BalancedRandomForestClassifier and EasyEnsembleClassifier from imbalanced-learn

Each has trade-offs, and the right choice depends on your domain, your data size, and the relative cost of false positives versus false negatives. 


But none of those decisions matter if you haven't first confirmed that imbalance is actually present — and understood how severe it is.
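As a small illustration of how detection feeds into the first strategy above, the sketch below reproduces the heuristic that scikit-learn documents for class_weight='balanced', namely n_samples / (n_classes * class_count), in plain Python. The function name is hypothetical and the labels are made up.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical labels with a 95/5 split.
labels = [0] * 950 + [1] * 50
weights = balanced_class_weights(labels)
print(weights)
```

The rarer the class, the larger its weight: here the minority class gets a weight of 10 against roughly 0.53 for the majority, which is why knowing the class counts comes first.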

