How to Use SMOTE to Handle Imbalanced African Survey Data

Survey data collected across African contexts (household welfare assessments, health outcome studies, agricultural censuses, financial inclusion surveys) almost always arrives imbalanced.

The households that experienced food insecurity; the smallholders who adopted a new crop variety; the women who accessed formal credit: these are the groups your model most needs to understand, and they are almost always the minority class.

Standard oversampling (duplicating minority rows) tends to overfit, because the model sees the same minority rows repeatedly.

Undersampling (discarding majority rows) wastes hard-won field data.

SMOTE — Synthetic Minority Over-sampling Technique — offers a smarter path: it generates new, synthetic minority examples by interpolating between real ones.

Used carefully and with an understanding of your survey's structure, it can substantially improve model performance on the people and outcomes that matter most.

Understanding SMOTE Before Applying It

SMOTE was introduced by Chawla et al. (2002) and works by:

1. Selecting a minority class sample
2. Finding its k nearest neighbors (among minority samples)
3. Drawing a random point along the line segment between the original sample and one of its neighbors
4. Adding that synthetic point to the training set

The result is new examples that are plausible interpolations of real data — not duplicates, not noise.
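
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the interpolation idea, not the imblearn implementation; the function name smote_sketch and the toy data are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, seed=42):
    """Generate synthetic minority rows by interpolating toward nearest neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)

    # Pairwise Euclidean distances among minority samples only
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors of each row

    base = rng.integers(0, n, size=n_synthetic)                     # pick minority samples
    chosen = neighbors[base, rng.integers(0, k, size=n_synthetic)]  # pick one neighbor each
    gap = rng.random((n_synthetic, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])        # synthetic points

# Toy minority class: 20 rows, 3 continuous features (illustrative only)
X_min = np.random.default_rng(0).normal(size=(20, 3))
X_syn = smote_sketch(X_min, n_synthetic=10, k=5)
print(X_syn.shape)
```

Because every synthetic row lies on a segment between two real rows, it can never fall outside the range of the real minority data. In practice the library call below does the same job with more care: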

from imblearn.over_sampling import SMOTE

sm = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

Simple enough. But African survey data has structural features that make naive SMOTE application dangerous.

The Structural Challenges of African Survey Data

Before running a single line of code, understand what makes your dataset different from a generic tabular benchmark.

Stratified multi-stage sampling. Surveys like LSMS-ISA (Living Standards Measurement Study — Integrated Surveys on Agriculture), DHS (Demographic and Health Surveys), and FinScope use complex sampling designs. Households are not drawn with equal probability. Some are upweighted; some are downweighted. SMOTE is blind to survey weights — it treats every row as equally real.

Mixed variable types. African survey datasets typically combine continuous variables (land area cultivated, monthly expenditure, livestock value), ordinal variables (education level, asset index quintile), binary indicators (owns mobile phone, has bank account), and nominal categoricals (region, crop type, household head occupation). Standard Euclidean-distance SMOTE was designed for continuous features. Interpolating between "maize" and "cassava" produces nonsense.

Clustered structure. Observations are nested within villages, within enumeration areas, within districts. Generating synthetic samples that ignore this clustering can leak geographic information across clusters, inflating performance estimates.

High missingness. Survey data often arrives with substantial missing values — sometimes structurally (a question only asked in rural areas) and sometimes randomly (enumerator error, refusal). SMOTE requires complete feature matrices. Imputing first, then applying SMOTE, in the wrong order can compound errors.

Small absolute minority counts. In a nationally representative survey of 5,000 households where 4% experienced acute food insecurity, you have roughly 200 minority examples. SMOTE with k=5 works, but the synthetic space is tightly constrained — you cannot generate wildly diverse examples from a small seed set.

Step 1: Prepare Your Data Correctly

Handle missingness before SMOTE, after splitting

Always split your data into train and test sets before doing anything else. Imputation and resampling must only see training data. The test set is untouched.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Use df_raw as the base DataFrame for current processing
df_current = df_raw.copy()

# Define features and target using existing feature list
features = [
    "Q1_AGE_GROUP",
    "Q4_EDUCATION",
    "Q5_EMPLOYMENT",
    "Q6_RESIDENCE",
    "Q19_ELECTRICITY_ACCESS",
    "Q7_ECON_CONDITION" # Q7_ECON_CONDITION is a categorical feature with NaNs
]
X = df_current[features]
y = df_current["high_trust_president"]

# Perform train-test split before imputation and encoding
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Impute missing values for categorical features using 'most_frequent' strategy
imputer = SimpleImputer(strategy="most_frequent")

# Fit on training data and transform both train and test sets
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns, index=X_test.index)

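# Now, one-hot encode the imputed categorical features
X_train_encoded = pd.get_dummies(X_train_imp, drop_first=True)
# Align test columns to the training columns, so a category that is
# missing from one split does not change the feature set
X_test_encoded = pd.get_dummies(X_test_imp, drop_first=True).reindex(
    columns=X_train_encoded.columns, fill_value=0
)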

# Update X_train and X_test for subsequent steps in the notebook
X_train = X_train_encoded
X_test = X_test_encoded

# Print shapes to confirm
print("Shape of X_train after imputation and encoding:", X_train.shape)
print("Shape of X_test after imputation and encoding:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Never fit the imputer on the full dataset — that leaks test statistics into training.

Encode categoricals appropriately

SMOTE cannot interpolate between category labels. You must encode them before passing to SMOTE, but choose your encoding carefully:

Binary indicators (owns mobile phone: yes/no): encode as 0/1. SMOTE interpolation produces values between 0 and 1 — re-round after resampling.

Ordinal variables (education: none/primary/secondary/tertiary): encode as integers 0–3. Interpolation is meaningful here.

Nominal categoricals (region, crop type): one-hot encode, or use SMOTE-NC (see below) which handles them natively.
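
The re-rounding mentioned for binary indicators could look like the following once resampling has been run. This is a small sketch under the assumption that binary and ordinal columns are stored as numeric codes; the helper name snap_encoded_columns and the example rows are hypothetical.

```python
import numpy as np

def snap_encoded_columns(X_res, binary_cols, ordinal_cols):
    """Round interpolated binary and ordinal columns back to valid integer codes."""
    X_out = np.array(X_res, dtype=float, copy=True)
    cols = list(binary_cols) + list(ordinal_cols)
    X_out[:, cols] = np.rint(X_out[:, cols])
    # Binary indicators must end up as exactly 0 or 1
    X_out[:, binary_cols] = np.clip(X_out[:, binary_cols], 0, 1)
    return X_out

# Hypothetical columns: owns_phone (0/1), education (0-3), monthly_expenditure
X_res = np.array([
    [1.00, 2.00, 150.0],   # real row
    [0.37, 1.62,  98.4],   # synthetic row with fractional codes
    [0.81, 2.49, 120.3],   # synthetic row with fractional codes
])
X_clean = snap_encoded_columns(X_res, binary_cols=[0], ordinal_cols=[1])
print(X_clean)
```

Continuous columns are left untouched; only the coded columns are snapped back to values a real respondent could have given.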

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
import pandas as pd

# All features from the df_raw simulation are categorical.
# Define columns for OrdinalEncoder and OneHotEncoder
ordinal_features_names = ["Q1_AGE_GROUP", "Q4_EDUCATION", "Q7_ECON_CONDITION"]
nominal_features_names = ["Q5_EMPLOYMENT", "Q6_RESIDENCE", "Q19_ELECTRICITY_ACCESS"]

# These lists (age_groups, education_options, condition_scale) are available in the kernel's global scope.
# SimpleImputer with 'most_frequent' strategy would have replaced 'Refused' and NaN values,
# so the original ordered lists should correctly represent the categories after imputation.

preprocessor = ColumnTransformer(
    transformers=[
        ('ord', OrdinalEncoder(categories=[
            age_groups,
            education_options,
            condition_scale
        ]), ordinal_features_names),
        ('nom', OneHotEncoder(handle_unknown='ignore', drop='first', sparse_output=False), nominal_features_names)
    ],
    remainder='drop' # Drop any columns not specified in transformers
)

# Apply the ColumnTransformer to the imputed dataframes
X_train_enc = preprocessor.fit_transform(X_train_imp)
X_test_enc = preprocessor.transform(X_test_imp)

# Construct column names for the transformed DataFrame
output_column_names = ordinal_features_names.copy()
ohe_feature_names = preprocessor.named_transformers_['nom'].get_feature_names_out(nominal_features_names)
output_column_names.extend(ohe_feature_names)

# Convert the transformed arrays back to DataFrames with meaningful column names
X_train = pd.DataFrame(X_train_enc, columns=output_column_names, index=X_train_imp.index)
X_test = pd.DataFrame(X_test_enc, columns=output_column_names, index=X_test_imp.index)

print("Shape of X_train after ColumnTransformer:", X_train.shape)
print("Shape of X_test after ColumnTransformer:", X_test.shape)



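This example uses simulated survey data; the ordered category lists age_groups, education_options, and condition_scale are assumed to be defined earlier in the notebook.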

Step 2: Choose the Right SMOTE Variant

Standard SMOTE assumes all features are continuous. For African survey data — with its mix of types — you almost certainly want a variant.

SMOTE-NC (Nominal and Continuous)

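The most practical choice for mixed survey data. It handles continuous and categorical features simultaneously: distances are computed on the continuous features, with a penalty based on the median of their standard deviations added for each categorical feature that differs, and each synthetic sample takes the most frequent category among its nearest neighbors.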

from imblearn.over_sampling import SMOTENC

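# The first `len(ordinal_features_names)` columns are ordinal-encoded (numerical for SMOTENC)
# The remaining columns are one-hot encoded (categorical for SMOTENC)
# Note: SMOTENC then votes on each dummy column separately; passing the
# un-encoded nominal columns as categorical features is an alternative worth considering.
categorical_feature_indices = list(range(len(ordinal_features_names), X_train_enc.shape[1]))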

sm = SMOTENC(
    categorical_features=categorical_feature_indices,
    k_neighbors=5,
    random_state=42
)

X_res, y_res = sm.fit_resample(X_train_enc, y_train)

ADASYN (Adaptive Synthetic Sampling)

ADASYN generates more synthetic samples in regions of feature space where the minority class is harder to learn.

It's most useful when your minority class has high internal variance — for example, food insecurity driven by very different underlying causes (drought in one region, conflict in another, market failure in a third).

from imblearn.over_sampling import ADASYN

ada = ADASYN(n_neighbors=5, random_state=42)
X_res, y_res = ada.fit_resample(X_train_enc, y_train)

BorderlineSMOTE

Focuses synthesis on minority examples near the decision boundary — the ambiguous cases. It is worth trying when you have a clean majority cluster but a diffuse minority.

from imblearn.over_sampling import BorderlineSMOTE

bsmote = BorderlineSMOTE(k_neighbors=5, random_state=42)
X_res, y_res = bsmote.fit_resample(X_train_enc, y_train)

Step 3: Apply SMOTE Inside Cross-Validation

This is the most commonly violated rule. Applying SMOTE before cross-validation leaks synthetic data into validation folds, producing optimistically biased performance estimates. Always apply SMOTE inside each fold.

The imblearn library provides Pipeline for exactly this:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTENC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

pipeline = Pipeline([
    ("smote", SMOTENC(
        categorical_features=categorical_feature_indices,
        k_neighbors=5,
        random_state=42
    )),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(
    pipeline, X_train_enc, y_train,
    cv=cv,
    scoring=["f1", "roc_auc", "average_precision"],
    return_train_score=True
)

print("F1:", results["test_f1"].mean().round(3))
print("ROC-AUC:", results["test_roc_auc"].mean().round(3))
print("Avg Precision:", results["test_average_precision"].mean().round(3))

Using imblearn.pipeline.Pipeline (not sklearn.pipeline.Pipeline) ensures SMOTE is called during fit only, never during transform or prediction.

Step 4: Account for Survey Weights

If your data comes with sampling weights — as DHS, LSMS, and FinScope data do — SMOTE alone is insufficient. Synthetic samples inherit no weight. You have two practical options:

Option A: Weight the classifier, not the data. Pass sampling weights to the classifier directly. Most scikit-learn classifiers accept sample_weight in fit(). This respects the survey design without touching the sample composition.

from sklearn.ensemble import GradientBoostingClassifier

# Assumes df_current carries a survey weight column named "sampling_weight"
weights_train = df_current.loc[X_train.index, "sampling_weight"].values

clf = GradientBoostingClassifier(n_estimators=200, random_state=42)
clf.fit(X_train_enc, y_train, sample_weight=weights_train)

Option B: Apply SMOTE, then assign synthetic samples a weight of 1. After resampling, the original rows retain their survey weights and the synthetic rows receive weight 1. A weight of 1 only represents an average observation if the survey weights have first been rescaled to a mean of 1, so normalize them before combining. This is a reasonable approximation for large surveys.
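
Option B might be sketched as follows. It assumes the resampler returns the original rows first, in their original order, with synthetic rows appended after them; this is worth verifying on your own data (for example by comparing the first rows of the resampled matrix with the training matrix). The helper name and the example weights are hypothetical.

```python
import numpy as np

def weights_after_resampling(original_weights, n_resampled):
    """Rescaled survey weights for real rows, weight 1 for appended synthetic rows."""
    w = np.asarray(original_weights, dtype=float)
    n_synthetic = n_resampled - len(w)
    if n_synthetic < 0:
        raise ValueError("resampled data has fewer rows than the original")
    w_scaled = w / w.mean()  # rescale so that 1.0 really is the average weight
    # Assumption: synthetic rows come after all original rows
    return np.concatenate([w_scaled, np.ones(n_synthetic)])

# Hypothetical survey weights for 4 real training rows, resampled to 7 rows
weights_train = np.array([0.5, 2.0, 1.5, 4.0])
weights_res = weights_after_resampling(weights_train, n_resampled=7)
print(weights_res)
```

The resulting array can then be passed as sample_weight when fitting the classifier on the resampled data.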

Step 5: Tune the Sampling Strategy

By default, SMOTE balances classes to 1:1. This is often too aggressive and can introduce too many synthetic samples relative to the real minority. The sampling_strategy parameter lets you control the final ratio.

sm = SMOTENC(
    categorical_features=categorical_feature_indices,
    sampling_strategy=0.3,
    k_neighbors=5,
    random_state=42
)

Here, sampling_strategy=0.3 means the minority class will be resampled to 30% of the majority class size — a moderate correction rather than full balancing. Experiment with values between 0.2 and 1.0 and evaluate using F1 and average precision, not accuracy.
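
To see what a given ratio implies, the arithmetic can be worked through directly. The sketch below takes an 80% training split of the 5,000-household example (3,840 majority and 160 minority rows) and is an approximation: the library's own rounding may differ by a row.

```python
def synthetic_rows_needed(n_majority, n_minority, sampling_strategy):
    """Approximate minority target and number of synthetic rows for a given ratio."""
    target_minority = round(sampling_strategy * n_majority)
    return target_minority, max(target_minority - n_minority, 0)

n_majority, n_minority = 3840, 160  # 80% of 5,000 households, 4% positive

for ratio in (0.2, 0.3, 0.5, 1.0):
    target, synthetic = synthetic_rows_needed(n_majority, n_minority, ratio)
    share = synthetic / target
    print(f"ratio={ratio:.1f}  minority after={target}  synthetic={synthetic}  synthetic share={share:.0%}")
```

Even at 0.3, most of the minority class the model trains on is synthetic, which is a good reason to start with modest ratios.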

Step 6: Validate on Real Data, Never Synthetic

Your test set must contain only real survey observations — no synthetic rows. Evaluate all final metrics on this held-out set.

from sklearn.metrics import classification_report, average_precision_score

pipeline.fit(X_train_enc, y_train)
y_pred = pipeline.predict(X_test_enc)
y_prob = pipeline.predict_proba(X_test_enc)[:, 1]

print(classification_report(y_test, y_pred))
print(f"Average Precision: {average_precision_score(y_test, y_prob):.3f}")

For survey data, average precision (the area under the precision-recall curve) is often more informative than ROC-AUC, especially when the minority class prevalence is very low.

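ROC-AUC can look optimistic in severely imbalanced settings, because a small false positive rate over a large majority class still means many false positives; average precision penalizes those directly.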
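
A short calculation illustrates the point. Hold a classifier's operating point fixed (the same point on the ROC curve) and vary only the prevalence of the minority class; the true and false positive rates used here are made-up values for illustration.

```python
def precision_at(tpr, fpr, prevalence):
    """Precision implied by a fixed ROC operating point at a given class prevalence."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same operating point (80% recall, 10% false positive rate), different prevalence
for prevalence in (0.50, 0.07, 0.01):
    p = precision_at(tpr=0.8, fpr=0.1, prevalence=prevalence)
    print(f"prevalence={prevalence:.2f}  precision={p:.3f}")
```

The ROC curve is identical in all three cases, yet at 7% prevalence only about 38% of flagged households are true positives, and at 1% fewer than 8% are.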

Real-World Example: Financial Inclusion in East Africa

Consider a FinScope-style survey across Kenya, Tanzania, and Uganda with the following structure:

8,000 households total
Target: formal_credit_access (1 = accessed formal credit in past 12 months)
Class distribution: 7% positive (560 households), 93% negative (7,440 households)
Features: income, mobile money usage, land ownership, education, distance to bank, region, primary livelihood

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTENC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import average_precision_score

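# Illustrative column positions of the categorical features in the encoded
# matrix; adjust to your own column order. X_train_enc and y_train are assumed
# to hold the encoded training split of this survey.
categorical_idx = [5, 6, 7, 8, 9]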

pipeline = Pipeline([
    ("smote", SMOTENC(
        categorical_features=categorical_idx,
        sampling_strategy=0.25,
        k_neighbors=5,
        random_state=42
    )),
    ("clf", RandomForestClassifier(
        n_estimators=300,
        class_weight="balanced",
        random_state=42
    ))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    pipeline, X_train_enc, y_train,
    cv=cv,
    scoring=["f1", "average_precision"],
)

print(f"F1 (CV): {scores['test_f1'].mean():.3f} ± {scores['test_f1'].std():.3f}")
print(f"Avg Precision (CV): {scores['test_average_precision'].mean():.3f}")

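Note the use of class_weight="balanced" on the RandomForest alongside SMOTE. The two approaches are complementary: SMOTE reduces the imbalance in the data, and because sampling_strategy=0.25 still leaves the classes unequal, class weighting compensates for the remainder inside the classifier. Combining both can outperform either alone on severely imbalanced survey data, but compare all three configurations in cross-validation rather than assuming it.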

Common Mistakes to Avoid

1. Applying SMOTE to the full dataset before splitting. This allows synthetic samples — derived from test set observations — to appear in training. Your validation metrics become meaningless.

2. Using standard SMOTE on categorical features. Interpolating between "Nairobi" and "Kisumu" to get a synthetic household that is somehow 60% Nairobi and 40% Kisumu is statistically incoherent. Use SMOTE-NC.

3. Ignoring k_neighbors relative to minority class size. If you have 200 minority examples and set k_neighbors=5, each synthetic sample is interpolated between a real example and one of its 5 nearest minority neighbors. That's fine. If you have 30 minority examples and set k_neighbors=5, you're generating a synthetic space from a very narrow seed, and inside 5-fold cross-validation each training fold holds only about 24 of them. Reduce k to 3 and interpret results cautiously.

4. Evaluating on accuracy. A model that predicts "no formal credit access" for everyone achieves 93% accuracy. It is useless. Never report accuracy as a primary metric on imbalanced survey data.

5. Forgetting to stratify splits. With a 7% minority class, a 20% test split of 8,000 rows should give you about 112 minority test examples. A non-stratified split lets that number drift by chance, and with smaller surveys or rarer outcomes it can leave too few minority examples to evaluate on. Always use stratify=y.
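
Mistake 4 is easy to verify with the numbers from the financial inclusion example. The sketch below builds the label vector for 8,000 households with 560 positives and scores a "model" that always predicts the majority class.

```python
n_total, n_positive = 8000, 560

y_true = [1] * n_positive + [0] * (n_total - n_positive)
y_pred = [0] * n_total  # always predict "no formal credit access"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_total
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / n_positive

print(f"accuracy={accuracy:.2f}  recall={recall:.2f}")
```

Accuracy comes out at 0.93 while recall on the minority class is 0.00: the model never identifies a single household with formal credit access.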

When SMOTE Is Not Enough

SMOTE improves model performance on imbalanced data, but it is not a substitute for better data collection. If your minority class has fewer than 50 real examples, SMOTE's synthetic space is too constrained to be reliable. In these cases:

Consider targeted oversampling at the data collection stage (oversample high-risk strata intentionally)
Use Bayesian approaches that can incorporate informative priors
Combine SMOTE with cost-sensitive learning and threshold optimization
Consult survey methodologists about whether the sampling design itself can be adjusted in future waves
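
Threshold optimization, mentioned in the list above, can be as simple as scanning candidate cut-offs on real validation data and keeping the one with the best F1. The function name and the toy probabilities below are hypothetical; the threshold should be chosen on validation folds, never on the final test set.

```python
import numpy as np

def best_f1_threshold(y_true, y_prob, thresholds=None):
    """Return the (threshold, F1) pair with the highest F1 on a validation set."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)  # 0.05, 0.10, ..., 0.95

    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        y_pred = y_prob >= t
        tp = np.sum(y_pred & (y_true == 1))
        fp = np.sum(y_pred & (y_true == 0))
        fn = np.sum(~y_pred & (y_true == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
        if f1 > best_f1:  # keep the first threshold that reaches the best F1
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Toy validation set: 3 positives out of 10, with predicted probabilities
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.03, 0.08, 0.12, 0.18, 0.22, 0.31, 0.37, 0.41, 0.47, 0.72]

best_t, best_f1 = best_f1_threshold(y_val, p_val)
print(f"best threshold={best_t:.2f}  F1={best_f1:.3f}")
```

On this toy data the default cut-off of 0.5 finds only one of the three positives, while a lower threshold recovers all three at the cost of one false positive.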

SMOTE is a tool. Like all tools, it works best in the hands of someone who understands both the instrument and the material.

Summary

Step	What to do
Split first	Train/test split with `stratify=y` before any resampling
Impute	Fit imputer on train only, transform both
Encode	Use SMOTE-NC for mixed variable types
Pipeline	Wrap SMOTE + classifier in `imblearn.pipeline.Pipeline`
Cross-validate	Apply SMOTE inside each fold, never outside
Survey weights	Pass to classifier as `sample_weight`, or retain for real rows
Evaluate	Use F1, average precision, MCC — never accuracy alone
Test set	Real data only — no synthetic rows

African survey data is hard-won. Every observation represents a household that a field team visited, a question that was answered, a life that was documented.

SMOTE helps your model listen more carefully to the voices that raw counts would otherwise drown out.
