How to Decide Whether to Drop or Fill Missing Value

Learn how to decide whether to drop or fill missing values in datasets using practical, production-ready strategies. Understand trade-offs, bias risks, and data engineering best practices.



Missing values are not just a data cleaning nuisance—they’re a signal.


The decision to drop or fill them has direct consequences on model performance, statistical validity, and downstream business decisions.


If you treat missing data casually, you risk introducing bias, noise, or false confidence into your analysis.


This guide gives you a structured, decision-oriented framework you can use in real-world data workflows.


1. First Principle: Understand Why Data Is Missing

Before deciding what to do, you need to understand why it’s missing. In data science, this is formalized into three categories:

1. Missing Completely at Random (MCAR)

  • Missingness has no relationship to any variable.

  • Example: A sensor randomly fails.

Implication:
Safe to drop rows without introducing bias.


2. Missing at Random (MAR)

  • Missingness depends on other observed variables.

  • Example: Income is missing more often for younger respondents.

Implication:
Filling is usually better, but must account for relationships in data.


3. Missing Not at Random (MNAR)

  • Missingness depends on the value itself.

  • Example: High earners choose not to disclose income.

Implication:
Dropping or naive filling introduces bias. You need more advanced handling.


2. When You Should Drop Missing Values

Dropping is the simplest approach—but only correct under specific conditions.

✔️ Drop Rows When:

  • Missingness is very small (< 5% of dataset)

  • Data is MCAR

  • Dataset is large enough to absorb the loss

  • The missing column is not critical

df = df.dropna()


✔️ Drop Columns When:

  • Column has too many missing values (> 40–60%)

  • The feature has low predictive importance

  • Imputation would be mostly guesswork

df = df.drop(columns=['unreliable_column'])


❌ Avoid Dropping When:

  • Dataset is small

  • Missingness is systematic (MAR or MNAR)

  • You risk losing key patterns

Dropping data reduces variance—but can increase bias.


3. When You Should Fill (Impute) Missing Values

Filling is about preserving information, but it must be done carefully.

✔️ Fill Numerical Data When:

  • The column is important

  • Missingness is moderate

  • Distribution matters


Common Strategies:

Mean (sensitive to outliers):

df['age'].fillna(df['age'].mean(), inplace=True)

Median (robust):

df['income'].fillna(df['income'].median(), inplace=True)

Mode (for categorical):

df['city'].fillna(df['city'].mode()[0], inplace=True)


✔️ Advanced Imputation Methods

For production-grade systems, basic filling is often insufficient:

1. Forward/Backward Fill (Time Series)

df.fillna(method='ffill', inplace=True)

2. K-Nearest Neighbors (KNN)

  • Uses similarity between rows

  • Preserves relationships

3. Model-Based Imputation

  • Predict missing values using ML models

Example:

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df)


4. The Trade-Off: Bias vs Variance

This is where most people go wrong.

DecisionEffect
Dropping data                            Reduces variance, risks losing signal
Filling data                            Preserves data, risks introducing bias

Rule of Thumb:

  • High missingness + weak feature → Drop

  • Low/moderate missingness + strong feature → Fill


5. Add a Missing Indicator (Often Overlooked)

Sometimes, the fact that data is missing is informative.

Instead of just filling, add a flag:

df['income_missing'] = df['income'].isnull().astype(int)

This is especially useful in:

  • Credit scoring

  • Healthcare data

  • Behavioral datasets


6. Decision Framework You Can Apply Immediately

Use this checklist:

  1. How much data is missing?

    • <5% → Drop or simple fill

    • 5–30% → Careful imputation

    • 40% → Consider dropping column

  2. Why is it missing?

    • MCAR → Drop is fine

    • MAR → Impute intelligently

    • MNAR → Investigate deeper

  3. How important is the feature?

    • Low importance → Drop

    • High importance → Impute

  4. What is your dataset size?

    • Large → Dropping is safer

    • Small → Prefer imputation


7. Production Perspective (What Actually Happens in Industry)

In real systems (especially in AWS-based pipelines), teams often:

  • Use simple imputation in ETL pipelines (fast, scalable)

  • Add missing indicators for ML models

  • Apply advanced imputation only for high-impact features

  • Track missingness as a data quality metric

For example:

  • AWS Glue handles null transformations

  • Feature engineering pipelines preserve missing signals

  • Models are trained to handle imperfect data—not perfect datasets


Final Takeaway

There is no universal rule—but there is a disciplined approach:

  • Drop when data is negligible or unreliable

  • Fill when the feature matters and patterns exist

  • Always consider the cause of missingness

  • Treat missing data as information—not just a problem


If you approach missing values strategically, you move from data cleaning to data engineering.


Advance Your Career With 16 Python Projects in Data & ML — All for $288.


Comments

Popular posts from this blog

How to Filter Rows Using Boolean Indexing in Pandas (Afrobarometer Kenya Dataset)

How to create your first line chart with World Bank Kenya GDP data