How to Decide Whether to Drop or Fill Missing Value

Learn how to decide whether to drop or fill missing values in datasets using practical, production-ready strategies. Understand trade-offs, bias risks, and data engineering best practices.

Missing values are not just a data cleaning nuisance—they’re a signal.

The decision to drop or fill them has direct consequences on model performance, statistical validity, and downstream business decisions.

If you treat missing data casually, you risk introducing bias, noise, or false confidence into your analysis.

This guide gives you a structured, decision-oriented framework you can use in real-world data workflows.

1. First Principle: Understand Why Data Is Missing

Before deciding what to do, you need to understand why it’s missing. In data science, this is formalized into three categories:

1. Missing Completely at Random (MCAR)

Missingness has no relationship to any variable.
Example: A sensor randomly fails.

Implication:
Safe to drop rows without introducing bias.

2. Missing at Random (MAR)

Missingness depends on other observed variables.
Example: Income is missing more often for younger respondents.

Implication:
Filling is usually better, but must account for relationships in data.

3. Missing Not at Random (MNAR)

Missingness depends on the value itself.
Example: High earners choose not to disclose income.

Implication:
Dropping or naive filling introduces bias. You need more advanced handling.

2. When You Should Drop Missing Values

Dropping is the simplest approach—but only correct under specific conditions.

✔️ Drop Rows When:

Missingness is very small (< 5% of dataset)
Data is MCAR
Dataset is large enough to absorb the loss
The missing column is not critical

df = df.dropna()

✔️ Drop Columns When:

Column has too many missing values (> 40–60%)
The feature has low predictive importance
Imputation would be mostly guesswork

df = df.drop(columns=['unreliable_column'])

❌ Avoid Dropping When:

Dataset is small
Missingness is systematic (MAR or MNAR)
You risk losing key patterns

Dropping data reduces variance—but can increase bias.

3. When You Should Fill (Impute) Missing Values

Filling is about preserving information, but it must be done carefully.

✔️ Fill Numerical Data When:

The column is important
Missingness is moderate
Distribution matters

Common Strategies:

Mean (sensitive to outliers):

df['age'].fillna(df['age'].mean(), inplace=True)

Median (robust):

df['income'].fillna(df['income'].median(), inplace=True)

Mode (for categorical):

df['city'].fillna(df['city'].mode()[0], inplace=True)

✔️ Advanced Imputation Methods

For production-grade systems, basic filling is often insufficient:

1. Forward/Backward Fill (Time Series)

df.fillna(method='ffill', inplace=True)

2. K-Nearest Neighbors (KNN)

Uses similarity between rows
Preserves relationships

3. Model-Based Imputation

Predict missing values using ML models

Example:

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df_imputed = imputer.fit_transform(df)

4. The Trade-Off: Bias vs Variance

This is where most people go wrong.

Decision	Effect
Dropping data	Reduces variance, risks losing signal
Filling data	Preserves data, risks introducing bias

Rule of Thumb:

High missingness + weak feature → Drop
Low/moderate missingness + strong feature → Fill

5. Add a Missing Indicator (Often Overlooked)

Sometimes, the fact that data is missing is informative.

Instead of just filling, add a flag:

df['income_missing'] = df['income'].isnull().astype(int)

This is especially useful in:

Credit scoring
Healthcare data
Behavioral datasets

6. Decision Framework You Can Apply Immediately

Use this checklist:

How much data is missing?
- <5% → Drop or simple fill
- 5–30% → Careful imputation
- 40% → Consider dropping column
Why is it missing?
- MCAR → Drop is fine
- MAR → Impute intelligently
- MNAR → Investigate deeper
How important is the feature?
- Low importance → Drop
- High importance → Impute
What is your dataset size?

Large → Dropping is safer
Small → Prefer imputation

7. Production Perspective (What Actually Happens in Industry)

In real systems (especially in AWS-based pipelines), teams often:

Use simple imputation in ETL pipelines (fast, scalable)
Add missing indicators for ML models
Apply advanced imputation only for high-impact features
Track missingness as a data quality metric

For example:

AWS Glue handles null transformations
Feature engineering pipelines preserve missing signals
Models are trained to handle imperfect data—not perfect datasets

Final Takeaway

There is no universal rule—but there is a disciplined approach:

Drop when data is negligible or unreliable
Fill when the feature matters and patterns exist
Always consider the cause of missingness
Treat missing data as information—not just a problem

If you approach missing values strategically, you move from data cleaning to data engineering.

Advance Your Career With 16 Python Projects in Data & ML — All for $288.

Search This Blog

Practical Python for Data Engineering, Data Analysis & Machine Learning