How to Find and Count Missing Values in a Real Dataset (The Right Way)

Learn how to detect, quantify, and analyze missing values in real datasets using Python and Pandas with practical, production-ready techniques.


Why Missing Data Matters More Than You Think

In real-world datasets, missing values aren’t rare; they’re structural.

Whether you're ingesting data from APIs, CSVs, or databases, null values can distort analytics, break pipelines, and corrupt machine learning models if not handled correctly.

In Pandas, missing values are typically represented as NaN, None, or NaT (or pd.NA in the nullable dtypes). Your job is not just to find them, but to quantify and understand their distribution.
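To make those markers concrete, here is a minimal sketch. The sample values are hypothetical and chosen only to show what isnull() does and does not flag; placeholder strings such as "" or "N/A" are not treated as missing unless you convert them first.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values: three real missing markers and three look-alikes.
markers = pd.Series([np.nan, None, pd.NaT, "", "N/A", 0], dtype="object")

# isnull() flags NaN, None, and NaT, but not empty strings,
# placeholder text, or zeros.
flags = markers.isnull()
print(flags.tolist())  # [True, True, True, False, False, False]
```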


World Bank Dataset: https://data.worldbank.org/indicator/NY.GDP.MKTP.CD


Step 1: Load and Inspect the Dataset

import pandas as pd

df = pd.read_csv("your_dataset.csv")

print(df.head())  # preview the first five rows
df.info()

df.info() is your first diagnostic tool—it quickly shows non-null counts per column.
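If you don’t have a file on hand yet, the same inspection can be tried on a small in-memory CSV. This is only a sketch: the column names and values below are made up for illustration and stand in for "your_dataset.csv".

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "your_dataset.csv".
csv_text = """country,year,gdp_usd
Kenya,2020,
Kenya,2021,1.0
Uganda,2020,2.0
,2021,3.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# count() returns the same per-column non-null counts that info() prints.
non_null = df.count()
print(non_null)

df.info()
```

Empty fields in the CSV are read as NaN, so any column whose non-null count is lower than the row count has gaps.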










Step 2: Find Missing Values (Boolean Mask)

df.isnull()


This returns a DataFrame of True (missing) and False (present) with the same shape as the original. It’s useful for filtering, but on its own it is too large to read as a summary.
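As a quick illustration of the filtering use, the sketch below applies the mask to a single column. The DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in the "gdp" column.
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "gdp": [1.0, np.nan, 3.0],
})

# Boolean mask: True where "gdp" is missing.
mask = df["gdp"].isnull()

missing_gdp = df[mask]    # rows where gdp is missing
present_gdp = df[~mask]   # rows where gdp is present

print(missing_gdp)
```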



Step 3: Count Missing Values Per Column

df.isnull().sum()


This is usually the most useful single command. It gives you a column-wise count of missing values, which is critical for feature selection and cleaning decisions.
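On wide datasets the full list of counts can get long. One possible refinement, sketched here on made-up data, is to keep only the columns that actually have gaps and sort them from worst to best.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" has two gaps, "b" has none, "c" has one.
df = pd.DataFrame({
    "a": [np.nan, np.nan, 1.0],
    "b": [1, 2, 3],
    "c": ["x", None, "z"],
})

counts = df.isnull().sum()

# Keep only columns with at least one missing value, largest count first.
ranked = counts[counts > 0].sort_values(ascending=False)
print(ranked)
```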



Step 4: Calculate Missing Value Percentages

Raw counts don’t tell the full story. Normalize them:

missing_percent = df.isnull().sum() / len(df) * 100
print(missing_percent)


As a rough rule of thumb (the right cutoffs depend on your dataset and use case), this helps you identify columns that are:

  • Safe to keep (<5% missing)

  • Require imputation (5–30%)

  • Candidates for removal (>30%)
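The buckets above can be turned into a small report. This is a sketch, not a fixed recipe: the classify helper is a hypothetical name, the data is made up, and the 5% and 30% cutoffs are simply the ones listed above. Note that df.isnull().mean() * 100 gives the same percentages as dividing the counts by len(df).

```python
import numpy as np
import pandas as pd

n = 20
# Hypothetical data: 0%, 10%, and 50% missing respectively.
df = pd.DataFrame({
    "complete": range(n),
    "some_gaps": [np.nan if i < 2 else i for i in range(n)],
    "mostly_gaps": [np.nan if i < 10 else i for i in range(n)],
})


def classify(pct: float) -> str:
    """Map a missing percentage to one of the three buckets above."""
    if pct < 5:
        return "keep"
    if pct <= 30:
        return "impute"
    return "consider dropping"


missing_percent = df.isnull().mean() * 100

report = pd.DataFrame({
    "missing_percent": missing_percent,
    "action": missing_percent.map(classify),
})
print(report)
```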



Step 5: Total Missing Values in the Dataset

df.isnull().sum().sum()



This gives you a global view of data quality across the entire dataset.
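A raw total is easier to interpret next to the total number of cells. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 3 rows x 2 columns = 6 cells, 3 of them missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [np.nan, 2.0, 3.0],
})

total_missing = int(df.isnull().sum().sum())
total_cells = df.size  # rows * columns

overall_percent = total_missing / total_cells * 100
print(f"{total_missing} of {total_cells} cells missing ({overall_percent:.1f}%)")
```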


Step 6: Visualize Missing Data (Optional but Powerful)

import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(), cbar=False)
plt.show()


A heatmap reveals patterns—missingness isn’t always random.
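If you want a numeric check alongside the picture, one option (a different technique from the heatmap itself) is to correlate the missingness indicators: columns that tend to be missing in the same rows show a correlation near 1. The data below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "a" and "b" are missing in the same rows, "c" elsewhere.
df = pd.DataFrame({
    "a": [np.nan, np.nan, 1, 1, 1, 1],
    "b": [np.nan, np.nan, 2, 2, 2, 2],
    "c": [3, 3, 3, np.nan, 3, 3],
})

# 1 where a value is missing, 0 where it is present.
indicators = df.isnull().astype(int)

# Pairwise correlation of the missingness indicators.
corr = indicators.corr()
print(corr.round(2))
```

Columns with no missing values at all have a constant indicator, so their correlations come out as NaN; you may want to drop them before reading the table.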


Step 7: Filter Rows with Missing Values

df[df.isnull().any(axis=1)]


This isolates problematic rows for deeper inspection or targeted cleaning.
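A related sketch, on made-up data, counts the gaps per row so you can separate rows with a single gap from rows that are mostly empty. The cutoff of two missing values is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 has one gap, row 2 has two.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan],
    "c": [1.0, 2.0, 3.0],
})

# Rows with at least one missing value.
rows_with_gaps = df[df.isnull().any(axis=1)]

# Number of missing values in each row.
missing_per_row = df.isnull().sum(axis=1)

# Rows with two or more gaps (illustrative cutoff).
mostly_empty = df[missing_per_row >= 2]

print(rows_with_gaps)
print(mostly_empty)
```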


Strategic Insight: Missingness is a Signal

In advanced analytics, missing data can indicate:

  • System failures (e.g., logging gaps)

  • User behavior (e.g., optional fields skipped)

  • Data pipeline issues

Before dropping or imputing, ask: Why is this data missing?
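One way to start answering that question is to compare the missing rate across groups. In this sketch the "source" and "email" columns are hypothetical; a rate concentrated in one group hints at a pipeline or collection issue rather than random gaps.

```python
import pandas as pd

# Hypothetical data: every record from the "csv" source lacks an email.
df = pd.DataFrame({
    "source": ["api", "api", "csv", "csv"],
    "email": ["x@example.com", "y@example.com", None, None],
})

# Percentage of missing emails within each source.
rate_by_source = df["email"].isnull().groupby(df["source"]).mean() * 100
print(rate_by_source)
```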


Final Takeaway

Handling missing values is not a cleanup task; it’s a data quality audit step.

In production-grade pipelines (especially on AWS using tools like Glue or Athena), missing value analysis should be automated and monitored.
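What an automated check looks like depends on your stack. As a plain-Pandas sketch (not tied to Glue or Athena), the hypothetical check_missing helper below returns the columns that exceed a missing-value threshold, so a pipeline step can log or fail on them. The function name, the sample data, and the 30% threshold are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def check_missing(df: pd.DataFrame, max_percent: float) -> dict:
    """Return {column: missing_percent} for columns above max_percent."""
    pct = df.isnull().mean() * 100
    return pct[pct > max_percent].to_dict()


# Hypothetical batch: "score" is 50% missing, "id" is complete.
batch = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "score": [0.5, np.nan, np.nan, 0.9],
})

violations = check_missing(batch, max_percent=30.0)

if violations:
    # In a real pipeline this might raise an error or emit a metric instead.
    print(f"Missing-value threshold exceeded: {violations}")
```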

If you treat missing data as noise, you’ll miss the signal.


