How to Find and Count Missing Values in a Real Dataset (The Right Way)
Learn how to detect, quantify, and analyze missing values in real datasets using Python and Pandas with practical, production-ready techniques.
Why Missing Data Matters More Than You Think
In real-world datasets, missing values aren’t rare—they’re structural.
Whether you're ingesting data from APIs, CSVs, or databases, null values can distort analytics, break pipelines, and corrupt machine learning models if not handled correctly.
In Pandas, missing values are typically represented as NaN, None, or NaT (and as pd.NA in the nullable dtypes). Your job is not just to find them, but to quantify them and understand how they are distributed.
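As a small illustration of those markers, the sketch below builds a toy frame (the column names and values are made up for this example) and shows that isnull() flags NaN, None, and NaT alike.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mixing the three common missing-value markers.
sample = pd.DataFrame({
    "gdp": [1.5, np.nan, 3.2],                                       # NaN in a float column
    "country": ["Kenya", None, "Peru"],                              # None in a text column
    "updated": pd.to_datetime(["2021-01-01", None, "2021-03-01"]),   # None becomes NaT
})

# isnull() treats all three markers uniformly as missing.
missing_mask = sample.isnull()
print(missing_mask)
```

The middle row comes back as True in every column, even though each column stores its missing value differently.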
Example dataset: World Bank GDP (current US$), https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
Step 1: Load and Inspect the Dataset
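import pandas as pd
df = pd.read_csv("your_dataset.csv")  # replace with the path to your file
print(df.head())  # preview the first five rows
df.info()  # prints dtypes and non-null counts per column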
df.info() is your first diagnostic tool—it quickly shows non-null counts per column.
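If you want those non-null counts as data rather than printed text, df.count() returns them as a Series. The frame below is a hypothetical stand-in for your loaded dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset loaded in Step 1.
df = pd.DataFrame({
    "country": ["Kenya", "Peru", None, "Chile"],
    "gdp": [1.5, np.nan, np.nan, 4.1],
})

df.info()                # prints dtypes and non-null counts
non_null = df.count()    # the same non-null counts, as a Series you can reuse
print(non_null)
```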
Step 2: Find Missing Values (Boolean Mask)
df.isnull()
This returns a DataFrame of the same shape as df, with True where a value is missing and False where it is present. It’s useful for filtering, but on a large dataset it is too verbose to serve as a summary.
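Here is a minimal sketch of the mask used as a filter, assuming a hypothetical frame with a gdp column: one column of the mask selects the rows where that column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "gdp" is an illustrative column name.
df = pd.DataFrame({
    "country": ["Kenya", "Peru", "Chile", "Ghana"],
    "gdp": [1.5, np.nan, 4.1, np.nan],
})

mask = df.isnull()                  # boolean DataFrame, same shape as df
rows_missing_gdp = df[mask["gdp"]]  # keep only rows where gdp is missing
print(rows_missing_gdp)
```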
Step 3: Count Missing Values Per Column
df.isnull().sum()
This is the command you will use most often. Summing the boolean mask counts each True as 1, which gives you a column-wise count of missing values that is critical for feature selection and cleaning decisions.
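On wide datasets it often helps to show only the columns that actually have gaps, worst first. This is a sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with different amounts of missingness per column.
df = pd.DataFrame({
    "country": ["Kenya", "Peru", "Chile", "Ghana"],
    "gdp": [1.5, np.nan, 4.1, np.nan],
    "population": [53.0, 33.0, np.nan, 31.0],
})

missing_counts = df.isnull().sum()
# Drop fully populated columns and sort the rest in descending order.
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```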
Step 4: Calculate Missing Value Percentages
Raw counts don’t tell the full story. Normalize them:
missing_percent = df.isnull().sum() / len(df) * 100
print(missing_percent)
As a rough rule of thumb (the right cutoffs depend on your data and use case), this sorts columns into three groups:
Safe to keep (<5% missing)
In need of imputation (5–30% missing)
Candidates for removal (>30% missing)
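The thresholds above can be turned into a quick triage table. The sketch below assumes those same cutoffs (5% and 30%), which you should tune to your own data; the triage helper and the frame are hypothetical.

```python
import numpy as np
import pandas as pd

n = 20
# Hypothetical frame: "a" has 0% missing, "b" 10%, "c" 50%.
df = pd.DataFrame({
    "a": np.arange(n, dtype=float),
    "b": [np.nan] * 2 + [1.0] * (n - 2),
    "c": [np.nan] * 10 + [1.0] * (n - 10),
})

missing_percent = df.isnull().mean() * 100  # same result as sum() / len(df) * 100

def triage(pct):
    """Map a missing percentage to a suggested action (illustrative cutoffs)."""
    if pct < 5:
        return "keep"
    if pct <= 30:
        return "impute"
    return "consider dropping"

actions = missing_percent.apply(triage)
print(pd.DataFrame({"missing_percent": missing_percent, "action": actions}))
```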
Step 5: Total Missing Values in the Dataset
df.isnull().sum().sum()
This gives you a global view of data quality across the entire dataset.
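A raw total is easier to interpret relative to the number of cells. This sketch on made-up data divides by df.size (rows times columns) to get an overall missing percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 4 rows x 2 columns = 8 cells, 3 of them missing.
df = pd.DataFrame({
    "country": ["Kenya", "Peru", None, "Chile"],
    "gdp": [1.5, np.nan, np.nan, 4.1],
})

total_missing = int(df.isnull().sum().sum())
overall_percent = total_missing / df.size * 100  # df.size counts every cell
print(f"{total_missing} of {df.size} cells missing ({overall_percent:.1f}%)")
```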
Step 6: Visualize Missing Data (Optional but Powerful)
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False)
plt.show()
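Each row of the plot corresponds to a row of the dataset, and cells in the contrasting color mark missing values. Bands or blocks in the heatmap reveal patterns: missingness isn’t always random.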
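A numeric complement to the plot is to correlate the missingness indicators themselves. This is a sketch on hypothetical columns: a correlation near 1 means two columns tend to be missing in the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data: gdp and gdp_per_capita are missing in the same rows.
df = pd.DataFrame({
    "gdp": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "gdp_per_capita": [10.0, np.nan, 30.0, np.nan, 50.0, 60.0],
    "population": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
})

indicator = df.isnull().astype(int)  # 1 = missing, 0 = present
missing_corr = indicator.corr()      # pairwise correlation of missingness
print(missing_corr.round(2))
```

Columns with no missing values at all have a constant indicator, so their correlations come out as NaN. Drop those columns first if they clutter the result.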
Step 7: Filter Rows with Missing Values
df[df.isnull().any(axis=1)]
This isolates problematic rows for deeper inspection or targeted cleaning.
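To prioritize which rows to inspect first, you can also count missing values per row. This is a sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps spread unevenly across rows.
df = pd.DataFrame({
    "country": ["Kenya", None, "Chile", "Ghana"],
    "gdp": [1.5, np.nan, 4.1, np.nan],
    "population": [53.0, np.nan, 19.0, 31.0],
})

rows_with_missing = df[df.isnull().any(axis=1)]  # rows with at least one gap
missing_per_row = df.isnull().sum(axis=1)        # number of gaps in each row
worst_first = missing_per_row.sort_values(ascending=False)
print(rows_with_missing)
print(worst_first)
```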
Strategic Insight: Missingness is a Signal
In advanced analytics, missing data can indicate:
System failures (e.g., logging gaps)
User behavior (e.g., optional fields skipped)
Data pipeline issues
Before dropping or imputing, ask: Why is this data missing?
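One way to start answering that question is to compare missing rates across a grouping column. The sketch below assumes a hypothetical source column that records where each row came from. A large gap between groups hints at a pipeline or collection issue rather than random loss.

```python
import pandas as pd

# Hypothetical data: "source" and "email" are illustrative column names.
df = pd.DataFrame({
    "source": ["api", "api", "csv", "csv", "csv", "api"],
    "email": ["a@x.org", None, None, None, "b@x.org", "c@x.org"],
})

# Share of missing emails within each source.
missing_rate_by_source = df["email"].isnull().groupby(df["source"]).mean()
print(missing_rate_by_source)
```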
Final Takeaway
Handling missing values is not a cleanup task—it’s a data quality audit step.
In production-grade pipelines (especially on AWS using tools like Glue or Athena), missing value analysis should be automated and monitored.
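What an automated check might look like, independent of any particular platform: the sketch below uses a hypothetical helper, check_missing, with an assumed 30% threshold. It returns the columns that breach the limit so a pipeline step can log them or fail.

```python
import numpy as np
import pandas as pd

def check_missing(df, max_percent=30.0):
    """Return {column: missing %} for columns above max_percent (hypothetical helper)."""
    pct = df.isnull().mean() * 100
    return pct[pct > max_percent].to_dict()

# Hypothetical batch of incoming data: "c" is 50% missing.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, 3.0, 4.0],
})

violations = check_missing(df, max_percent=30.0)
if violations:
    # In a real pipeline this could raise, alert, or write to a monitoring table.
    print(f"Columns over the missing-value threshold: {violations}")
```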
If you treat missing data as noise, you’ll miss the signal.