How to Create a Before-and-After Comparison Table for Cleaned Data (Google Colab)
Learn how to create a before-and-after comparison table in Python using pandas to clearly show the impact of your data cleaning steps.
Step 1: Set Up Google Colab and Load Data
import pandas as pd
from google.colab import files
uploaded = files.upload()
file_name = list(uploaded.keys())[0]
df = pd.read_csv(file_name)
df_original = df.copy() # Preserve original data
Step 2: Apply Your Cleaning Steps
Example cleaning:
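def clean_data(df):
    # Drop rows with no country; copy so later assignments do not touch a view
    df = df.dropna(subset=['country']).copy()
    # Fill missing numeric values with each column's median
    for col in df.select_dtypes(include='number').columns:
        df[col] = df[col].fillna(df[col].median())
    # Trim whitespace and lowercase text columns
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].str.strip().str.lower()
    return df

df_clean = clean_data(df)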
Step 3: Create a Summary Comparison Table
Compare key metrics before vs after cleaning.
comparison = pd.DataFrame({
    "Metric": [
        "Row Count",
        "Column Count",
        "Missing Values",
        "Duplicate Rows"
    ],
    "Before Cleaning": [
        df_original.shape[0],
        df_original.shape[1],
        df_original.isna().sum().sum(),
        df_original.duplicated().sum()
    ],
    "After Cleaning": [
        df_clean.shape[0],
        df_clean.shape[1],
        df_clean.isna().sum().sum(),
        df_clean.duplicated().sum()
    ]
})
comparison
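The totals above do not show which columns the missing values came from. A per-column breakdown can be sketched as follows; the `before` and `after` frames are small hypothetical stand-ins for `df_original` and `df_clean`, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical sample data standing in for df_original
before = pd.DataFrame({
    "country": ["  Kenya", None, "PERU "],
    "population": [53.0, 10.0, None],
})

# Hypothetical cleaning: drop rows without a country, fill numeric gaps
after = before.dropna(subset=["country"]).copy()
after["population"] = after["population"].fillna(after["population"].median())

# Missing values per column, side by side
missing_by_column = pd.DataFrame({
    "Missing Before": before.isna().sum(),
    "Missing After": after.isna().sum(),
})
# How many missing cells were removed, either by filling or by dropping rows
missing_by_column["Filled or Dropped"] = (
    missing_by_column["Missing Before"] - missing_by_column["Missing After"]
)
print(missing_by_column)
```

Swapping in `df_original` and `df_clean` for the sample frames should give the same breakdown for real data.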
Step 4: Compare Specific Columns (Optional)
Track changes in important columns.
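# Pick the first 10 rows that survived cleaning so both columns line up by index.
# Replace 'country' with whichever column you want to inspect.
sample_index = df_clean.index[:10]
column_comparison = pd.DataFrame({
    "Original": df_original.loc[sample_index, 'country'],
    "Cleaned": df_clean.loc[sample_index, 'country']
})
column_comparison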
Step 5: Highlight Changes (Optional)
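Show only the cells that changed. Because .compare() requires both DataFrames to have identical row and column labels, first restrict the original to the rows that survived cleaning.
changes = df_original.loc[df_clean.index].compare(df_clean)
changes.head()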
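Rows that cleaning removed entirely never show up in a .compare() result, so it can help to list them separately. The following is a minimal sketch on hypothetical sample data (the `before` and `after` frames and their column names are assumptions, not the article's dataset).

```python
import pandas as pd

# Hypothetical sample data standing in for df_original
before = pd.DataFrame({
    "country": [" Kenya", None, "PERU "],
    "population": [53.0, 10.0, 33.0],
})

# Hypothetical cleaning: drop rows without a country, normalize text
after = before.dropna(subset=["country"]).copy()
after["country"] = after["country"].str.strip().str.lower()

# Rows present before cleaning but absent afterwards
dropped_rows = before.loc[~before.index.isin(after.index)]

# Align the original to the surviving rows, then compare cell by cell
shared = before.loc[after.index]
changes = shared.compare(after)

print(dropped_rows)
print(changes)
```

In the `changes` table, pandas labels the original values `self` and the cleaned values `other`, and it omits columns in which nothing changed.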
Step 6: Export the Comparison
comparison.to_csv("comparison_summary.csv", index=False)
files.download("comparison_summary.csv")
Key Rules
Always keep a copy of the original data (df_original)
Compare structure (shape), quality (missing values), and duplicates
Use .compare() for row-level differences
Focus on the metrics stakeholders understand easily
What You Just Achieved
Clear visibility of cleaning impact
Reusable comparison framework
Stakeholder-ready reporting
This approach makes your data cleaning measurable, not just assumed.
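To reuse the summary across datasets, the Step 3 metrics can be wrapped in a function. The sketch below is one possible shape, not a fixed recipe: `summarize` and `build_comparison` are hypothetical helper names, the "Change" column is an added convenience, and the sample frames are made up for illustration.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Return the four summary metrics for one DataFrame."""
    return {
        "Row Count": df.shape[0],
        "Column Count": df.shape[1],
        "Missing Values": int(df.isna().sum().sum()),
        "Duplicate Rows": int(df.duplicated().sum()),
    }

def build_comparison(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Build a before/after table with one row per metric."""
    b, a = summarize(before), summarize(after)
    table = pd.DataFrame({
        "Metric": list(b.keys()),
        "Before Cleaning": list(b.values()),
        "After Cleaning": [a[key] for key in b],
    })
    # Negative values mean the metric went down after cleaning
    table["Change"] = table["After Cleaning"] - table["Before Cleaning"]
    return table

# Hypothetical sample data; drop_duplicates() is used here only for illustration
before = pd.DataFrame({
    "country": ["kenya", "kenya", None],
    "population": [53.0, 53.0, None],
})
after = before.dropna(subset=["country"]).drop_duplicates()

report = build_comparison(before, after)
print(report)
```

Calling `build_comparison(df_original, df_clean)` should produce the same table layout as Step 3, plus the extra "Change" column.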