How to Create a Before-and-After Comparison Table for Cleaned Data (Google Colab)

Learn how to create a before-and-after comparison table in Python using pandas to clearly show the impact of your data cleaning steps.





Step 1: Set Up Google Colab and Load Data

import pandas as pd
from google.colab import files

uploaded = files.upload()
file_name = list(uploaded.keys())[0]

df = pd.read_csv(file_name)
df_original = df.copy()  # Preserve original data




Step 2: Apply Your Cleaning Steps

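Example cleaning (replace 'country' with a column from your own dataset):

def clean_data(df):
    # Drop rows with no country; .copy() avoids SettingWithCopyWarning on the assignments below
    df = df.dropna(subset=['country']).copy()

    # Fill missing numeric values with each column's median
    for col in df.select_dtypes(include='number').columns:
        df[col] = df[col].fillna(df[col].median())

    # Trim whitespace and lowercase text columns
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].str.strip().str.lower()

    return df

df_clean = clean_data(df)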



Step 3: Create a Summary Comparison Table

Compare key metrics before vs after cleaning.

comparison = pd.DataFrame({
    "Metric": [
        "Row Count",
        "Column Count",
        "Missing Values",
        "Duplicate Rows"
    ],
    "Before Cleaning": [
        df_original.shape[0],
        df_original.shape[1],
        df_original.isna().sum().sum(),
        df_original.duplicated().sum()
    ],
    "After Cleaning": [
        df_clean.shape[0],
        df_clean.shape[1],
        df_clean.isna().sum().sum(),
        df_clean.duplicated().sum()
    ]
})

comparison
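
The same metrics can be wrapped in a small helper so they are computed the same way for any pair of DataFrames. This is a minimal sketch rather than part of the walkthrough itself: the function names `summarize` and `build_comparison`, the extra "Change" column, and the toy data are all illustrative assumptions.

```python
import pandas as pd

def summarize(df):
    """Return the four summary metrics for one DataFrame (hypothetical helper)."""
    return {
        "Row Count": df.shape[0],
        "Column Count": df.shape[1],
        "Missing Values": int(df.isna().sum().sum()),
        "Duplicate Rows": int(df.duplicated().sum()),
    }

def build_comparison(before, after):
    """Build a before/after table plus a Change column (after minus before)."""
    b, a = summarize(before), summarize(after)
    table = pd.DataFrame({
        "Metric": list(b.keys()),
        "Before Cleaning": list(b.values()),
        "After Cleaning": list(a.values()),
    })
    table["Change"] = table["After Cleaning"] - table["Before Cleaning"]
    return table

# Toy data standing in for an uploaded file (assumed, not from a real dataset)
toy_before = pd.DataFrame({
    "country": ["Kenya", None, "Uganda", "Uganda"],
    "gdp": [1.0, 2.0, None, None],
})
toy_after = toy_before.dropna(subset=["country"]).drop_duplicates()

toy_table = build_comparison(toy_before, toy_after)
print(toy_table)
```

In the notebook, the equivalent call would be `build_comparison(df_original, df_clean)`. A negative value in the Change column means the metric went down after cleaning.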




Step 4: Compare Specific Columns (Optional)

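Track changes in important columns by placing original and cleaned values side by side.

col = 'country'  # use a column that exists in your dataset

# Select the original rows that survived cleaning so dropped rows don't show up as NaN
column_comparison = pd.DataFrame({
    "Original": df_original.loc[df_clean.index, col].head(10),
    "Cleaned": df_clean[col].head(10)
})

column_comparison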
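
In a long column, the first ten rows may not include any value that changed. One possible way to list only the changed values is sketched below on toy data. The variable names and sample values are assumptions for illustration, and the approach relies on cleaning keeping the original index labels, as `dropna` does.

```python
import pandas as pd

# Toy data standing in for one text column before and after cleaning
toy_original = pd.DataFrame({"country": [" Kenya", "uganda", None, "TANZANIA "]})
toy_clean = toy_original.dropna(subset=["country"]).copy()
toy_clean["country"] = toy_clean["country"].str.strip().str.lower()

# Align the original to the rows that survived cleaning
before = toy_original.loc[toy_clean.index, "country"]
after = toy_clean["country"]

# Keep only rows where the value actually changed.
# Note: NaN != NaN is True, so a value missing in both would also be flagged.
changed_mask = before != after
changed_values = pd.DataFrame({
    "Original": before[changed_mask],
    "Cleaned": after[changed_mask],
})
print(changed_values)
```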







Step 5: Highlight Changes (Optional)

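Show only the rows and columns whose values changed.

# compare() requires identical row and column labels,
# so first select the original rows that survived cleaning
changes = df_original.loc[df_clean.index].compare(df_clean)
changes.head()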
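
Because `.compare()` requires both DataFrames to have the same row and column labels, rows removed during cleaning have to be reported separately. The sketch below, on toy data, lists the dropped rows alongside the cell-level changes. The names `dropped_rows` and `cell_changes` are illustrative, and it assumes cleaning preserves the original index labels.

```python
import pandas as pd

# Toy data: one row has no country and will be dropped
toy_original = pd.DataFrame({
    "country": ["Kenya ", None, "uganda"],
    "gdp": [1.0, 2.0, 3.0],
})
toy_clean = toy_original.dropna(subset=["country"]).copy()
toy_clean["country"] = toy_clean["country"].str.strip().str.lower()

# Rows whose index label no longer exists after cleaning
dropped_rows = toy_original.loc[toy_original.index.difference(toy_clean.index)]

# Cell-level changes among the rows that remain;
# "self" holds the original value, "other" the cleaned value
cell_changes = toy_original.loc[toy_clean.index].compare(toy_clean)

print(dropped_rows)
print(cell_changes)
```

By default `.compare()` omits rows and columns with no differences, so an unchanged column such as `gdp` does not appear in the result.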



Step 6: Export the Comparison

comparison.to_csv("comparison_summary.csv", index=False)
files.download("comparison_summary.csv")



Key Rules

  • Always keep a copy of original data (df_original)

  • Compare structure (shape), quality (missing values), and duplicates

  • Use .compare() for cell-level differences, after aligning both DataFrames to the same rows and columns

  • Focus on the metrics stakeholders understand easily


What You Just Achieved

  • Clear visibility of cleaning impact

  • Reusable comparison framework

  • Stakeholder-ready reporting

This approach makes your data cleaning measurable, not just assumed.


