How to Document Your Data Cleaning Decisions So Others Can Reproduce Them

 Your production notebook shouldn't just clean data — it should explain every choice you made along the way.



Data cleaning is where most of the real work happens. It's also where reproducibility dies. You drop some rows, fill some nulls, rename a few columns — and six months later, no one (including you) knows why.

This isn't about being tidy for the sake of it. If a teammate picks up your notebook and runs it, they should get the same result — and understand every decision you made. 

Here's how to do that in Colab.

1. Use Markdown Cells as Decision Logs

Most people use markdown cells as section titles. That's not enough. 

Each cleaning step should have a markdown cell that answers: what did you find, what did you decide, and why?
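One way to keep those three answers consistent from step to step is to generate the cell text with a small helper. This is only a sketch: `decision_note` and its parameters are hypothetical names, not part of Colab or pandas, and the example values are taken from the test-account step described later in this post.

```python
def decision_note(step, found, decided, why):
    """Return markdown text for a decision-log cell (hypothetical helper)."""
    lines = [
        f"**{step}**",
        "",
        f"- **Found:** {found}",
        f"- **Decided:** {decided}",
        f"- **Why:** {why}",
    ]
    return "\n".join(lines)


note = decision_note(
    step="Remove test accounts",
    found="47 rows have an email containing @test.com or @example.com",
    decided="Drop those rows before any analysis",
    why="They are internal QA accounts, not real users",
)
print(note)
```

You could paste the printed text into a markdown cell, or render it in place with `IPython.display.Markdown(note)`.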




This isn't documentation overhead. It's the reasoning that makes your notebook a reproducible artifact rather than a mystery script.


2. Log What You Found Before You Clean It

Always print a snapshot of the problem you're solving before you fix it. This gives anyone reading the notebook proof that the issue existed.

    import pandas as pd

    df = pd.read_csv('orders.csv')

    # Snapshot BEFORE cleaning
    print("Missing values per column:")
    print(df.isnull().sum())
    print(f"\nDuplicate rows: {df.duplicated().sum()}")
    print(f"Total rows: {len(df)}")
    print(df.dtypes)

Run this cell and leave its output visible. Don't clear the output before sharing. That printed state is evidence.
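If you want the same snapshot again after cleaning, it may help to put it in a function so both printouts use identical logic. The sketch below assumes a hypothetical helper named `snapshot` and uses a tiny made-up frame in place of `orders.csv`.

```python
import pandas as pd


def snapshot(df, label):
    """Print and return basic data-quality counts for df (hypothetical helper)."""
    stats = {
        "label": label,
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_cells": int(df.isnull().sum().sum()),
    }
    print(stats)
    return stats


# Tiny made-up frame standing in for orders.csv
demo = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 3],
        "amount": [10.0, 20.0, 20.0, None],
    }
)

before = snapshot(demo, "before cleaning")
after = snapshot(demo.drop_duplicates(), "after drop_duplicates")
```

Because the function returns the counts as well as printing them, the two dicts can be compared or stored later in the notebook.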

3. Wrap Every Cleaning Step in a Function with a Docstring


Instead of chaining operations inline, write small functions. The docstring is where you log the decision permanently — it lives in the code, not a separate document.

    def remove_test_accounts(df):
        """
        Remove rows where email contains '@test.com' or '@example.com'.

        Reason: These are internal QA accounts and should not appear in
        any user-facing analysis. Identified manually via a sample audit
        on 2024-03-10. 47 rows removed in the current dataset.

        Returns a cleaned copy of the dataframe.
        """
        mask = ~df['email'].str.contains(
            r'@test\.com|@example\.com', regex=True, na=False
        )
        cleaned = df[mask].copy()
        print(f"Removed {len(df) - len(cleaned)} test account rows.")
        return cleaned


    df = remove_test_accounts(df)


4. Build a Cleaning Log

For datasets with many cleaning steps, maintain a running log as a Python list of dicts, one dict per step. This compiles into a readable summary table at the end of your notebook.

    cleaning_log = []

    def log_step(step, before, after, reason):
        cleaning_log.append({
            "step": step,
            "rows_before": before,
            "rows_after": after,
            "rows_removed": before - after,
            "reason": reason
        })

    # Use it at each step
    before = len(df)
    df = df.drop_duplicates(subset=['order_id'])
    log_step(
        step="Drop duplicate order IDs",
        before=before,
        after=len(df),
        reason="order_id should be a primary key. Duplicates indicate "
               "upstream ETL bug. Keeping first occurrence."
    )

    # At the end of your notebook
    pd.DataFrame(cleaning_log)

This outputs a clean table showing every transformation in sequence, which is exactly what a reviewer or collaborator needs to audit your work.
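If counting rows by hand at every step feels error-prone, one possible variation is a decorator that records the counts around each cleaning function. This is a sketch under assumptions: `logged_step` and `drop_duplicate_orders` are hypothetical names, and a plain list of dicts stands in for the dataframe so the example runs without a data file.

```python
import functools

cleaning_log = []


def logged_step(step, reason):
    """Decorator (hypothetical) that records row counts around a cleaning function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(data, *args, **kwargs):
            before = len(data)
            result = func(data, *args, **kwargs)
            cleaning_log.append({
                "step": step,
                "rows_before": before,
                "rows_after": len(result),
                "rows_removed": before - len(result),
                "reason": reason,
            })
            return result
        return wrapper
    return decorator


@logged_step("Drop duplicate order IDs", "order_id should be a primary key")
def drop_duplicate_orders(rows):
    """Keep the first row seen for each order_id."""
    seen = set()
    kept = []
    for row in rows:
        if row["order_id"] not in seen:
            seen.add(row["order_id"])
            kept.append(row)
    return kept


# Made-up rows used only for illustration
orders = [{"order_id": 1}, {"order_id": 2}, {"order_id": 2}]
orders = drop_duplicate_orders(orders)
print(cleaning_log)
```

The wrapper relies only on `len()`, so the same decorator should also work on a function that takes and returns a pandas DataFrame.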


5. Pin Your Assumptions with Assertions

After cleaning, add assertions that enforce what you expect to be true. These are not just tests — they're documented assumptions. 

If someone changes the upstream data and reruns the notebook, they'll know exactly which assumption broke.

    # Document your assumptions explicitly
    assert df['order_id'].is_unique, \
        "order_id must be unique after deduplication"

    assert df['amount'].min() >= 0, \
        "No negative amounts expected; flag for upstream review"

    assert df['status'].isin(['complete', 'pending', 'refunded']).all(), \
        "Unexpected status value found; update allowed values list"

    print("All assertions passed.")
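A bare `assert` stops at the first failure, so a reader only learns about one broken assumption per run. As a possible variation, the sketch below collects every failed check before stopping. `check_assumptions` is a hypothetical helper, and the rows are made up so that two of the three assumptions fail.

```python
import pandas as pd


def check_assumptions(checks):
    """Return the messages of every failed check (hypothetical helper).

    `checks` maps a message to a boolean that is True when the assumption holds.
    """
    return [message for message, ok in checks.items() if not ok]


# Made-up rows for illustration; order_id repeats and one amount is negative
df = pd.DataFrame(
    {
        "order_id": [1, 2, 2],
        "amount": [10.0, -5.0, 7.5],
        "status": ["complete", "pending", "refunded"],
    }
)

failures = check_assumptions(
    {
        "order_id must be unique": bool(df["order_id"].is_unique),
        "amount must be non-negative": bool(df["amount"].min() >= 0),
        "status must be an allowed value": bool(
            df["status"].isin(["complete", "pending", "refunded"]).all()
        ),
    }
)
print(failures)
```

In a real notebook you would likely follow this with `assert not failures, failures` so the run still halts, but with the full list of broken assumptions in the error message.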


6. Always Save a Before and After Snapshot

At the top of your notebook, save the raw data shape. At the bottom, save the clean data shape alongside your log. This makes it trivial to verify the pipeline end-to-end.

    # Top of notebook: load the data and capture the raw state
    df_raw = pd.read_csv('orders.csv')
    df = df_raw.copy()  # clean the copy; df_raw stays untouched
    raw_shape = df_raw.shape
    print(f"Raw data: {raw_shape[0]} rows, {raw_shape[1]} columns")

    # ... all cleaning steps ...

    # Bottom of notebook: final state + summary
    print("\n--- Cleaning Summary ---")
    print(f"Raw: {raw_shape[0]} rows")
    print(f"Clean: {len(df)} rows")
    print(f"Removed: {raw_shape[0] - len(df)} rows total\n")
    pd.DataFrame(cleaning_log)
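Printed output disappears if someone clears the notebook, so you may also want to write the summary to a file that travels with the cleaned data. The sketch below is one way to do that with the standard library. `save_cleaning_summary`, the file name, and the row counts are all placeholders chosen for illustration.

```python
import json
import os
import tempfile


def save_cleaning_summary(path, raw_rows, clean_rows, cleaning_log):
    """Write row counts and the step log to a JSON file (hypothetical helper)."""
    summary = {
        "raw_rows": raw_rows,
        "clean_rows": clean_rows,
        "rows_removed": raw_rows - clean_rows,
        "steps": cleaning_log,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2)
    return summary


# Placeholder numbers; in the notebook these would come from raw_shape and len(df)
example_log = [
    {"step": "Drop duplicate order IDs", "rows_before": 100, "rows_after": 97}
]
out_path = os.path.join(tempfile.mkdtemp(), "cleaning_summary.json")
summary = save_cleaning_summary(
    out_path, raw_rows=100, clean_rows=97, cleaning_log=example_log
)

# Read the file back to confirm it round-trips
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded["rows_removed"])
```

JSON is used here only because it is human-readable and needs no extra dependencies; any format your team already shares would serve the same purpose.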


The Rule of Thumb

For every decision you make while cleaning data, ask yourself: if I deleted this cell, would someone be able to figure out what I did and why? If the answer is no, document it.

A well-documented cleaning notebook is one where the markdown cells tell the story and the code cells prove it. Together, that's reproducibility.

