Posts

Showing posts from April, 2026

How to Use Pandas pipe() to Chain Cleaning Steps Cleanly

Image
Data cleaning code can quickly become messy when every transformation is written on a separate line. pandas.pipe() helps you build cleaner, reusable, and more readable workflows. Instead of nesting functions or rewriting DataFrames repeatedly, you can chain transformations step by step. Why Use pipe() ? pipe() lets you pass a DataFrame through custom functions in sequence. Benefits: Cleaner code Easier debugging Reusable cleaning functions Better readability for pipelines Sample Dataset Look for any Dataset here: FREE DATASETS import pandas as pd df = pd.read_csv("survey_data.csv") print(df.head()) Step 1: Create Cleaning Functions Each function takes a DataFrame and returns a cleaned DataFrame. def remove_duplicates(df): return df.drop_duplicates() def standardize_country(df): df["country"] = df["country"].str.strip().str.title() return df def fill_missing_age(df): df["age"] = df["age"].fillna(df["age"]...

How to Deal With Outliers in Census Data — Drop, Cap, or Keep?

Image
Afrobarometer uses Likert scales, categorical codes, and a few continuous variables.  Most "outliers" here aren't extreme values — they're refusals, don't-knows, and skip patterns encoded as numbers. Before you touch a single row, understand this: Afrobarometer encodes non-responses as  -1 ,  8 ,  9 , and  98 / 99  depending on the variable. These will show up as outliers in any statistical check. They are not — they are missing data in disguise. Do this before any outlier check. Otherwise your IQR and percentile calculations are poisoned by refusal codes. After handling sentinels, true impossible values are rare. A trust variable scored 0–3 should have nothing above 3. A poverty frequency variable scored 0–4 should have nothing at 7. # Q56A: Trust in President — valid range 0 (not at all) to 3 (a lot) # Anything outside this after sentinel removal is a data entry error. before = df[ 'Q56A' ].notna().sum() df.loc[~df[ 'Q56A' ].between(0, 3), ...

How to Document Your Data Cleaning Decisions So Others Can Reproduce Them

Image
  Your production notebook shouldn't just clean data — it should explain every choice you made along the way. Data cleaning is where most of the real work happens. It's also where reproducibility dies. You drop some rows, fill some nulls, rename a few columns — and six months later, no one (including you) knows why. This isn't about being tidy for the sake of it. If a teammate picks up your notebook and runs it, they should get the same result — and understand every decision you made.  Here's how to do that in Colab. 1. Use Markdown Cells as Decision Logs Most people use markdown cells as section titles. That's not enough.  Each cleaning step should have a markdown cell that answers:  what did you find, what did you decide, and why? This isn't documentation overhead. It's the reasoning that makes your notebook a reproducible artifact rather than a mystery script. 2. Log What You Found Before You Clean It Always print a snapshot of the problem you're sol...

How to Standardise Inconsistent Text Values (e.g. 'Nairobi' vs 'nairobi')

Image
In real-world datasets, text columns are rarely clean. You’ll often see variations like "Kenya" , "kenya " , "KENYA" , or "Ken-ya" .  Standardising these values is essential for reliable grouping, analysis, and reporting. Below is a minimal, practical workflow using a CSV file in Python (Pandas). 1. Load the Data import pandas as pd df = pd.read_csv("survey_data.csv") df.head() 2. Inspect the inconsistent column df['country'].unique() This helps you identify variations like: extra spaces inconsistent casing spelling differences 3. Standardise text values Step 1: Lowercase everything Step 2: Strip spaces Step 3: Replace known inconsistencies df['country_clean'] = ( df['country'] .str.lower() .str.strip() .replace({ 'ken-ya': 'kenya', 'k e n y a': 'kenya' }) ) 4. Optional: Capitalise properly If you want clean reporting format: df['country_c...

How to Handle Inconsistent Date Formats in Any Survey Data

Image
Most real-world survey datasets don’t come clean—especially date fields.  You’ll often get a .txt file with mixed formats, broken rows, and inconsistent structure. This guide shows how to: Load a messy .txt dataset Inspect raw date values Standardize them into a usable format Step 1: Load the .txt File Correctly A .txt file is not always a clean CSV. Use a flexible parser. import pandas as pd url = "YOUR_URL_HERE" df = pd.read_csv( url, header=None, dtype=str, engine='python', on_bad_lines='skip' # ← THIS fixes your error ) df.head() Here is the DATE DATASETS Why this works: engine='python' → handles inconsistent rows dtype=str → prevents pandas from guessing formats header=None → treats all rows as raw data Step 2: Name Your Column df.columns = ['raw_date'] Now you have a clean starting point: df.head() Step 3: Inspect the Date Formats df['raw_date'].unique()[:10] Typical outputs: 2024-01-05 01/05/2024...

How to Remove Duplicate Rows from a Survey Dataset

Image
Learn how to remove duplicate rows from a survey dataset using Python and pandas. Clean your data by identifying and dropping repeated entries to ensure accurate analysis. Duplicate rows in survey data occur when the same response is recorded more than once, often due to system errors or repeated submissions.  These duplicates can distort analysis results and must be removed before processing. Step 0: Load the Data import pandas as pd from google.colab import files uploaded = files.upload() file_name = list (uploaded.keys())[ 0 ] df = pd.read_excel(file_name, skiprows= 4 ) df = df.dropna(axis= 1 , how= 'all' ) df.head() Step 1: Identify duplicates df.duplicated().sum() Step 2: Remove duplicates df = df.drop_duplicates() print(df) Step 3: Confirm removal df.duplicated().sum() print(df) Key point Always deduplicate early in your pipeline to ensure each survey response is counted only once. Advance Your Career With 16 Python Projects in Data & ML — All for $288.

How to Create a Before-and-After Comparison Table for Cleaned Data (Google Colab)

Image
Learn how to create a before-and-after comparison table in Python using pandas to clearly show the impact of your data cleaning steps. Step 1: Set Up Google Colab and Load Data import pandas as pd from google.colab import files uploaded = files.upload() file_name = list(uploaded.keys())[0] df = pd.read_csv(file_name) df_original = df.copy() # Preserve original data Step 2: Apply Your Cleaning Steps Example cleaning: def clean_data(df): df = df.dropna(subset=['country']) for col in df.select_dtypes(include='number').columns: df[col] = df[col].fillna(df[col].median()) for col in df.select_dtypes(include='object').columns: df[col] = df[col].str.strip().str.lower() return df df_clean = clean_data(df) Step 3: Create a Summary Comparison Table Compare key metrics before vs after cleaning. comparison = pd.DataFrame({ "Metric": [ "Row Count", "Column Count", ...

How to Build a Reusable Data Cleaning Pipeline in Python

Image
Learn how to build a reusable data cleaning pipeline in Python using Pandas and Scikit-learn. Create scalable, production-ready workflows for consistent data preprocessing. Why Reusable Pipelines Matter In real-world data engineering, cleaning data once is useless. You need a repeatable system that guarantees consistency across: Training vs production data Multiple datasets Team workflows Using Pandas alone leads to scattered scripts. A pipeline enforces structure, traceability, and reproducibility—critical in any serious ML or analytics workflow. Core Concept: A Pipeline = Ordered Transformations A data cleaning pipeline is simply a sequence of steps: Raw Data → Clean Missing Values → Fix Types → Encode → Output Clean Data Each step should be: Modular Reusable Deterministic Step 1: Define Reusable Cleaning Functions Start by encapsulating logic into functions: def drop_invalid_rows(df): return df[df["income"] > 0] def fill_missing(df): df["age"] = df[...

How to Filter Rows Using Boolean Indexing in Pandas (Afrobarometer Kenya Dataset)

Image
Learn how to filter rows in Pandas using boolean indexing with real Afrobarometer Kenya survey data. Build clean, ML-ready features with practical examples. Why Boolean Indexing Matters in Feature Engineering In Pandas, boolean indexing is how you select the exact slices of data that become features .  In Module 04 of the course, you’ll no longer exploring—you’ll be deciding what the model sees. In this case, with a dataset like Afrobarometer Kenya Round 10 , filtering is how you: Remove invalid survey responses Isolate specific demographic groups Prepare clean subsets for encoding Load and Inspect the Dataset This assumes that you’ve already converted the .sav file and loaded it: import pandas as pd df = pd.read_csv("afrobarometer_kenya_r10.csv") df.head() 1. Filter Valid Survey Responses Survey datasets contain codes like -1 , 8 , 9 for missing or “Don’t know”. df = df[df["trust_president"] >= 0] This keeps only valid numeric responses for trust in the pres...

How to Clean Messy String Columns with .str Methods in Pandas

Image
Learn how to clean messy string columns in pandas using .str methods.  You will learn to standardize text, remove whitespace, fix inconsistencies, and prepare reliable data for analysis with practical, production-ready examples. Messy string data breaks filtering, grouping, and joins. In pandas , string operations are handled through the .str accessor—a vectorized interface for applying string transformations across entire columns efficiently. 1. What .str Is .str provides vectorized string operations on a pandas Series. Instead of looping: [x.lower() for x in df["county"]] Use: df["county"].str.lower() This is faster, scalable, and consistent. 2. Inspect the Column First print(df["county"].unique()) Look for: inconsistent casing → "Nairobi" , "NAIROBI" extra spaces → " Meru " noise characters → "Garissa\n" inconsistent formats → "THARAKANITHI" vs "Tharaka Nithi" 3. Standardize Case Defi...