How to Filter Rows Using Boolean Indexing in Pandas (Afrobarometer Kenya Dataset)

Learn how to filter rows in Pandas using boolean indexing with real Afrobarometer Kenya survey data. Build clean, ML-ready features with practical examples.



Why Boolean Indexing Matters in Feature Engineering

In Pandas, boolean indexing is how you select the exact rows of data that your features are built from.

In Module 04 of the course, you're no longer exploring; you're deciding what the model sees.

With a dataset like Afrobarometer Kenya Round 10, filtering is how you:

  • Remove invalid survey responses

  • Isolate specific demographic groups

  • Prepare clean subsets for encoding


Load and Inspect the Dataset

This assumes that you've already converted the .sav file to CSV, so it can be loaded directly:

import pandas as pd

df = pd.read_csv("afrobarometer_kenya_r10.csv")
df.head()



1. Filter Valid Survey Responses

Survey datasets often use numeric codes such as -1, 8, or 9 for missing, refused, or "Don't know" answers. Check the codebook for the exact codes used by each question.

df = df[df["trust_president"] >= 0]

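This keeps only non-negative values for trust in the president. Note that a >= 0 comparison removes negative codes such as -1 only; positive codes such as 8 or 9 pass through and have to be excluded explicitly.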
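Since a >= 0 comparison lets positive codes through, one way to drop them is to list the non-response codes and negate an .isin() mask. This is a minimal sketch on toy data: the code list [-1, 8, 9] and the toy values are assumptions for illustration, so confirm the real codes in the codebook before using it.

```python
import pandas as pd

# Toy data; the values and the code list below are illustrative assumptions.
toy = pd.DataFrame({"trust_president": [0, 1, 2, 3, -1, 8, 9, 2]})

# Codes treated as non-substantive here (assumed; confirm in the codebook).
NON_RESPONSE_CODES = [-1, 8, 9]

# ~ negates the mask, so we keep rows whose value is NOT a non-response code.
valid = toy[~toy["trust_president"].isin(NON_RESPONSE_CODES)]

print(valid["trust_president"].tolist())
```

Keeping the codes in one named list makes it easier to reuse the same rule across several survey questions.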




2. Filter by Demographics (Gender Example)

Raw values might be "Male" / "Female":

df_female = df[df["gender"] == "Female"]

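The same comparison can later be used to encode gender as a binary (0/1) feature across the full dataset; the subset itself is useful for group-specific analysis.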
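To make the link between the filter and the binary feature concrete, here is a small sketch on toy data. The column name is_female is a hypothetical name chosen for this example.

```python
import pandas as pd

# Toy data standing in for the survey's gender column.
toy = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male"]})

# The comparison yields a boolean Series (the mask), one value per row.
is_female = toy["gender"] == "Female"

# Casting the mask to int gives a 0/1 feature for the whole dataset.
toy["is_female"] = is_female.astype(int)

# The same mask still selects the subset when you need it.
toy_female = toy[is_female]

print(toy)
```

The mask is computed once and reused, so the encoded column and the subset always agree.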




3. Filter Age Groups Before Conversion

If your converted file stores age as ranges like "18-25":

df_youth = df[df["age_group"] == "18-25"]

You isolate the group before transforming it into numeric features.
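If age is stored as a number instead of a range, the same group can be selected with .between(). This sketch assumes a hypothetical numeric column named age; the toy values are made up.

```python
import pandas as pd

# Toy data; "age" as a numeric column is an assumption for this example.
toy = pd.DataFrame({"age": [17, 18, 22, 25, 26, 40]})

# between() is inclusive on both ends by default, so 18 and 25 are kept.
youth = toy[toy["age"].between(18, 25)]

print(youth["age"].tolist())
```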



4. Combining Conditions (Real Feature Logic)

Example: Young respondents in Nairobi with valid income:

df_segment = df[
    (df["age_group"] == "18-25") &
    (df["location"] == "Nairobi") &
    (df["income"] > 0)
]

Each condition sits in its own parentheses, and & keeps only the rows where all three are true. This is how segmentation happens before feature creation.
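Long combined filters are easier to read when each condition gets its own named mask. The sketch below uses toy data with the same column names as the example above; it also shows | (OR) and ~ (NOT), which follow the same rules as &.

```python
import pandas as pd

# Toy data mirroring the columns used in the segment example.
toy = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-35", "18-25"],
    "location": ["Nairobi", "Mombasa", "Nairobi", "Nairobi"],
    "income": [1200, 800, 1500, 0],
})

# One named mask per condition.
is_youth = toy["age_group"] == "18-25"
in_nairobi = toy["location"] == "Nairobi"
has_income = toy["income"] > 0

# & = AND, | = OR, ~ = NOT (all applied element by element).
segment = toy[is_youth & in_nairobi & has_income]
youth_or_nairobi = toy[is_youth | in_nairobi]
not_nairobi = toy[~in_nairobi]

print(segment)
```

Named masks can be reused in later steps, and each one can be checked on its own (for example with .sum() to count matching rows).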


5. Filter Multiple Categories with .isin()

For regional grouping:

df_urban = df[df["location"].isin(["Nairobi", "Mombasa", "Kisumu"])]

Useful before one-hot encoding cities.
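As a rough sketch of filtering first and one-hot encoding second, the example below keeps three cities and then calls pd.get_dummies. The toy values (including "Nakuru") and the city prefix are assumptions for illustration.

```python
import pandas as pd

# Toy data; "Nakuru" is included only to show a row being filtered out.
toy = pd.DataFrame(
    {"location": ["Nairobi", "Kisumu", "Nakuru", "Mombasa", "Nairobi"]}
)

cities = ["Nairobi", "Mombasa", "Kisumu"]

# Keep only rows whose location is one of the listed cities.
urban = toy[toy["location"].isin(cities)]

# One-hot encode the remaining categories; one column per city kept.
dummies = pd.get_dummies(urban["location"], prefix="city")

print(dummies.columns.tolist())
```

Because the filter runs before encoding, no column is created for a city you never intended to model.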



6. Remove Invalid or Noisy Data

df_clean = df[~df["education_level"].isin(["Don't know", "Refused"])]

This step is critical: invalid inputs produce an unreliable model.
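When a cleaned subset will get new columns later, it can help to take an explicit copy and record how many rows were dropped. This is a sketch on toy data; the has_tertiary column is a hypothetical feature added only to show the pattern.

```python
import pandas as pd

# Toy data with two non-substantive answers mixed in.
toy = pd.DataFrame({
    "education_level": [
        "Primary", "Don't know", "Secondary", "Refused", "University"
    ]
})

invalid = ["Don't know", "Refused"]
mask = ~toy["education_level"].isin(invalid)

# .copy() gives an independent DataFrame, so later column assignments
# act on the cleaned data rather than on a slice of the original.
clean = toy[mask].copy()

# Record how many rows the filter removed.
dropped = len(toy) - len(clean)

# Hypothetical feature added after cleaning.
clean["has_tertiary"] = clean["education_level"] == "University"

print(f"Dropped {dropped} of {len(toy)} rows")
```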





7. Filter Ordinal Responses

For a Likert-scale survey item:

df_agree = df[df["gov_performance"].isin(["Agree", "Strongly Agree"])]

Later, these labels can be ordinal-encoded (e.g., Agree = 4, Strongly Agree = 5).
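One way to do that encoding is to map each label to a number and then filter on the numeric score. The five labels and their 1 to 5 ordering below are assumptions for this sketch; use the wording and order from the actual questionnaire.

```python
import pandas as pd

# Toy responses; the labels are assumed, not taken from the real codebook.
toy = pd.DataFrame({
    "gov_performance": [
        "Agree", "Strongly Disagree", "Neutral", "Strongly Agree", "Disagree"
    ]
})

# Assumed five-point ordering from least to most agreement.
scale = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}

# map() replaces each label with its score; unknown labels become NaN.
toy["gov_performance_score"] = toy["gov_performance"].map(scale)

# With a numeric score, "Agree or stronger" is a simple comparison.
agree = toy[toy["gov_performance_score"] >= 4]

print(agree)
```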




8. Filtering Before Train/Test Split

You often filter first, then split:

df_model = df[df["income"].notna()]

Only complete cases go into the model pipeline.
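Here is a minimal sketch of the filter-then-split order on toy data. To stay dependency-free it splits with plain Pandas (.sample() and .drop()) rather than scikit-learn's train_test_split, which is the more common choice in practice; the 80/20 ratio and the random seed are arbitrary assumptions.

```python
import pandas as pd

# Toy data: 12 rows, 2 of them with missing income.
toy = pd.DataFrame({
    "income": [100.0, None, 250.0, 300.0, None, 120.0,
               90.0, 410.0, 75.0, 60.0, 500.0, 330.0]
})

# Step 1: filter to complete cases.
model_df = toy[toy["income"].notna()]

# Step 2: split. sample() draws 80% of the rows at random;
# the rows that were not drawn become the test set.
train = model_df.sample(frac=0.8, random_state=42)
test = model_df.drop(train.index)

print(len(train), len(test))
```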




Common Mistakes (Critical in ML Pipelines)

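  • Using and instead of & → raises a ValueError, because a Series has no single truth value

  • Forgetting parentheses around each condition → & binds tighter than comparisons, so you get an error or the wrong logic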

  • Filtering after encoding → inconsistent features

  • Ignoring survey codes (-1, 99) → silent model corruption
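The first two mistakes are easy to reproduce. The sketch below uses toy data and catches the errors so the script keeps running; the column names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"age": [20, 30, 22], "income": [0, 500, 700]})

# Mistake 1: `and` asks for a single True/False from a whole Series.
try:
    toy[(toy["age"] < 25) and (toy["income"] > 0)]
    and_failed = False
except ValueError:
    and_failed = True

# Mistake 2: without parentheses, & binds tighter than the comparisons,
# so this parses as: toy["age"] < (25 & toy["income"]) > 0
try:
    toy[toy["age"] < 25 & toy["income"] > 0]
    parens_failed = False
except ValueError:
    parens_failed = True

# Correct form: parentheses around each condition, joined with &.
correct = toy[(toy["age"] < 25) & (toy["income"] > 0)]

print(and_failed, parens_failed, correct.index.tolist())
```

Here the missing parentheses happen to raise an error; with other expressions they can evaluate without complaint and return the wrong rows, which is harder to spot.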



Why This Step is Non-Negotiable

Boolean indexing is not just filtering—it’s feature selection logic.

In a production pipeline (e.g., on Amazon Web Services using S3 + Glue), this step determines:

  • What data flows into training

  • What gets discarded

  • What biases enter your model
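A simple way to see what a filter discards, and whether it shifts the makeup of the sample, is to compare group shares before and after. This is a sketch on made-up data; the columns and values are assumptions for illustration.

```python
import pandas as pd

# Toy data: missing income is concentrated among female respondents.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "income": [100.0, 200.0, None, 150.0, None, 300.0],
})

# The filter under review: keep complete cases only.
kept = toy[toy["income"].notna()]

# Share of each group before and after the filter.
share_before = toy["gender"].value_counts(normalize=True)
share_after = kept["gender"].value_counts(normalize=True)

# Side-by-side view, aligned on the group label.
audit = pd.DataFrame({"before": share_before, "after": share_after})

print(audit)
print(f"Rows discarded: {len(toy) - len(kept)}")
```

In this toy case the filter moves the female share from half of the sample to a quarter, which is the kind of shift worth checking before training.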



Final Takeaway

Feature engineering starts with control—and boolean indexing is that control layer.

If you can:

  • Precisely filter valid data

  • Segment populations correctly

  • Remove noise before encoding


then you're not just cleaning data; you're defining what the model can learn.

