How to Filter Rows Using Boolean Indexing in Pandas (Afrobarometer Kenya Dataset)

April 19, 2026

Learn how to filter rows in Pandas using boolean indexing with real Afrobarometer Kenya survey data. Build clean, ML-ready features with practical examples.

Why Boolean Indexing Matters in Feature Engineering

In Pandas, boolean indexing is how you select the exact slices of data that become features.

In Module 04 of the course, you're no longer exploring. You're deciding what the model sees.

With a dataset like Afrobarometer Kenya Round 10, filtering is how you:

Remove invalid survey responses
Isolate specific demographic groups
Prepare clean subsets for encoding

Load and Inspect the Dataset

This assumes you've already converted the .sav file to CSV, so you can load it directly:

import pandas as pd

df = pd.read_csv("afrobarometer_kenya_r10.csv")
df.head()

1. Filter Valid Survey Responses

Survey datasets often use special codes such as -1, 8, or 9 for missing or “Don’t know” responses.

df = df[df["trust_president"] >= 0]

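This keeps only non-negative values for trust in the president, which drops negative codes such as -1. Positive codes such as 8 or 9 still pass this test, so they need to be excluded separately.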
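When the special codes mix negative and positive values, an explicit exclusion list is one option. This is a minimal sketch on toy data: the valid 0-3 scale and the codes -1, 8, and 9 are assumptions for illustration, so check the codebook for the real values.

```python
import pandas as pd

# Toy data: 0-3 is an assumed valid answer scale; -1, 8 and 9 are assumed
# special codes (missing / "Don't know"). Real codes come from the codebook.
df = pd.DataFrame({"trust_president": [0, 3, -1, 8, 2, 9, 1]})

special_codes = [-1, 8, 9]

# ~ negates the mask, so we keep every row whose value is NOT a special code.
valid = df[~df["trust_president"].isin(special_codes)]

print(valid["trust_president"].tolist())
```

Keeping the codes in a named list makes it easy to reuse the same rule for other survey questions.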

2. Filter by Demographics (Gender Example)

Raw values might be "Male" / "Female":

df_female = df[df["gender"] == "Female"]

This subset can later be encoded into binary (0/1).
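The same boolean mask can do both jobs: select rows and become the 0/1 feature. A small sketch with made-up rows; the `is_female` column name is a hypothetical choice, not something from the dataset.

```python
import pandas as pd

# Toy rows standing in for the survey data.
df = pd.DataFrame({"gender": ["Female", "Male", "Female", "Male"]})

# The comparison returns a boolean Series aligned with df's index.
is_female = df["gender"] == "Female"

# Use the mask to select rows...
df_female = df[is_female]

# ...or cast it to integers to get a binary feature (True -> 1, False -> 0).
df["is_female"] = is_female.astype(int)

print(df)
```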

3. Filter Age Groups Before Conversion

Age may be stored as ranges like "18-25" in your converted file:

df_youth = df[df["age_group"] == "18-25"]

You isolate the group before transforming it into numeric features.

4. Combining Conditions (Real Feature Logic)

Example: Young respondents in Nairobi with valid income:

df_segment = df[
    (df["age_group"] == "18-25") &
    (df["location"] == "Nairobi") &
    (df["income"] > 0)
]

This is how real segmentation happens before feature creation.
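Naming each condition before combining them can make the logic easier to read and debug. The sketch below uses invented rows and treats -1 as an assumed "missing income" code.

```python
import pandas as pd

# Toy rows; -1 in income is an assumed code for a missing answer.
df = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-35", "18-25"],
    "location": ["Nairobi", "Mombasa", "Nairobi", "Nairobi"],
    "income": [1200, 900, 1500, -1],
})

# One named mask per condition.
is_youth = df["age_group"] == "18-25"
in_nairobi = df["location"] == "Nairobi"
has_income = df["income"] > 0

# Combine with & (element-wise AND); no extra parentheses are needed
# because each mask is already a finished boolean Series.
df_segment = df[is_youth & in_nairobi & has_income]

# Summing a boolean mask counts how many rows pass that condition.
print(is_youth.sum(), in_nairobi.sum(), has_income.sum(), len(df_segment))
```

Checking the count behind each mask shows which condition is shrinking the segment the most.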

5. Filter Multiple Categories with `.isin()`

For regional grouping:

df_urban = df[df["location"].isin(["Nairobi", "Mombasa", "Kisumu"])]

Useful before one-hot encoding cities.
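To see why filtering first helps, here is a rough sketch that one-hot encodes only the selected cities with `pd.get_dummies`. The extra towns in the toy data and the `city` prefix are illustrative assumptions.

```python
import pandas as pd

# Toy rows; "Nakuru" and "Eldoret" stand in for locations outside the list.
df = pd.DataFrame(
    {"location": ["Nairobi", "Nakuru", "Mombasa", "Kisumu", "Eldoret"]}
)

cities = ["Nairobi", "Mombasa", "Kisumu"]
df_urban = df[df["location"].isin(cities)]

# One column per remaining city; filtered-out towns never become features.
dummies = pd.get_dummies(df_urban["location"], prefix="city")

print(dummies.columns.tolist())
```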


6. Remove Invalid or Noisy Data

df_clean = df[~df["education_level"].isin(["Don't know", "Refused"])]

This step is critical: "Don't know" and "Refused" are not education levels, and leaving them in means the model learns from noise.
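It can help to record how many rows a filter drops, so the cleaning step stays auditable. A minimal sketch on invented values:

```python
import pandas as pd

# Toy rows mixing real education levels with non-answers.
df = pd.DataFrame({
    "education_level": [
        "Primary", "Don't know", "Secondary", "Refused", "University",
    ]
})

# Build the mask once so it can be both counted and negated.
invalid = df["education_level"].isin(["Don't know", "Refused"])
n_dropped = int(invalid.sum())

# .copy() gives an independent DataFrame, which avoids pandas warnings
# if new columns are added to the filtered result later.
df_clean = df[~invalid].copy()

print(f"dropped {n_dropped} of {len(df)} rows")
```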


7. Filter Ordinal Responses

Survey Likert scale:

df_agree = df[df["gov_performance"].isin(["Agree", "Strongly Agree"])]

Later, these labels can be ordinal-encoded (e.g., Agree = 4, Strongly Agree = 5).
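One way to do that encoding is a plain dictionary passed to `.map()`. The five-point scale below, including its labels and numbers, is an assumption for illustration; the real question may use different wording.

```python
import pandas as pd

# Assumed five-point Likert scale (hypothetical labels and codes).
scale = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}

df = pd.DataFrame(
    {"gov_performance": ["Agree", "Disagree", "Strongly Agree", "Neutral"]}
)

# Filter first, then encode the rows that remain.
df_agree = df[df["gov_performance"].isin(["Agree", "Strongly Agree"])].copy()
df_agree["gov_performance_code"] = df_agree["gov_performance"].map(scale)

print(df_agree)
```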

8. Filtering Before Train/Test Split

You often filter first, then split:

df_model = df[df["income"].notna()]

Only rows with a recorded income go into the model pipeline.
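A sketch of the filter-then-split order using only Pandas. The 80/20 ratio, the random seed, and the toy values are assumptions; a library splitter such as scikit-learn's `train_test_split` would be a common alternative.

```python
import pandas as pd

# Toy rows; None marks a missing income.
df = pd.DataFrame({
    "income": [1200.0, None, 900.0, 1500.0, None, 700.0, 1100.0],
    "trust_president": [2, 1, 3, 0, 2, 1, 3],
})

# Step 1: filter. Keep only rows where income is present.
df_model = df[df["income"].notna()]

# Step 2: split. Sample 80% of the filtered rows for training
# and use the remaining rows as the test set.
train = df_model.sample(frac=0.8, random_state=42)
test = df_model.drop(train.index)

print(len(df_model), len(train), len(test))
```

Because the split happens after the filter, neither set contains rows with missing income.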


Common Mistakes (Critical in ML Pipelines)

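Using and instead of & → raises ValueError, because and needs a single True/False and a Series has many
Forgetting parentheses → & binds more tightly than == or >, so the conditions are grouped wrongly and usually raise an error
Filtering after encoding → inconsistent features
Ignoring survey codes (-1, 99) → silent model corruption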
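The first mistake is easy to reproduce. This short sketch, on made-up rows, shows the error from `and` next to the working `&` version:

```python
import pandas as pd

df = pd.DataFrame({"age_group": ["18-25", "26-35"], "income": [1200, 0]})

error_message = None
try:
    # `and` asks each Series for a single True/False value,
    # which pandas refuses to guess.
    df[(df["age_group"] == "18-25") and (df["income"] > 0)]
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# `&` compares element by element; each condition sits in its own parentheses.
ok = df[(df["age_group"] == "18-25") & (df["income"] > 0)]
print(len(ok))
```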

Why This Step is Non-Negotiable

Boolean indexing is not just filtering. It decides which rows your model learns from.

In a production pipeline (e.g., on Amazon Web Services using S3 + Glue), this step determines:

What data flows into training
What gets discarded
What biases enter your model

Final Takeaway

Feature engineering starts with control—and boolean indexing is that control layer.

If you can:

Precisely filter valid data
Segment populations correctly
Remove noise before encoding

You’re not just cleaning data—you’re defining model intelligence.
