How to Filter Rows Using Boolean Indexing in Pandas (Afrobarometer Kenya Dataset)
Learn how to filter rows in Pandas using boolean indexing with real Afrobarometer Kenya survey data. Build clean, ML-ready features with practical examples.
Why Boolean Indexing Matters in Feature Engineering
In Pandas, boolean indexing is how you select the exact slices of data that become features.
In Module 04 of the course, you’re no longer exploring; you’re deciding what the model sees.
With a dataset like Afrobarometer Kenya Round 10, filtering is how you:
Remove invalid survey responses
Isolate specific demographic groups
Prepare clean subsets for encoding
Load and Inspect the Dataset
This assumes that you’ve already converted the .sav file and loaded it:
import pandas as pd
df = pd.read_csv("afrobarometer_kenya_r10.csv")
df.head()
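Before filtering, it helps to see which values a column actually holds. The sketch below uses a small toy DataFrame in place of the real file; the column name trust_president and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the survey file; column name and values are hypothetical.
df = pd.DataFrame({"trust_president": [0, 1, 2, 3, -1, 8, 9, 2]})

# Check the dtype first: numeric codes and text labels need different filters.
print(df["trust_president"].dtype)

# Count each distinct value so that non-substantive codes stand out.
counts = df["trust_president"].value_counts().sort_index()
print(counts)
```

Values that sit outside the documented answer scale are the ones to look up in the codebook before writing any filter.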
1. Filter Valid Survey Responses
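Survey datasets often use numeric codes such as -1, 8, or 9 for missing or “Don’t know” responses.
df = df[df["trust_president"] >= 0]
This drops negative codes such as -1. Positive codes such as 8 or 9 still pass a >= 0 check, so exclude them explicitly using the values listed in the codebook.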
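To compare a simple threshold with an explicit code list, here is a minimal sketch on toy data. The values in MISSING_CODES are assumptions; the real codes should come from the survey codebook.

```python
import pandas as pd

# Toy data; the response values and codes are hypothetical.
df = pd.DataFrame({"trust_president": [0, 1, 2, 3, -1, 8, 9, 2]})

# A threshold alone only removes the negative code.
df_threshold = df[df["trust_president"] >= 0]

# An explicit list removes every non-substantive code, positive or negative.
MISSING_CODES = [-1, 8, 9]  # assumed; confirm against the codebook
df_valid = df[~df["trust_president"].isin(MISSING_CODES)]

print(len(df_threshold), len(df_valid))
```

Keeping the code list in one named constant makes it easy to reuse the same rule across several columns.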
2. Filter by Demographics (Gender Example)
Raw values might be "Male" / "Female":
df_female = df[df["gender"] == "Female"]
This subset isolates one group for analysis. The gender column itself can later be encoded as binary (0/1) on the full dataset.
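The comparison inside the brackets is itself a boolean Series, which can serve both as a filter and as a binary feature. A minimal sketch, assuming a hypothetical gender column with the labels "Male" and "Female":

```python
import pandas as pd

# Toy data; column names and labels are hypothetical.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "respondent_id": [1, 2, 3, 4],
})

# The mask is a boolean Series aligned with df's index.
is_female = df["gender"] == "Female"

# Use the mask to select rows...
df_female = df[is_female]

# ...or cast the same mask to 0/1 to get a binary feature on the full frame.
df["gender_female"] = is_female.astype(int)

print(df_female)
print(df)
```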
3. Filter Age Groups Before Conversion
If your dataset stores age as ranges like "18-25":
df_youth = df[df["age_group"] == "18-25"]
You isolate the group before transforming it into numeric features.
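If the isolated group will be modified afterwards, taking an explicit copy avoids pandas' chained-assignment warnings. The sketch below assumes a hypothetical age_group column with "low-high" labels and derives a numeric midpoint as one possible transformation, not a prescribed one.

```python
import pandas as pd

# Toy data; column names and labels are hypothetical.
df = pd.DataFrame({
    "age_group": ["18-25", "26-35", "18-25", "36-45"],
    "respondent_id": [1, 2, 3, 4],
})

# .copy() makes df_youth an independent frame that is safe to modify.
df_youth = df.loc[df["age_group"] == "18-25"].copy()

# Split "18-25" into its two bounds and average them into a numeric feature.
bounds = df_youth["age_group"].str.split("-", expand=True).astype(int)
df_youth["age_midpoint"] = bounds.mean(axis=1)

print(df_youth)
```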
4. Combining Conditions (Real Feature Logic)
Example: Young respondents in Nairobi with valid income:
df_segment = df[
(df["age_group"] == "18-25") &
(df["location"] == "Nairobi") &
(df["income"] > 0)
]
This is how real segmentation happens before feature creation.
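Naming each condition before combining them can make the segment logic easier to read and check. A minimal sketch on toy data, with hypothetical column names and values:

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-35", "18-25"],
    "location": ["Nairobi", "Mombasa", "Nairobi", "Nairobi"],
    "income": [500, 300, 700, -1],
})

# Each mask is a boolean Series; inspect them one at a time if results look off.
is_youth = df["age_group"] == "18-25"
in_nairobi = df["location"] == "Nairobi"
has_income = df["income"] > 0

# & combines the masks element-wise: a row is kept only if all three are True.
df_segment = df[is_youth & in_nairobi & has_income]

print(df_segment)
```

Calling .sum() on any individual mask shows how many rows that condition alone would keep.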
5. Filter Multiple Categories with .isin()
For regional grouping:
df_urban = df[df["location"].isin(["Nairobi", "Mombasa", "Kisumu"])]
Useful before one-hot encoding cities.
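As a sketch of how the .isin() filter feeds one-hot encoding, the example below uses pd.get_dummies on toy data. The extra city names are made up for illustration.

```python
import pandas as pd

# Toy data; the location values are hypothetical.
df = pd.DataFrame({
    "location": ["Nairobi", "Kisumu", "Nakuru", "Mombasa", "Eldoret"],
})

cities = ["Nairobi", "Mombasa", "Kisumu"]

# Keep only rows whose location is one of the listed cities.
df_urban = df[df["location"].isin(cities)]

# One-hot encode the remaining categories: one column per city.
dummies = pd.get_dummies(df_urban["location"], prefix="loc")

print(dummies)
```

Filtering first means the encoded frame only contains columns for the cities you chose to keep.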
6. Remove Invalid or Noisy Data
df_clean = df[~df["education_level"].isin(["Don't know", "Refused"])]
This step is critical: bad inputs produce a bad model.
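The ~ operator inverts a boolean mask, so the same .isin() call can both count the invalid rows and drop them. A minimal sketch with hypothetical labels:

```python
import pandas as pd

# Toy data; the education labels are hypothetical.
df = pd.DataFrame({
    "education_level": ["Primary", "Don't know", "Secondary", "Refused", "University"],
})

invalid_labels = ["Don't know", "Refused"]

# True where the response is one of the invalid labels.
is_invalid = df["education_level"].isin(invalid_labels)

# ~ flips True and False, keeping only the valid rows.
df_clean = df[~is_invalid]

# Record how many rows were dropped so the loss is visible.
n_dropped = int(is_invalid.sum())
print(f"Dropped {n_dropped} of {len(df)} rows")
```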
7. Filter Ordinal Responses
Survey Likert scale:
df_agree = df[df["gov_performance"].isin(["Agree", "Strongly Agree"])]
Later, these categories can be ordinally encoded (e.g., 4 and 5).
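One way to connect the filter to the later encoding step is an explicit label-to-rank mapping. The five labels and their numeric ranks below are assumptions chosen to match the 4 and 5 mentioned above, not the survey's actual coding.

```python
import pandas as pd

# Assumed five-point scale; real labels and ranks come from the codebook.
LIKERT_SCALE = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}

# Toy data; the column name is hypothetical.
df = pd.DataFrame({
    "gov_performance": ["Agree", "Disagree", "Strongly Agree", "Neutral"],
})

# Filter to the agreeing respondents, copying so the subset can be modified.
df_agree = df[df["gov_performance"].isin(["Agree", "Strongly Agree"])].copy()

# Map each label to its rank to get an ordinal feature.
df_agree["gov_performance_code"] = df_agree["gov_performance"].map(LIKERT_SCALE)

print(df_agree)
```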
8. Filtering Before Train/Test Split
You often filter first, then split:
df_model = df[df["income"].notna()]
Only complete cases go into the model pipeline.
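The filter-then-split order can be sketched with plain pandas. Here sample() and drop() stand in for whatever split utility your pipeline uses; the column names, values, and the 75/25 ratio are assumptions.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "income": [100.0, None, 250.0, 300.0, None, 120.0],
    "trust_president": [1, 2, 3, 0, 2, 1],
})

# Step 1: keep only rows with a non-missing income.
df_model = df[df["income"].notna()]

# Step 2: split the filtered rows; random_state makes the split repeatable.
train = df_model.sample(frac=0.75, random_state=0)
test = df_model.drop(train.index)

print(len(df_model), len(train), len(test))
```

Because the filter runs before the split, both parts are drawn from the same set of complete cases.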
Common Mistakes (Critical in ML Pipelines)
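Using and instead of & → raises a ValueError, because and cannot combine Series element-wise
Forgetting parentheses around each condition → wrong logic or an error, because & binds more tightly than comparison operators
Filtering after encoding → inconsistent features
Ignoring survey codes (-1, 99) → silent model corruption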
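The first mistake is easy to reproduce. The sketch below, on toy data with hypothetical columns, shows that Python's and fails on two boolean Series while & with parentheses works:

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({"age": [20, 30, 40], "income": [0, 500, 900]})

# Mistake: `and` asks for a single True/False from a whole Series.
try:
    df[(df["age"] > 25) and (df["income"] > 0)]
    and_raised = False
except ValueError as err:
    and_raised = True
    print("ValueError:", err)

# Correct: & compares element-wise, with each condition in parentheses.
df_ok = df[(df["age"] > 25) & (df["income"] > 0)]
print(df_ok)
```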
Why This Step is Non-Negotiable
Boolean indexing is not just filtering; it is the logic that decides which rows your features are built from.
In a production pipeline (e.g., on Amazon Web Services using S3 + Glue), this step determines:
What data flows into training
What gets discarded
What biases enter your model
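To make those three effects visible, a filter can be wrapped in a small reporting helper. filter_with_report below is a hypothetical function written for this sketch, not part of pandas, and the toy data is made up.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "income": [None, None, 200.0, 300.0, 150.0, None],
})


def filter_with_report(frame, mask, group_col):
    """Apply a boolean mask and report what it kept, dropped, and shifted."""
    kept = frame[mask]
    report = {
        "n_kept": len(kept),
        "n_discarded": len(frame) - len(kept),
        # Group shares before and after show whether the filter skews the sample.
        "share_before": frame[group_col].value_counts(normalize=True).to_dict(),
        "share_after": kept[group_col].value_counts(normalize=True).to_dict(),
    }
    return kept, report


df_model, report = filter_with_report(df, df["income"].notna(), "gender")
print(report)
```

In this toy case, dropping rows with missing income lowers the share of female respondents, which is the kind of shift worth checking before training.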
Final Takeaway
Feature engineering starts with control—and boolean indexing is that control layer.
If you can:
Precisely filter valid data
Segment populations correctly
Remove noise before encoding
You’re not just cleaning data; you’re deciding what the model is able to learn.