How to Deal With Outliers in Census Data — Drop, Cap, or Keep?

Afrobarometer uses Likert scales, categorical codes, and a few continuous variables. 



Most "outliers" here aren't extreme values — they're refusals, don't-knows, and skip patterns encoded as numbers.

Before you touch a single row, understand this: Afrobarometer encodes non-responses as -1, 8, 9, or 98/99, depending on the variable. These will show up as outliers in any statistical check. They are not outliers; they are missing data in disguise.

Recode these sentinel codes to missing (NaN) before any outlier check. Otherwise your IQR and percentile calculations are poisoned by refusal codes.
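A minimal sketch of recoding those non-response codes to NaN, using a small made-up DataFrame in place of the real file. The per-variable code lists in `SENTINELS` and the column `Q_WIDE` are assumptions for illustration. Check the codebook for your round, because 8 or 9 can be a legitimate answer on a wider scale.

```python
import pandas as pd

# Toy stand-in for the survey file (all values invented for illustration).
df = pd.DataFrame({
    "Q56A":   [0, 3, 9, 1, 8, 2, -1],      # 0-3 scale plus non-response codes
    "Q_WIDE": [8, 9, 98, 5, 99, 10, -1],   # hypothetical 0-10 scale; 8 and 9 are real answers
})

# Hypothetical per-variable sentinel codes; confirm each against the codebook.
SENTINELS = {
    "Q56A": [-1, 8, 9],
    "Q_WIDE": [-1, 98, 99],
}

raw_mean = df["Q56A"].mean()

for col, codes in SENTINELS.items():
    # Series.mask replaces values with NaN wherever the condition is True.
    df[col] = df[col].mask(df[col].isin(codes))

clean_mean = df["Q56A"].mean()
print(f"Q56A mean with sentinels: {raw_mean:.2f}, after recoding: {clean_mean:.2f}")
```

Keeping the codes in a per-variable mapping, rather than replacing 8 and 9 across the whole frame, avoids wiping out valid answers on variables where those numbers are real responses.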



After handling sentinels, true impossible values are rare. A trust variable scored 0–3 should have nothing above 3. A poverty frequency variable scored 0–4 should have nothing at 7.

    import numpy as np

    # Q56A: Trust in President, valid range 0 (not at all) to 3 (a lot).
    # Anything outside this after sentinel removal is a data entry error.
    before = df['Q56A'].notna().sum()
    df.loc[~df['Q56A'].between(0, 3), 'Q56A'] = np.nan
    after = df['Q56A'].notna().sum()
    print(f"Q56A: invalidated {before - after} out-of-range values")
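The same check extends to several bounded variables at once. This is a sketch, assuming a hypothetical `VALID_RANGES` mapping that you would build from the codebook. The ranges for Q56A and Q90 are the ones given in this post, and the toy data is invented.

```python
import numpy as np
import pandas as pd

# Toy data after sentinel recoding (values invented for illustration).
df = pd.DataFrame({
    "Q56A": [0.0, 3.0, 5.0, np.nan, 2.0],  # 5 is out of range for a 0-3 scale
    "Q90":  [4.0, 7.0, 0.0, 1.0, np.nan],  # 7 is out of range for a 0-4 scale
})

# Hypothetical mapping of variable -> (lowest valid code, highest valid code).
VALID_RANGES = {"Q56A": (0, 3), "Q90": (0, 4)}

invalidated = {}
for col, (low, high) in VALID_RANGES.items():
    # Out of range means present but outside [low, high]; existing NaNs are not counted.
    bad = df[col].notna() & ~df[col].between(low, high)
    invalidated[col] = int(bad.sum())
    df.loc[bad, col] = np.nan

print(invalidated)
```

Counting per variable makes it easy to report how many values were invalidated, which is worth recording alongside the cleaned data.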


Afrobarometer Kenya records respondent age as a continuous variable. Ages above 110 are recording errors, and the check below also treats anything under 18 as invalid. Ages between 95 and 110 are plausible but rare enough to warrant a note.

    # Q101: Respondent age (continuous)
    # Drop physiologically impossible ages.
    # Capping is not needed: just clean the floor and ceiling.
    print("Age distribution before:")
    print(df['Q101'].describe())
    df = df[df['Q101'].between(18, 110) | df['Q101'].isna()]
    print("\nAge distribution after:")
    print(df['Q101'].describe())
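For the plausible-but-rare ages, one option is to flag rather than drop them. A small sketch, where the column name `age_needs_note` is hypothetical and the toy ages are invented:

```python
import numpy as np
import pandas as pd

# Toy ages after the floor/ceiling filter (values invented for illustration).
df = pd.DataFrame({"Q101": [18, 34, 96, 110, 72, np.nan]})

# Hypothetical flag column: True for ages in the 95-110 band (inclusive).
# Series.between returns False for NaN, so missing ages are never flagged.
df["age_needs_note"] = df["Q101"].between(95, 110)

print(f"Respondents aged 95-110: {df['age_needs_note'].sum()}")
print(df.loc[df["age_needs_note"], "Q101"].tolist())
```

The flagged rows stay in the analysis, and the count gives you a concrete number to mention in your write-up.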


If 40% of Kenyan respondents in Round 9 said they trust the President "not at all" (0), that's not an outlier — that's the finding. Don't smooth it out.

    # Q56A: Check the distribution BEFORE deciding anything
    trust_counts = df['Q56A'].value_counts(normalize=True).sort_index()
    print(trust_counts.map(lambda x: f"{x:.1%}"))
    # Labels for reference:
    # 0 = Not at all    1 = Just a little
    # 2 = Somewhat      3 = A lot

If the 0s dominate, that's a politically meaningful signal specific to Kenya's survey context. Flag it in your markdown, don't drop it.


The Lived Poverty Index (Q90–Q94) asks how often respondents went without food, water, medical care, fuel, and cash. Scores of 4 ("always") are not outliers in a low-income sample; they are the point of the variable.

    # Q90: How often gone without food in past year
    # 0=Never  1=Just once/twice  2=Several times
    # 3=Many times  4=Always
    # Do not cap or drop 4s: they represent chronic deprivation.
    lpi_cols = ['Q90', 'Q91', 'Q92', 'Q93', 'Q94']
    df['lpi_score'] = df[lpi_cols].mean(axis=1)
    print(df['lpi_score'].describe())
    print(f"\nRespondents at max deprivation (score=4): "
          f"{(df['lpi_score'] == 4).sum()}")
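One thing to watch once sentinel codes have become NaN: `mean(axis=1)` skips missing items by default, so a respondent who answered a single item still gets a score. A sketch of one way to guard against that, where the threshold `MIN_ITEMS` is an arbitrary assumption and not an Afrobarometer rule:

```python
import numpy as np
import pandas as pd

lpi_cols = ["Q90", "Q91", "Q92", "Q93", "Q94"]

# Toy responses (values invented for illustration).
df = pd.DataFrame(
    [
        [4, 4, 4, 4, 4],                       # always deprived on every item
        [0, 1, np.nan, 2, 1],                  # one item missing
        [4, np.nan, np.nan, np.nan, np.nan],   # only one item answered
    ],
    columns=lpi_cols,
)

MIN_ITEMS = 4  # assumed minimum number of answered items for a usable score

answered = df[lpi_cols].notna().sum(axis=1)
# Keep the row mean only where enough items were answered; otherwise NaN.
df["lpi_score"] = df[lpi_cols].mean(axis=1).where(answered >= MIN_ITEMS)

print(df[["lpi_score"]].assign(items_answered=answered))
```

Without the guard, the third respondent would count as being at maximum deprivation on the strength of one answer.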


Most variables are bounded ordinal scales. There are no true statistical outliers — only sentinel codes masquerading as data, and genuine extreme responses that reflect Kenya's real conditions. Recode the first. Respect the second.

