How to Deal With Outliers in Census Data — Drop, Cap, or Keep?
Afrobarometer uses Likert scales, categorical codes, and a few continuous variables.
Most "outliers" here aren't extreme values — they're refusals, don't-knows, and skip patterns encoded as numbers.
Before you touch a single row, understand this: Afrobarometer encodes non-responses as -1, 8, 9, and 98/99 depending on the variable. These will show up as outliers in any statistical check. They are not — they are missing data in disguise.
Recode these sentinel codes to missing before any outlier check. Otherwise your IQR and percentile calculations are poisoned by refusal codes.
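A minimal sketch of that recoding step, assuming pandas. The column names (`trust_president`, `occupation`) and the per-column code lists are hypothetical; which codes mean "missing" differs by variable and round, so take them from the codebook rather than from this example.

```python
import pandas as pd

# Hypothetical sentinel map: one list of non-response codes per column.
# Check the codebook for your round before reusing any of these.
SENTINELS = {
    "trust_president": [-1, 8, 9],   # assumed single-digit scale
    "occupation": [-1, 98, 99],      # assumed two-digit categorical code
}

def recode_sentinels(df, sentinels):
    """Return a copy of df with each column's sentinel codes set to NaN."""
    out = df.copy()
    for col, codes in sentinels.items():
        # mask() replaces values where the condition is True with NaN
        out[col] = out[col].mask(out[col].isin(codes))
    return out

# Toy rows invented for illustration, not real survey data
raw = pd.DataFrame({
    "trust_president": [0, 3, 9, 1, -1, 8],
    "occupation": [4, 99, 12, 98, 7, 1],
})
clean = recode_sentinels(raw, SENTINELS)
print(clean.isna().sum())
```

Keeping the map per column matters because the same number can be a valid answer on one variable and a refusal code on another: 8 is a sentinel on a 0–3 scale but could be a legitimate category on a two-digit one.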
After handling sentinels, true impossible values are rare. A trust variable scored 0–3 should have nothing above 3. A poverty frequency variable scored 0–4 should have nothing at 7.
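One way to look for impossible values is a simple range check run after the sentinel codes have been recoded to missing. This is a sketch under assumptions: the column names are hypothetical, and the valid ranges are the 0–3 and 0–4 scales described above.

```python
import pandas as pd

# Assumed valid ranges per column (hypothetical column names)
VALID_RANGES = {"trust_president": (0, 3), "poverty_food": (0, 4)}

def out_of_range(df, ranges):
    """Return {column: rows whose non-missing value falls outside [lo, hi]}."""
    bad = {}
    for col, (lo, hi) in ranges.items():
        values = df[col]
        # Missing values are skipped: they were handled in the sentinel step
        mask = values.notna() & ~values.between(lo, hi)
        if mask.any():
            bad[col] = df.loc[mask, [col]]
    return bad

# Toy rows invented for illustration, not real survey data
df = pd.DataFrame({
    "trust_president": [0, 3, 2, None, 5],
    "poverty_food": [4, 0, 7, 1, None],
})
problems = out_of_range(df, VALID_RANGES)
for col, rows in problems.items():
    print(col, rows[col].tolist())
```

The function only reports offending rows. Whether to set them to missing or trace them back to a data-entry problem is a decision to make per variable.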
Afrobarometer Kenya records respondent age as a continuous variable. Ages above 110 are recording errors. Ages from 95 to 110 are plausible but rare enough to warrant a note.
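The age rule can be sketched as a three-way flag rather than a filter, so the rare-but-plausible ages stay in the data with a note attached. The thresholds come from the rule above; the function name `flag_age` and the labels are hypothetical, and the sketch assumes sentinel codes have already been recoded to missing.

```python
import numpy as np
import pandas as pd

def flag_age(age):
    """Label each age as 'missing', 'error' (>110), 'review' (95-110), or 'ok'."""
    # np.select uses the first matching condition, so order matters here
    conditions = [age.isna(), age > 110, age >= 95]
    labels = ["missing", "error", "review"]
    return pd.Series(np.select(conditions, labels, default="ok"), index=age.index)

# Toy values invented for illustration, not real survey data
ages = pd.Series([18, 95, 110, 111, 42, np.nan])
flags = flag_age(ages)

# Only recording errors are set to missing; 'review' rows are kept as-is
ages_clean = ages.mask(flags == "error")
print(pd.DataFrame({"age": ages, "flag": flags, "age_clean": ages_clean}))
```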
If 40% of Kenyan respondents in Round 9 said they trust the President "not at all" (0), that's not an outlier — that's the finding. Don't smooth it out.
If the 0s dominate, that's a politically meaningful signal specific to Kenya's survey context. Flag it in your markdown; don't drop it.
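Checking whether one category dominates takes a frequency table, not an outlier test. A minimal sketch, assuming a hypothetical `trust_president` column on a 0–3 scale; the values below are invented to show the mechanics and are not Afrobarometer results.

```python
import pandas as pd

# Toy responses invented for illustration (0 = "not at all" ... 3 = top of scale)
trust = pd.Series([0, 0, 1, 3, 0, 2, 0, 1, 3, 0], name="trust_president")

# Share of respondents in each category, ordered by scale value
shares = trust.value_counts(normalize=True).sort_index()
print(shares)

# Report the modal category instead of trimming it
top_code = shares.idxmax()
print(f"Most common response: {top_code} ({shares.max():.0%} of valid answers)")
```

A skewed table like this goes into the write-up as a finding. Nothing in the data is changed.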
The Lived Poverty Index (Q90–Q95) asks how often respondents went without food, water, medical care, fuel, and cash. Scores of 4 ("always") are not outliers in a low-income sample — they are the point of the variable.
Most variables are bounded ordinal scales. There are no true statistical outliers — only sentinel codes masquerading as data, and genuine extreme responses that reflect Kenya's real conditions. Recode the first. Respect the second.