How to Spot Skewed Distributions Using Poverty and Inequality Platform (PIP) Data

Income and poverty datasets are rarely evenly distributed.

In most countries, a large share of the population earns relatively low incomes, while a much smaller group earns significantly more.

This creates what analysts call a skewed distribution.

Understanding skewness is important in economic analysis because it affects:

  • averages,

  • forecasts,

  • statistical models,

  • and policy interpretation.

In this tutorial, we will use poverty and inequality data to identify skewed distributions and learn how to handle them properly using Python and Pandas.

We will assume the dataset comes from the World Bank Poverty and Inequality Platform (PIP).


Load the Dataset

First, upload the CSV dataset. The snippet below assumes you are working in Google Colab.

from google.colab import files
uploaded = files.upload()



Example columns may include:

  • country

  • year

  • income

  • poverty_rate

  • gini_index

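For this tutorial, the code focuses on the poverty rate column, which appears in the example file as ' Poverty rate (%)' (note the leading space in the header).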
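Headers in exported CSV files do not always match tidy names like the ones listed above; the code later in this tutorial, for example, refers to a column whose name begins with a space. The sketch below shows one way to inspect and clean such headers. The tiny CSV string and its values are invented for illustration and are not PIP data.

```python
import io
import pandas as pd

# Hypothetical two-row CSV used only for illustration; the numbers are placeholders.
raw_csv = "Year, Poverty rate (%)\n2000,10.0\n2001,9.5\n"

sample = pd.read_csv(io.StringIO(raw_csv))
print(list(sample.columns))  # the second header keeps its leading space

# Strip leading and trailing whitespace from every column name
sample.columns = sample.columns.str.strip()
print(list(sample.columns))
```

The rest of this tutorial keeps the original header, so apply this cleanup to your own DataFrame only if you also update the column name in the later steps.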


Why Income Data Is Usually Skewed

Income data is almost always right-skewed because:

  • most people earn moderate or low incomes,

  • while a smaller percentage earns extremely high incomes.

This causes a long tail on the right side of the distribution.

A few very high-income observations can heavily distort averages.
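To make that distortion concrete, here is a small sketch using only the Python standard library. The income figures are made up for illustration.

```python
from statistics import mean, median

# Hypothetical daily incomes in dollars, invented for illustration
incomes = [4, 5, 6, 7, 8, 9, 10]

# The same group with one very high earner added
with_top_earner = incomes + [1000]

print("Without top earner:", mean(incomes), median(incomes))
print("With top earner:   ", mean(with_top_earner), median(with_top_earner))
```

Adding a single extreme value moves the mean from 7 to about 131, while the median only shifts from 7 to 7.5.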


Visualize the Distribution

The fastest way to detect skewness is with a histogram.

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('IDN_PovertyRate_20260324_2021_01_02_PROD_2026-05-11.csv')

print(df.columns)

df[' Poverty rate (%)'].hist(bins=40)

plt.xlabel('Poverty rate (%)')
plt.ylabel('Frequency')
plt.title('Distribution of Poverty rate (%)')

plt.show()


If the histogram shows:

  • most values clustered on the left,

  • and a long tail stretching right,

then the data is positively skewed.
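The same visual check can be approximated numerically by looking at the bin counts directly. The sketch below uses a synthetic log-normal sample (an assumption chosen because it is right-skewed, not PIP data) and compares how many observations fall in the left and right halves of the value range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed sample for illustration only
values = rng.lognormal(mean=1.0, sigma=0.8, size=2000)

# Same number of bins as the histogram in this tutorial
counts, edges = np.histogram(values, bins=40)

peak_bin = int(np.argmax(counts))      # index of the tallest bar
left_half = int(counts[:20].sum())     # observations in the lower half of the range
right_half = int(counts[20:].sum())    # observations in the upper half of the range

print("Tallest bin index:", peak_bin)
print("Left half:", left_half, "Right half:", right_half)
```

A tallest bin near index 0 together with a much larger left-half count matches the "clustered on the left, long tail on the right" pattern.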



Compare Mean and Median

Another strong indicator of skewness is the relationship between the mean and median.

mean_poverty_rate = df[' Poverty rate (%)'].mean()
median_poverty_rate = df[' Poverty rate (%)'].median()

print("Mean Poverty Rate:", mean_poverty_rate)
print("Median Poverty Rate:", median_poverty_rate)



In right-skewed data:

  • the mean becomes larger than the median,

  • because extreme high-income values pull the average upward.

This is why economists often prefer medians when discussing household income.


Calculate Skewness Numerically

Pandas provides a direct skewness calculation.

print(df[' Poverty rate (%)'].skew())


Interpretation:

Skewness Value      Meaning
Near 0              Symmetric distribution
Greater than 1      Strong right skew
Less than -1        Strong left skew

Income datasets often produce skewness values well above 1.

In this example, however, the poverty rate column comes out approximately symmetric, with a skewness value near 0.
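The interpretation table can be turned into a small helper. This is a sketch under stated assumptions: the function name describe_skew and the 0.5 cut-off for "near 0" are hypothetical choices, since the table does not define an exact threshold, and the two samples are synthetic rather than PIP data.

```python
import numpy as np
import pandas as pd

def describe_skew(value, near_zero=0.5):
    """Map a skewness value to a label based on the table above.

    near_zero is an assumed cut-off for "near 0"; the table does not define one.
    """
    if value > 1:
        return "strong right skew"
    if value < -1:
        return "strong left skew"
    if abs(value) <= near_zero:
        return "approximately symmetric"
    return "moderate skew"

rng = np.random.default_rng(42)

# Synthetic samples for illustration only
right_skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))
symmetric = pd.Series(rng.normal(loc=10.0, scale=2.0, size=5000))

for name, series in [("log-normal", right_skewed), ("normal", symmetric)]:
    skew = series.skew()
    print(name, round(skew, 2), describe_skew(skew))
```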


Why Skewed Income Data Matters

Skewed distributions can create misleading conclusions.

For example:

  • average income may appear high,

  • while most people actually earn much less.

This is common in:

  • national income reports,

  • wealth analysis,

  • and poverty studies.

A small wealthy population can distort national averages significantly.


Use the Median Instead of the Mean

The median is less sensitive to extreme outliers.

median_poverty_rate = df[' Poverty rate (%)'].median()

print(median_poverty_rate)


In inequality analysis, median income often provides a more realistic representation of living standards.


Apply a Log Transformation

Log transformations reduce skewness by compressing large values.

import numpy as np
import matplotlib.pyplot as plt

# Use ' Poverty rate (%)' as the column for transformation.
# np.log1p is used instead of np.log to gracefully handle potential zero values.
df['log_poverty_rate'] = np.log1p(df[' Poverty rate (%)'])

# Now visualize the transformed data:
df['log_poverty_rate'].hist(bins=40)

plt.xlabel('Log Poverty Rate')
plt.ylabel('Frequency')
plt.title('Log-Transformed Poverty Rate Distribution')

plt.show()



The distribution should appear more balanced and easier to analyze statistically.
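One way to confirm that the transformation helped is to compare skewness before and after. The sketch below does this on a synthetic log-normal sample (an assumption for illustration, not PIP data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic right-skewed values for illustration only
raw = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

# log1p computes log(1 + x), so zeros map to zero instead of -inf
logged = np.log1p(raw)

skew_before = raw.skew()
skew_after = logged.skew()

print("Skewness before:", round(skew_before, 2))
print("Skewness after: ", round(skew_after, 2))
```

If the skewness of your own column is already close to 0, a log transformation is unlikely to help and may introduce left skew instead.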


Detect Extreme Outliers

Very high-income observations can distort models and visualizations.

You can detect outliers using the Interquartile Range (IQR) method.

q1 = df[' Poverty rate (%)'].quantile(0.25)
q3 = df[' Poverty rate (%)'].quantile(0.75)

iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values that fall outside either IQR fence
outliers = df[(df[' Poverty rate (%)'] < lower) | (df[' Poverty rate (%)'] > upper)]

print(outliers.head())



In economic datasets, these extreme values may represent:

  • ultra-high earners,

  • reporting anomalies,

  • or genuine inequality effects.
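The IQR rule can also be wrapped in a reusable function. This is a minimal sketch: iqr_bounds is a hypothetical helper name, and the values are made up so the result is easy to verify by hand.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values for illustration only
values = pd.Series([1, 2, 3, 4, 5, 100])
lower, upper = iqr_bounds(values)

# Keep observations that fall outside either fence
flagged = values[(values < lower) | (values > upper)]

print("Fences:", lower, upper)
print("Flagged:", flagged.tolist())
```

Flagged values are candidates for review, not automatic deletions, since they may reflect genuine inequality rather than errors.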


Real-World Interpretation

Suppose:

  • most households earn between $2 and $20 daily,

  • while a small percentage earns hundreds or thousands of dollars per day.

The distribution becomes highly right-skewed.

Without addressing skewness:

  • averages become inflated,

  • regressions become unstable,

  • and poverty trends become harder to interpret accurately.

This is why economists frequently:

  • use medians,

  • apply logarithmic scaling,

  • and analyze percentiles instead of raw averages.
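As a sketch of the percentile approach, the example below summarizes a synthetic income sample (log-normal, invented for illustration and not PIP data) with the 10th, 50th, and 90th percentiles and compares them with the mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic daily incomes for illustration only
incomes = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=5000))

# 10th, 50th (median), and 90th percentiles
p10, p50, p90 = incomes.quantile([0.10, 0.50, 0.90])

print("P10:", round(p10, 2))
print("P50:", round(p50, 2))
print("P90:", round(p90, 2))
print("P90/P10 ratio:", round(p90 / p10, 2))
print("Mean:", round(incomes.mean(), 2))
```

In a right-skewed sample like this one, the mean sits above the median, and the P90/P10 ratio gives a compact view of how spread out the distribution is.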


Skewed distributions are extremely common in poverty and inequality analysis. Learning to identify them is a foundational skill in:

  • data analysis,

  • economics,

  • public policy,

  • and machine learning.

Before building models or publishing conclusions, always inspect your data distribution first. A simple histogram can reveal important patterns that averages alone may hide.



