How to Use groupby() to Summarise African Health Data by Region (Using Real World Bank Data)
One of the most important skills in data analytics is learning how to summarize large datasets into meaningful insights.
Raw datasets are often too detailed to interpret directly. Analysts therefore use aggregation techniques to answer higher-level questions such as:
Which region performs best?
Which countries have the lowest health outcomes?
What patterns exist across geographic areas?
In Python’s Pandas library, the groupby() function is one of the most powerful tools for this type of analysis.
In this tutorial, you will use real health data from the World Bank to summarize African life expectancy data by region using Google Colab.
We will answer a practical question:
Which African regions have the highest average life expectancy?
This is a realistic use case because regional summaries are commonly used in:
public health reporting,
NGO dashboards,
government policy analysis,
and healthcare research.
Step 1 — Download the Dataset
Go to the World Bank Life Expectancy dataset.
Download the CSV file.
This indicator measures: Life expectancy at birth (years)
Step 2 — Upload the File to Google Colab
Start by uploading the dataset:
import pandas as pd
from google.colab import files
uploaded = files.upload()
Select the CSV file you downloaded from the World Bank.
Step 3 — Load the Dataset
World Bank files contain metadata rows at the top, so use skiprows=4.
file_name = list(uploaded.keys())[0]
df = pd.read_csv(file_name, skiprows=4)
df.head()
Step 4 — Create African Region Categories
The World Bank dataset contains countries but not African regions directly.
We therefore create our own region mapping.
region_map = {
'Kenya': 'East Africa',
'Uganda': 'East Africa',
'Tanzania': 'East Africa',
'Rwanda': 'East Africa',
'Nigeria': 'West Africa',
'Ghana': 'West Africa',
'Senegal': 'West Africa',
'South Africa': 'Southern Africa',
'Botswana': 'Southern Africa',
'Namibia': 'Southern Africa'
}
Step 5 — Filter the Dataset
Keep only the selected African countries:
df = df[df['Country Name'].isin(region_map.keys())]
Step 6 — Create a Region Column
Now map countries to regions:
df['Region'] = df['Country Name'].map(region_map)
Step 7 — Select the Latest Life Expectancy Data
health_data = df[['Country Name', 'Region', '2024']]
health_data = health_data.rename(
columns={'2024': 'Life_Expectancy'}
)
health_data = health_data.dropna()
health_data
Step 8 — Use groupby() to Summarise by Region
Now calculate average life expectancy by African region.
regional_summary = health_data.groupby('Region')[
'Life_Expectancy'
].mean()
regional_summary
What groupby() Is Doing
This line: groupby('Region') groups all countries that belong to the same African region.
Then: .mean() calculates the average life expectancy for each region.
Example Output
You may see results similar to:
| Region | Average Life Expectancy |
|---|---|
| East Africa | 66.5 |
| Southern Africa | 64.7 |
| West Africa | 62.0 |
This immediately reveals a pattern: East Africa has the highest average life expectancy in this sample.
That is the purpose of grouped analysis.
Instead of analyzing countries individually, you summarize broader regional trends.
Step 9 — Sort the Results
To rank regions from highest to lowest:
regional_summary = regional_summary.sort_values(
ascending=False
)
regional_summary
Step 10 — Visualize the Summary
Now create a simple bar chart:
import matplotlib.pyplot as plt
regional_summary.plot(
kind='bar',
figsize=(8,5),
color='skyblue'
)
plt.title("Average Life Expectancy by African Region")
plt.ylabel("Life Expectancy (Years)")
plt.xlabel("Region")
plt.show()
This transforms the grouped summary into a visual insight.
Why groupby() Matters in Real Analytics
Most professional analytics workflows involve grouped summaries.
For example:
hospitals grouped by county,
sales grouped by region,
transactions grouped by customer type,
or economic indicators grouped by continent.
Without groupby(), analysts would need to:
filter repeatedly,
calculate manually,
and combine outputs themselves.
groupby() automates this efficiently.
Real African Health Analytics Use Cases
The exact same technique can be used for:
vaccination rates,
maternal mortality,
HIV prevalence,
healthcare spending,
malaria incidence,
or nutrition statistics.
For example:
df.groupby('Region')['Vaccination_Rate'].mean()
or:
df.groupby('Region')['Malaria_Cases'].sum()
The real value of groupby() is not the code itself.
The value comes from the ability to transform detailed country-level data into regional insights that decision-makers can actually interpret.
That is what makes grouped analysis one of the foundations of modern data analytics.
Advance Your Career With 16 Python Projects in Data & ML — All for $288.
Comments
Post a Comment