How to Use groupby() to Summarise African Health Data by Region (Using Real World Bank Data)

One of the most important skills in data analytics is learning how to summarize large datasets into meaningful insights.



Raw datasets are often too detailed to interpret directly. Analysts therefore use aggregation techniques to answer higher-level questions such as:

  • Which region performs best?

  • Which countries have the lowest health outcomes?

  • What patterns exist across geographic areas?

In Python’s Pandas library, the groupby() function is one of the most powerful tools for this type of analysis.

In this tutorial, you will use real health data from the World Bank to summarize African life expectancy data by region using Google Colab.

We will answer a practical question:

Which African regions have the highest average life expectancy?

This is a realistic use case because regional summaries are commonly used in:

  • public health reporting,

  • NGO dashboards,

  • government policy analysis,

  • and healthcare research.


Step 1 — Download the Dataset

Go to the World Bank Life Expectancy dataset.

Download the CSV file.

This indicator measures: Life expectancy at birth (years)


Step 2 — Upload the File to Google Colab

Start by uploading the dataset:

import pandas as pd
from google.colab import files

uploaded = files.upload()


Select the CSV file you downloaded from the World Bank.


Step 3 — Load the Dataset

World Bank files contain metadata rows at the top, so use skiprows=4.

file_name = list(uploaded.keys())[0]

df = pd.read_csv(file_name, skiprows=4)

df.head()



Step 4 — Create African Region Categories

The World Bank dataset contains countries but not African regions directly.

We therefore create our own region mapping.

region_map = {
    'Kenya': 'East Africa',
    'Uganda': 'East Africa',
    'Tanzania': 'East Africa',
    'Rwanda': 'East Africa',

    'Nigeria': 'West Africa',
    'Ghana': 'West Africa',
    'Senegal': 'West Africa',

    'South Africa': 'Southern Africa',
    'Botswana': 'Southern Africa',
    'Namibia': 'Southern Africa'
}



Step 5 — Filter the Dataset

Keep only the selected African countries:

df = df[df['Country Name'].isin(region_map.keys())]



Step 6 — Create a Region Column

Now map countries to regions:

df['Region'] = df['Country Name'].map(region_map)



Step 7 — Select the Latest Life Expectancy Data

health_data = df[['Country Name', 'Region', '2024']]

health_data = health_data.rename(
    columns={'2024': 'Life_Expectancy'}
)

health_data = health_data.dropna()

health_data


Step 8 — Use groupby() to Summarise by Region

Now calculate average life expectancy by African region.

regional_summary = health_data.groupby('Region')[
    'Life_Expectancy'
].mean()

regional_summary



What groupby() Is Doing

This line: groupby('Region') groups all countries that belong to the same African region.

Then: .mean() calculates the average life expectancy for each region.


Example Output

You may see results similar to:

Region                Average Life Expectancy
East Africa                    66.5
Southern Africa                    64.7
West Africa                    62.0

This immediately reveals a pattern: East Africa has the highest average life expectancy in this sample.

That is the purpose of grouped analysis.

Instead of analyzing countries individually, you summarize broader regional trends.


Step 9 — Sort the Results

To rank regions from highest to lowest:

regional_summary = regional_summary.sort_values(
    ascending=False
)

regional_summary



Step 10 — Visualize the Summary

Now create a simple bar chart:

import matplotlib.pyplot as plt

regional_summary.plot(
    kind='bar',
    figsize=(8,5),
    color='skyblue'
)

plt.title("Average Life Expectancy by African Region")
plt.ylabel("Life Expectancy (Years)")
plt.xlabel("Region")

plt.show()


This transforms the grouped summary into a visual insight.


Why groupby() Matters in Real Analytics

Most professional analytics workflows involve grouped summaries.

For example:

  • hospitals grouped by county,

  • sales grouped by region,

  • transactions grouped by customer type,

  • or economic indicators grouped by continent.

Without groupby(), analysts would need to:

  • filter repeatedly,

  • calculate manually,

  • and combine outputs themselves.

groupby() automates this efficiently.


Real African Health Analytics Use Cases

The exact same technique can be used for:

  • vaccination rates,

  • maternal mortality,

  • HIV prevalence,

  • healthcare spending,

  • malaria incidence,

  • or nutrition statistics.

For example:

df.groupby('Region')['Vaccination_Rate'].mean()

or:

df.groupby('Region')['Malaria_Cases'].sum()

The real value of groupby() is not the code itself.

The value comes from the ability to transform detailed country-level data into regional insights that decision-makers can actually interpret.

That is what makes grouped analysis one of the foundations of modern data analytics.



Advance Your Career With 16 Python Projects in Data & ML — All for $288.

Comments

Popular posts from this blog

How to Filter Rows Using Boolean Indexing in Pandas (Afrobarometer Kenya Dataset)

How to Decide Whether to Drop or Fill Missing Value

How to create your first line chart with World Bank Kenya GDP data