How to Split African Economic Data for Train/Test Evaluation

Machine learning models are only as reliable as their evaluation process. 




One of the biggest mistakes beginners make in economic forecasting is training and testing models on the same data.

This creates overly optimistic results that collapse in real-world deployment.

To properly evaluate regression models, we split the dataset into:

  • training data

  • testing data

The training data teaches the model patterns, while the testing data measures how well the model generalizes to unseen information.
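To see why scoring a model on its own training rows is misleading, here is a minimal sketch on synthetic data (random numbers, not World Bank figures). It swaps in a DecisionTreeRegressor purely for illustration, because an unconstrained tree can memorize its training rows; the tutorial itself uses LinearRegression.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT World Bank figures): the target is pure noise,
# so there is no real pattern for a model to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree can memorize every training row.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

print("R² on training rows:", train_r2)
print("R² on held-out rows:", test_r2)
```

The score on the training rows looks excellent even though the target is noise. Only the held-out score reveals that the model learned nothing it can reuse.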

In this tutorial, we will use a real African economic dataset from the World Bank to predict GDP per capita across African countries.


Why Train/Test Splits Matter in Economic Data

Economic datasets contain patterns related to:

  • inflation

  • GDP growth

  • unemployment

  • trade

  • population growth

  • government spending

If the model sees every record during training, no unseen records remain to test it on, and any score it earns only measures how well it memorized the data.


A train/test split simulates the real-world scenario:

“Can the model make predictions on economic data it has never seen before?”

This is the foundation of reliable machine learning evaluation.


The Real Dataset We Will Use

We will use African GDP per capita data from the World Bank.

Dataset indicator:

NY.GDP.PCAP.CD

This represents:

GDP per capita (current US dollars)

We will focus on several African countries including:

  • Kenya

  • Nigeria

  • South Africa

  • Egypt

  • Ghana

  • Ethiopia

The goal is to predict GDP per capita from two other World Bank indicators: total population and trade as a percentage of GDP.


Step 1: Install Required Libraries

pip install pandas wbdata scikit-learn matplotlib



Step 2: Import Libraries

import pandas as pd
import wbdata

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score



Step 3: Download African Economic Data

We will retrieve data directly from the World Bank API.

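# Map World Bank indicator codes to readable column names
indicators = {
    'NY.GDP.PCAP.CD': 'gdp_per_capita',       # GDP per capita (current US$)
    'SP.POP.TOTL': 'population',              # Population, total
    'NE.TRD.GNFS.ZS': 'trade_percent_gdp'     # Trade (% of GDP)
}

# ISO 3-letter codes: Kenya, Nigeria, South Africa, Egypt, Ghana, Ethiopia
countries = ['KEN', 'NGA', 'ZAF', 'EGY', 'GHA', 'ETH']

# Requires an internet connection; one row per country and year
df = wbdata.get_dataframe(
    indicators,
    country=countries
)

# Preview the first rows
print(df.head())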


This creates a real economic dataset from African economies.


Understanding the Dataset

Your dataframe may look like this:

country    date    gdp_per_capita    population    trade_percent_gdp
Kenya      2023    2115              55100586      33.1
Nigeria    2023    1596              223804632     27.8
Ghana      2023    2445              34121985      42.3
These are real macroeconomic indicators used in economic analysis and forecasting.
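One practical detail: with several countries, wbdata is assumed here to return country and date as index levels rather than ordinary columns, so it is worth checking df.index on your own download. The sketch below rebuilds the three sample rows from the table by hand (no API call) and shows how reset_index() turns those levels into columns. The names sample and flat, and the added year column, are illustrative.

```python
import pandas as pd

# Hand-built frame mirroring the three sample rows in the table; a live
# wbdata call is ASSUMED (not verified here) to use the same index layout.
index = pd.MultiIndex.from_tuples(
    [("Kenya", "2023"), ("Nigeria", "2023"), ("Ghana", "2023")],
    names=["country", "date"],
)
sample = pd.DataFrame(
    {
        "gdp_per_capita": [2115, 1596, 2445],
        "population": [55100586, 223804632, 34121985],
        "trade_percent_gdp": [33.1, 27.8, 42.3],
    },
    index=index,
)

# Move the index levels into ordinary columns so they can be filtered on.
flat = sample.reset_index()

# Store the year as an integer for later comparisons such as year < 2020.
flat["year"] = flat["date"].astype(int)

print(flat)
```

Having country and year as plain columns makes later steps, such as splitting by year, easier to write.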


Step 4: Handle Missing Values

Economic datasets frequently contain missing records.

df = df.dropna()
print(df)




This removes incomplete rows.
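Before dropping rows, it helps to know how much data will be lost and whether the loss falls evenly across countries. The following sketch uses a small made-up frame (the values and the country labels A and B are placeholders, not World Bank figures) to count gaps per column and rows kept per country.

```python
import numpy as np
import pandas as pd

# Made-up illustrative values, NOT World Bank figures.
raw = pd.DataFrame(
    {
        "country": ["A", "A", "A", "B", "B", "B"],
        "gdp_per_capita": [1000.0, np.nan, 1100.0, 2000.0, 2050.0, np.nan],
        "population": [5.0e6, 5.1e6, 5.2e6, 9.0e6, 9.1e6, 9.2e6],
        "trade_percent_gdp": [30.0, 31.0, np.nan, 45.0, 44.0, 46.0],
    }
)

# Count missing values per column before removing anything.
missing_per_column = raw.isna().sum()

# Drop every row that has at least one missing value.
clean = raw.dropna()

# Check how many complete rows each country keeps.
rows_kept_per_country = clean.groupby("country").size()

print(missing_per_column)
print(rows_kept_per_country)
```

If one country loses most of its rows, the model will mostly learn from the others, which is worth knowing before interpreting any score.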


Step 5: Define Features and Target Variable

We want to predict GDP per capita.

Target variable:

y = df['gdp_per_capita']

Feature variables:

X = df[['population', 'trade_percent_gdp']]


Step 6: Split the Dataset

This is the most important step.

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)



Parameters explained:

Parameter          Meaning
test_size=0.2      20% reserved for testing
random_state=42    Ensures reproducibility

This means:

  • 80% of data trains the model

  • 20% evaluates performance
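A quick way to confirm both parameters is to check the sizes of the resulting sets and to run the split twice with the same seed. This sketch uses placeholder arrays (X_demo and y_demo are illustrative names) instead of the downloaded data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for X and y: 100 rows, 2 features.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Run the same split twice with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train, X_test, y_train, y_test = split_a
print("training rows:", len(X_train))
print("testing rows:", len(X_test))

# With an identical random_state, the same rows land in the test set.
same_test_rows = np.array_equal(split_a[3], split_b[3])
print("identical test sets:", same_test_rows)
```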


Why Randomization Matters

By default, train_test_split() shuffles records before dividing them into training and testing datasets.

Without shuffling, on data that is grouped by country:

  • one country may dominate the training set

  • the testing set may contain only one or two countries

  • evaluation metrics become unreliable

Machine learning evaluation assumes training and testing data come from similar distributions.
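The effect is easy to reproduce. The sketch below builds a placeholder panel grouped by country (ten rows each, no real indicator values) and compares an unshuffled split with the default shuffled one. The frame name panel and the year range are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

codes = ["KEN", "NGA", "ZAF", "EGY", "GHA", "ETH"]

# Placeholder panel: 10 rows per country, grouped by country.
panel = pd.DataFrame(
    {
        "country": [code for code in codes for _ in range(10)],
        "year": list(range(2014, 2024)) * len(codes),
    }
)

# shuffle=False keeps the original row order, so the test set is the tail.
_, test_ordered = train_test_split(panel, test_size=0.2, shuffle=False)

# The default (shuffle=True) draws test rows from the whole frame.
_, test_shuffled = train_test_split(panel, test_size=0.2, random_state=42)

ordered_countries = sorted(test_ordered["country"].unique())
shuffled_countries = sorted(test_shuffled["country"].unique())

print("unshuffled test countries:", ordered_countries)
print("shuffled test countries:", shuffled_countries)
```

In the unshuffled case the test set holds only the countries at the end of the frame, so the score says little about the others.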


Step 7: Train the Regression Model

model = LinearRegression()

model.fit(X_train, y_train)


The regression model learns relationships between:

  • population

  • trade activity

  • GDP per capita


Step 8: Make Predictions

predictions = model.predict(X_test)

print(predictions[:5])


The model now estimates GDP per capita for unseen African economic records.


Step 9: Evaluate Model Performance

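We evaluate using:

  • MAE (Mean Absolute Error): the average absolute gap between predicted and actual GDP per capita, in US dollars

  • R² Score: the share of variance in the target that the model explains; 1.0 is a perfect fit, and 0 is no better than always predicting the mean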

mae = mean_absolute_error(y_test, predictions)

r2 = r2_score(y_test, predictions)

print("MAE:", mae)
print("R²:", r2)
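Raw MAE and R² values are easier to judge next to a naive baseline. One common option, sketched here on synthetic data (not World Bank figures), is scikit-learn's DummyRegressor, which always predicts the training mean. The baseline comparison is an addition for illustration, not part of the original workflow.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear signal (NOT World Bank figures).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: ignore the features and always predict the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
model_mae = mean_absolute_error(y_test, model.predict(X_test))
model_r2 = r2_score(y_test, model.predict(X_test))

print("Baseline MAE:", baseline_mae)
print("Model MAE:", model_mae)
print("Model R²:", model_r2)
```

A model whose MAE is not clearly below the baseline has learned little from its features, whatever the absolute numbers look like.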




Common Mistakes When Splitting Economic Data

1. Splitting After Scaling

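Wrong: calling

scaler.fit(X)

on the full dataset before splitting.

The scaler then computes its statistics (for example, each feature's mean and standard deviation) from rows that later land in the test set, which leaks information from the test set into training.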

Correct workflow:

  1. Split data first

  2. Fit preprocessing only on training data
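One way to follow this workflow without bookkeeping is a scikit-learn pipeline, which fits the scaler only on whatever data is passed to fit(). The sketch below uses synthetic data (not World Bank figures) and StandardScaler as an example preprocessing step; the original post does not say which scaler it has in mind.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (NOT World Bank figures).
rng = np.random.default_rng(2)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = 2.0 * X[:, 0] + rng.normal(size=100)

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit preprocessing and model on the training rows only.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# At prediction time, test rows are scaled with the training statistics.
test_score = pipeline.score(X_test, y_test)

# Confirm the scaler's stored mean comes from the training rows.
scaler = pipeline.named_steps["standardscaler"]
uses_train_mean = np.allclose(scaler.mean_, X_train.mean(axis=0))

print("Test R²:", test_score)
print("Scaler fitted on training rows only:", uses_train_mean)
```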


2. Using Chronological Data Incorrectly

Economic data often has time dependencies.

If forecasting future GDP:

  • use older years for training

  • use newer years for testing

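A random split can place later years in the training set and earlier years in the test set, so the model is effectively scored on predicting the past from the future, which no real forecast can do.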
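A simple chronological split can be sketched with a cutoff year. The panel below is a placeholder (country codes and years only, no indicator values), and cutoff_year = 2020 is an arbitrary choice for illustration.

```python
import pandas as pd

codes = ["KEN", "NGA", "ZAF"]
years = list(range(2010, 2024))

# Placeholder panel; real rows would also carry the indicator columns.
panel = pd.DataFrame(
    {
        "country": [code for code in codes for _ in years],
        "year": years * len(codes),
    }
)

# Hypothetical cutoff: train on years before 2020, test on 2020 onward.
cutoff_year = 2020
train = panel[panel["year"] < cutoff_year]
test = panel[panel["year"] >= cutoff_year]

print("training years:", train["year"].min(), "to", train["year"].max())
print("testing years:", test["year"].min(), "to", test["year"].max())
```

Every country still appears in both sets, but no training row is more recent than any test row.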


Train/Test Splits vs Time Series Splits

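For standard regression on records without time dependence: train_test_split() works well.

For economic forecasting across time: scikit-learn's TimeSeriesSplit cross-validator is usually more appropriate.

In each fold it trains on earlier observations and tests on later ones, which preserves chronological order.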
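Here is a minimal sketch of how TimeSeriesSplit divides a single series. It assumes ten yearly observations for one country, already sorted from oldest to newest. For a panel of several countries the rows would need to be ordered by year first, which this sketch does not cover.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten yearly observations for a single country, sorted oldest to newest.
years = np.arange(2014, 2024)

splitter = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, test_idx in splitter.split(years):
    train_years = years[train_idx]
    test_years = years[test_idx]
    folds.append((train_years, test_years))
    # Each fold trains on an expanding window and tests on the years after it.
    print(
        "train:", train_years[0], "to", train_years[-1],
        "| test:", test_years[0], "to", test_years[-1],
    )
```

Each successive fold adds more history to the training window, and the test years always come after it.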


Why This Matters in African Economic Analysis

African economic datasets are increasingly used for:

  • fintech forecasting

  • trade intelligence

  • credit scoring

  • agricultural modeling

  • inflation prediction

  • sovereign risk analysis


Poor train/test methodology creates misleading economic models that fail under real-world conditions.

Reliable evaluation is therefore essential.



Using real African economic data from the World Bank makes train/test splitting much more practical and meaningful.

A proper split helps ensure that:

  • models generalize correctly

  • economic forecasts remain realistic

  • evaluation metrics are trustworthy

  • deployment performance matches expectations

In machine learning, train/test splitting is not just a preprocessing step; it is one of the foundations of scientific model evaluation.


