How to Split African Economic Data for Train/Test Evaluation

May 25, 2026

Machine learning models are only as reliable as their evaluation process.

One of the biggest mistakes beginners make in economic forecasting is training and testing models on the same data.

This creates overly optimistic results that collapse in real-world deployment.

To properly evaluate regression models, we split the dataset into:

training data
testing data

The training data teaches the model patterns, while the testing data measures how well the model generalizes to unseen information.

In this tutorial, we will use real African economic data from the World Bank to predict GDP per capita for several African countries.

Why Train/Test Splits Matter in Economic Data

Economic datasets contain patterns related to:

inflation
GDP growth
unemployment
trade
population growth
government spending

If the model sees every record during training, scoring it on those same records only measures how well it memorized them, not how well it predicts.

A train/test split simulates the real-world scenario:

“Can the model make predictions on economic data it has never seen before?”

This is the foundation of reliable machine learning evaluation.
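The gap between "seen" and "unseen" performance is easy to demonstrate. The sketch below uses synthetic numbers (not World Bank data) and swaps in a decision tree, a model chosen here only because it memorizes training rows very readily. Every variable name is an illustrative assumption.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one feature with a linear signal plus heavy noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A fully grown tree can reproduce every training row exactly.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

print("R² on rows the model has seen:", round(train_r2, 3))
print("R² on held-out rows:", round(test_r2, 3))
```

The score on the training rows looks close to perfect, while the held-out score is noticeably lower. Only the second number says anything about real-world use.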

The Real Dataset We Will Use

We will use African GDP per capita data from the World Bank.

Dataset indicator:

NY.GDP.PCAP.CD

This represents:

GDP per capita (current US dollars)

We will focus on several African countries including:

Kenya
Nigeria
South Africa
Egypt
Ghana
Ethiopia

The goal is to predict GDP per capita from two other World Bank indicators: total population and trade as a share of GDP.

Step 1: Install Required Libraries

pip install pandas wbdata scikit-learn matplotlib

Step 2: Import Libraries

import pandas as pd
import wbdata

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

Step 3: Download African Economic Data

We will retrieve data directly from the World Bank API.

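# World Bank indicator codes mapped to readable column names.
indicators = {
    'NY.GDP.PCAP.CD': 'gdp_per_capita',
    'SP.POP.TOTL': 'population',
    'NE.TRD.GNFS.ZS': 'trade_percent_gdp'
}

# ISO alpha-3 codes: Kenya, Nigeria, South Africa, Egypt, Ghana, Ethiopia.
countries = ['KEN', 'NGA', 'ZAF', 'EGY', 'GHA', 'ETH']

# Download the three indicators for these countries.
df = wbdata.get_dataframe(
    indicators,
    country=countries
)

print(df.head())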

This creates a real economic dataset from African economies.
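When several countries are requested, wbdata usually returns a frame indexed by country and date, with the dates stored as strings. That is an assumption about the installed version, so check df.index and df.dtypes on your own download. The sketch below uses a small stand-in frame with placeholder values (not real figures) to show one way to flatten the index and get a sortable integer year. The names raw, flat and year are illustrative.

```python
import pandas as pd

# Stand-in for the downloaded frame: (country, date) index, placeholder values.
index = pd.MultiIndex.from_product(
    [["Kenya", "Ghana"], ["2023", "2022"]], names=["country", "date"]
)
raw = pd.DataFrame(
    {
        "gdp_per_capita": [1.0, 2.0, 3.0, 4.0],
        "population": [10.0, 11.0, 12.0, 13.0],
        "trade_percent_gdp": [30.0, 31.0, 32.0, 33.0],
    },
    index=index,
)

# Move the index levels into ordinary columns.
flat = raw.reset_index()

# Convert the string dates to integers so they sort and compare as years.
flat["year"] = flat["date"].astype(int)
flat = flat.sort_values(["country", "year"]).reset_index(drop=True)

print(flat)
```

Having an explicit year column makes any later chronological split much easier to write.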

Understanding the Dataset

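wbdata typically returns country and date as index levels. Once they are moved into ordinary columns with df = df.reset_index(), your dataframe may look like this: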

country	date	gdp_per_capita	population	trade_percent_gdp
Kenya	2023	2115	55100586	33.1
Nigeria	2023	1596	223804632	27.8
Ghana	2023	2445	34121985	42.3

These are real macroeconomic indicators used in economic analysis and forecasting.

Step 4: Handle Missing Values

Economic datasets frequently contain missing records.

df = df.dropna()
print(df)

This removes every row in which at least one of the three indicators is missing.
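Before dropping anything, it helps to see how much data is about to disappear. A minimal sketch, using a tiny frame of placeholder values rather than the real download:

```python
import numpy as np
import pandas as pd

# Placeholder values for illustration only.
df = pd.DataFrame(
    {
        "gdp_per_capita": [1.0, 2.0, np.nan, 4.0],
        "population": [10.0, 11.0, 12.0, 13.0],
        "trade_percent_gdp": [30.0, np.nan, 32.0, 33.0],
    }
)

# Count the gaps in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop incomplete rows and report how many were lost.
rows_before = len(df)
clean = df.dropna()
print(f"Dropped {rows_before - len(clean)} of {rows_before} rows")
```

If one indicator accounts for most of the gaps, it may be worth reconsidering that feature instead of discarding many country-years.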

Step 5: Define Features and Target Variable

We want to predict GDP per capita.

Target variable:

y = df['gdp_per_capita']

Feature variables:

X = df[['population', 'trade_percent_gdp']]

Step 6: Split the Dataset

This is the most important step.

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

Parameters explained:

Parameter	Meaning
test_size=0.2	20% reserved for testing
random_state=42	Ensures reproducibility

This means:

80% of data trains the model
20% evaluates performance
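The two parameters can be checked on a toy array. This sketch uses 50 placeholder rows, so the 80/20 proportions work out to 40 and 10:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 placeholder rows with two feature columns each.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)

# Repeating the call with the same random_state gives the same split.
X_train_again, _, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(np.array_equal(X_train, X_train_again))  # True
```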

Why Randomization Matters

The split randomly distributes records between training and testing datasets.

Without randomization:

one country may dominate the training set
testing data may become biased
evaluation metrics become unreliable

Machine learning assumes training and testing data come from similar distributions.
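To see the effect, compare a shuffled split with an unshuffled one on rows that are grouped by country, which is a common layout after a download. The frame below is a made-up stand-in that only carries country labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 10 rows per country, stored country by country.
countries = ["Kenya", "Nigeria", "South Africa", "Egypt", "Ghana"]
df = pd.DataFrame({"country": [c for c in countries for _ in range(10)]})

# shuffle=False: the last 20% of rows become the test set.
_, test_ordered = train_test_split(df, test_size=0.2, shuffle=False)

# Default behaviour: rows are drawn at random.
_, test_shuffled = train_test_split(df, test_size=0.2, random_state=42)

print(test_ordered["country"].value_counts())
print(test_shuffled["country"].value_counts())
```

Without shuffling, the test set here contains only the last country in the file, so the metrics would describe that one economy and nothing else.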

Step 7: Train the Regression Model

model = LinearRegression()

model.fit(X_train, y_train)

The regression model learns relationships between:

population
trade activity
GDP per capita

Step 8: Make Predictions

predictions = model.predict(X_test)

print(predictions[:5])

The model now estimates GDP per capita for unseen African economic records.

Step 9: Evaluate Model Performance

We evaluate using:

MAE (Mean Absolute Error)
R² Score

mae = mean_absolute_error(y_test, predictions)

r2 = r2_score(y_test, predictions)

print("MAE:", mae)
print("R²:", r2)
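To make the two metrics concrete, here is a small worked example with made-up numbers chosen for easy arithmetic. It computes both metrics by hand and compares them with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted values.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE: the average size of the errors, in the units of the target.
# (10 + 10 + 30) / 3
mae_manual = np.mean(np.abs(y_true - y_pred))

# R²: 1 minus the model's squared error divided by the squared error
# of always predicting the mean of y_true.
ss_res = np.sum((y_true - y_pred) ** 2)         # 100 + 100 + 900 = 1100
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 10000 + 0 + 10000 = 20000
r2_manual = 1 - ss_res / ss_tot                 # 1 - 0.055 = 0.945

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

For GDP per capita, MAE reads directly as "the prediction is off by this many US dollars on average". An R² near 1 means the model does far better than predicting the average, and an R² near 0 or below means it does not.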

Common Mistakes When Splitting Economic Data

1. Splitting After Scaling

Wrong:

scaler.fit(X)

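on the full dataset, before splitting.

This leaks information from the test set, because the scaler's statistics (such as the mean and standard deviation) are computed partly from test rows.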

Correct workflow:

Split data first
Fit preprocessing only on training data
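A minimal sketch of that workflow, assuming StandardScaler as the preprocessing step and random placeholder features in place of the real indicators:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random placeholder features and target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = rng.normal(size=100)

# 1. Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training rows only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training statistics on the test rows; do not refit.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)               # means learned from X_train only
print(X_test_scaled.mean(axis=0)) # close to 0, but not exactly 0
```

The test features will not have a mean of exactly zero after scaling. That is expected: they are being treated as new data that the scaler has never seen.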

2. Using Chronological Data Incorrectly

Economic data often has time dependencies.

If forecasting future GDP:

use older years for training
use newer years for testing

A random split lets the model train on later years and be tested on earlier ones, which no real forecast can do.
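One simple way to respect time order is to split on a cutoff year. The sketch below assumes the data already has an integer year column. The frame holds placeholder values, and cutoff_year is a hypothetical choice, not a recommendation.

```python
import pandas as pd

# Placeholder panel: two countries, five years each (not real figures).
df = pd.DataFrame(
    {
        "country": ["Kenya"] * 5 + ["Ghana"] * 5,
        "year": list(range(2019, 2024)) * 2,
        "population": [float(i) for i in range(10)],
        "trade_percent_gdp": [float(i) for i in range(10, 20)],
        "gdp_per_capita": [float(i) for i in range(20, 30)],
    }
)

cutoff_year = 2022  # hypothetical: train before 2022, test from 2022 onward

train = df[df["year"] < cutoff_year]
test = df[df["year"] >= cutoff_year]

features = ["population", "trade_percent_gdp"]
X_train, y_train = train[features], train["gdp_per_capita"]
X_test, y_test = test[features], test["gdp_per_capita"]

print(sorted(train["year"].unique()))
print(sorted(test["year"].unique()))
```

Every training row is now older than every test row, for every country.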

Train/Test Splits vs Time Series Splits

For standard regression: train_test_split() works well.

For economic forecasting across time: TimeSeriesSplit() is usually more appropriate.

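Provided the rows are sorted by date, each fold trains on earlier rows and tests on the rows that follow, so chronological order is preserved.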
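A short sketch of how TimeSeriesSplit carves up one series. It assumes a single sequence of ten yearly observations sorted oldest to newest. With several countries in one frame, the rows would need to be sorted by year first, because the folds are built from row positions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten yearly observations, oldest first.
years = np.arange(2014, 2024).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, test_idx in tscv.split(years):
    train_years = years[train_idx, 0].tolist()
    test_years = years[test_idx, 0].tolist()
    folds.append((train_years, test_years))
    print("train:", train_years, "test:", test_years)
```

The training window grows with each fold, and the test years always come after it.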

Why This Matters in African Economic Analysis

African economic datasets are increasingly used for:

fintech forecasting
trade intelligence
credit scoring
agricultural modeling
inflation prediction
sovereign risk analysis

Poor train/test methodology creates misleading economic models that fail under real-world conditions.

Reliable evaluation is therefore essential.

Using real African economic data from the World Bank makes train/test splitting much more practical and meaningful.

A proper split helps ensure that:

models generalize correctly
economic forecasts remain realistic
evaluation metrics are trustworthy
deployment performance matches expectations

In machine learning, train/test splitting is not just a preprocessing step; it is one of the foundations of scientific model evaluation.
