How to Split African Economic Data for Train/Test Evaluation
Machine learning models are only as reliable as their evaluation process.
One of the biggest mistakes beginners make in economic forecasting is training and testing models on the same data.
This creates overly optimistic results that collapse in real-world deployment.
To properly evaluate regression models, we split the dataset into:
training data
testing data
The training data teaches the model patterns, while the testing data measures how well the model generalizes to unseen information.
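To see why evaluating on training data is misleading, here is a minimal sketch. It assumes synthetic data (not World Bank figures) and swaps in a decision tree, which is not the model used later in this tutorial, because an unconstrained tree can memorize its training rows and so makes the gap easy to see.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not World Bank figures).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 5.0, size=200)  # linear signal plus noise

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A fully grown tree fits every training row, noise included.
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = r2_score(y_tr, tree.predict(X_tr))
test_r2 = r2_score(y_te, tree.predict(X_te))
print("R² on training rows:", round(train_r2, 3))
print("R² on held-out rows:", round(test_r2, 3))
```

The score on the training rows looks near perfect, while the held-out score is lower. Only the held-out score says anything about unseen data.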
In this tutorial, we will use a real African economic dataset from the World Bank to predict GDP per capita across African countries.
Why Train/Test Splits Matter in Economic Data
Economic datasets contain patterns related to:
inflation
GDP growth
unemployment
trade
population growth
government spending
If the model is evaluated on records it already saw during training, the score mostly measures memorization rather than predictive skill.
A train/test split simulates the real-world scenario:
“Can the model make predictions on economic data it has never seen before?”
This is the foundation of reliable machine learning evaluation.
The Real Dataset We Will Use
We will use African GDP per capita data from the World Bank.
Dataset indicator:
NY.GDP.PCAP.CD
This represents:
GDP per capita (current US dollars)
We will focus on several African countries including:
Kenya
Nigeria
South Africa
Egypt
Ghana
Ethiopia
The goal is to predict GDP per capita from two other indicators: total population and trade as a percentage of GDP.
Step 1: Install Required Libraries
pip install pandas wbdata scikit-learn matplotlib
Step 2: Import Libraries
import pandas as pd
import wbdata
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
Step 3: Download African Economic Data
We will retrieve data directly from the World Bank API.
indicators = {
    'NY.GDP.PCAP.CD': 'gdp_per_capita',
    'SP.POP.TOTL': 'population',
    'NE.TRD.GNFS.ZS': 'trade_percent_gdp'
}
countries = ['KEN', 'NGA', 'ZAF', 'EGY', 'GHA', 'ETH']
df = wbdata.get_dataframe(
    indicators,
    country=countries
)
print(df.head())
print(df.head())
This creates a real economic dataset from African economies.
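The downloaded frame usually needs a little tidying before any time-based work. The sketch below assumes the frame arrives with country and date as index levels and with dates stored as year strings, which is worth confirming by printing df.index on your own download. The names raw and tidy and all values are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded frame; values are placeholders.
index = pd.MultiIndex.from_product(
    [["Kenya", "Ghana"], ["2023", "2022"]],
    names=["country", "date"],
)
raw = pd.DataFrame(
    {
        "gdp_per_capita": [1.0, 2.0, 3.0, 4.0],
        "population": [10.0, 20.0, 30.0, 40.0],
    },
    index=index,
)

# Turn the index levels into ordinary columns and add a numeric year.
tidy = raw.reset_index()
tidy["year"] = tidy["date"].astype(int)

# Sort so each country's records run from oldest to newest.
tidy = tidy.sort_values(["country", "year"]).reset_index(drop=True)
print(tidy)
```

With a numeric year column, filtering by period or ordering records chronologically becomes a one-line operation.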
Understanding the Dataset
Your dataframe may look like this:
| country | date | gdp_per_capita | population | trade_percent_gdp |
|---|---|---|---|---|
| Kenya | 2023 | 2115 | 55100586 | 33.1 |
| Nigeria | 2023 | 1596 | 223804632 | 27.8 |
| Ghana | 2023 | 2445 | 34121985 | 42.3 |
These are real macroeconomic indicators used in economic analysis and forecasting.
Step 4: Handle Missing Values
Economic datasets frequently contain missing records.
df = df.dropna()
print(df)
This removes every row in which at least one indicator is missing.
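Before dropping rows, it helps to know how much data is about to disappear. Here is a small sketch on a placeholder frame (the values are invented for illustration) that counts missing entries per column and compares row counts before and after.

```python
import numpy as np
import pandas as pd

# Placeholder values, for illustration only.
sample = pd.DataFrame(
    {
        "gdp_per_capita": [1.0, np.nan, 3.0, 4.0],
        "population": [10.0, 20.0, 30.0, 40.0],
        "trade_percent_gdp": [np.nan, 25.0, 35.0, np.nan],
    }
)

# Count missing entries in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# dropna() discards a row if any of its columns is missing.
cleaned = sample.dropna()
print(f"Rows before: {len(sample)}, rows after: {len(cleaned)}")
```

If one indicator accounts for most of the gaps, dropping that column may preserve more rows than dropping every incomplete record.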
Step 5: Define Features and Target Variable
We want to predict GDP per capita.
Target variable:
y = df['gdp_per_capita']
Feature variables:
X = df[['population', 'trade_percent_gdp']]
Step 6: Split the Dataset
This is the most important step.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
Parameters explained:
| Parameter | Meaning |
|---|---|
| test_size=0.2 | 20% reserved for testing |
| random_state=42 | Ensures reproducibility |
This means:
80% of data trains the model
20% evaluates performance
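A quick way to confirm both parameters is to split a small placeholder array twice and compare the results. The arrays below are made-up integers, not economic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 placeholder rows with two feature columns.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = first
print("Training rows:", len(X_tr))
print("Testing rows:", len(X_te))

# The same random_state yields the same partition on every run.
same_split = np.array_equal(first[1], second[1])
print("Identical test sets:", same_split)
```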
Why Randomization Matters
By default, train_test_split shuffles the records before dividing them between the training and testing datasets.
Without randomization:
one country may dominate the training set
testing data may become biased
evaluation metrics become unreliable
Machine learning assumes training and testing data come from similar distributions.
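The effect is easy to reproduce on a toy frame that is grouped by country, as downloaded data often is. The sketch below uses invented values and turns shuffling off with shuffle=False to show what an unshuffled split would hold out.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy panel grouped by country; values are placeholders.
country_names = ["Egypt", "Ghana", "Kenya"]
panel = pd.DataFrame(
    {
        "country": [name for name in country_names for _ in range(10)],
        "value": range(30),
    }
)

# Without shuffling, the test set is simply the last 20% of rows.
ordered_train, ordered_test = train_test_split(
    panel, test_size=0.2, shuffle=False
)
print(ordered_test["country"].value_counts())

# With the default shuffling, rows are drawn from across the frame.
shuffled_train, shuffled_test = train_test_split(
    panel, test_size=0.2, random_state=42
)
print(shuffled_test["country"].value_counts())
```

In the unshuffled case every held-out row belongs to the last country in the frame, so the metric would describe that country alone.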
Step 7: Train the Regression Model
model = LinearRegression()
model.fit(X_train, y_train)
The regression model learns relationships between:
population
trade activity
GDP per capita
Step 8: Make Predictions
predictions = model.predict(X_test)
print(predictions[:5])
The model now estimates GDP per capita for unseen African economic records.
Step 9: Evaluate Model Performance
We evaluate using:
MAE (Mean Absolute Error)
R² Score
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("MAE:", mae)
print("R²:", r2)
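To make the two metrics concrete, here is a small worked example on three invented values. It computes MAE and R² by hand and checks them against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Invented values, for illustration only.
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# MAE: the average size of the errors, in the target's own units.
mae_manual = np.mean(np.abs(actual - predicted))

# R²: 1 minus (squared error of the model / squared error of
# always predicting the mean of the actual values).
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MAE (manual):", mae_manual)
print("MAE (sklearn):", mean_absolute_error(actual, predicted))
print("R² (manual):", r2_manual)
print("R² (sklearn):", r2_score(actual, predicted))
```

Because MAE is expressed in the target's units, an MAE on GDP per capita reads directly as an average error in US dollars. R² is unitless and compares the model with a baseline that always predicts the mean.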
Common Mistakes When Splitting Economic Data
1. Splitting After Scaling
Wrong:
scaler.fit(X)
on the full dataset, before splitting.
This leaks test-set statistics, such as each feature's mean and standard deviation, into the training process.
Correct workflow:
Split data first
Fit preprocessing only on training data
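One way to follow this workflow is to wrap the scaler and the model in a scikit-learn pipeline, so the scaler is fitted only on whatever data the pipeline is trained on. This is a sketch on synthetic data; StandardScaler is an assumption here, since the tutorial does not say which scaler it has in mind.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X_demo = rng.normal(50, 10, size=(100, 2))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 1.0, size=100)

# 1. Split first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2. Fit scaler and model together, on the training rows only.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_tr, y_tr)

# The scaler's statistics come from the training rows alone.
fitted_scaler = pipeline.named_steps["standardscaler"]
print("Scaler means:", fitted_scaler.mean_)
print("Training-set means:", X_tr.mean(axis=0))

# The test rows are transformed with those training statistics.
print("Test R²:", pipeline.score(X_te, y_te))
```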
2. Using Chronological Data Incorrectly
Economic data often has time dependencies.
If forecasting future GDP:
use older years for training
use newer years for testing
A random split lets the model train on later years and be tested on earlier ones, which a real forecast can never do.
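A chronological split can be as simple as choosing a cutoff year. The sketch below uses a toy panel with invented values; the cutoff of 2017 is a hypothetical choice, and the year column is assumed to be numeric.

```python
import pandas as pd

# Toy panel: two countries, ten years each; values are placeholders.
years = list(range(2010, 2020))
panel = pd.DataFrame(
    {
        "country": [name for name in ["Kenya", "Ghana"] for _ in years],
        "year": years * 2,
        "value": range(20),
    }
)

cutoff_year = 2017  # hypothetical cutoff, chosen for illustration

# Older years train the model; newer years test it.
train = panel[panel["year"] <= cutoff_year]
test = panel[panel["year"] > cutoff_year]

print("Training years:", train["year"].min(), "to", train["year"].max())
print("Testing years:", test["year"].min(), "to", test["year"].max())
```

Every country contributes its early years to training and its latest years to testing, so no training row lies in the future of any test row.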
Train/Test Splits vs Time Series Splits
For standard regression: train_test_split() works well.
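For economic forecasting across time: scikit-learn's TimeSeriesSplit cross-validator is usually more appropriate.
Each of its folds trains on earlier observations and tests on later ones, which preserves chronological order, provided the rows are sorted by time.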
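Here is a minimal sketch of TimeSeriesSplit on a single series of yearly observations. It assumes the rows are already sorted from oldest to newest; n_splits=3 and the years are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve yearly observations for one hypothetical country, oldest first.
years = np.arange(2010, 2022)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(years))

for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    train_years = years[train_idx]
    test_years = years[test_idx]
    print(
        f"Fold {fold_number}: train {train_years[0]}-{train_years[-1]}, "
        f"test {test_years[0]}-{test_years[-1]}"
    )
```

TimeSeriesSplit works on row positions, so with several countries stacked in one frame it would not respect the calendar on its own. In that case, one option is to build the folds from the distinct years and then select each fold's rows by year.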
Why This Matters in African Economic Analysis
African economic datasets are increasingly used for:
fintech forecasting
trade intelligence
credit scoring
agricultural modeling
inflation prediction
sovereign risk analysis
Poor train/test methodology creates misleading economic models that fail under real-world conditions.
Reliable evaluation is therefore essential.
Using real African economic data from the World Bank makes train/test splitting much more practical and meaningful.
A proper split helps ensure that:
models generalize correctly
economic forecasts remain realistic
evaluation metrics are trustworthy
deployment performance matches expectations
In machine learning, train/test splitting is not just a preprocessing step; it is one of the foundations of scientific model evaluation.