How to Predict GDP Growth from Socioeconomic Indicators Using World Bank Data
Economic growth forecasting is one of the most valuable applications of machine learning in economics, finance, and public policy.
Governments, investors, development organizations, and businesses all rely on GDP growth forecasts to make strategic decisions.
In this tutorial, you will learn how to build a machine learning model that predicts GDP growth using socioeconomic indicators from the World Bank Open Data Platform.
We will use indicators such as:
Inflation
Unemployment
Population growth
Exports
Education enrollment
Foreign direct investment (FDI)
Internet penetration
The target variable is annual GDP growth, expressed as a percentage.
The World Bank provides over 16,000 indicators across hundreds of countries through its data platform and API. (World Bank Data Help Desk)
Why GDP Growth Prediction Matters
GDP growth measures how fast an economy expands or contracts over time.
The World Bank defines GDP growth as the annual percentage growth rate of GDP at market prices based on constant local currency. (World Bank Open Data)
Economists use GDP growth forecasts to:
Evaluate economic stability
Predict recessions
Analyze investment opportunities
Compare country performance
Study the impact of policies
Machine learning helps identify nonlinear relationships between socioeconomic variables and economic performance.
Step 1: Install Required Libraries
pip install pandas numpy scikit-learn matplotlib seaborn wbdata
We will use:
pandas for data manipulation
scikit-learn for machine learning
wbdata to pull World Bank indicators
Step 2: Import Libraries
import pandas as pd
import numpy as np
import wbdata
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
Step 3: Select World Bank Indicators
We will predict GDP growth using several socioeconomic indicators.
| Indicator | World Bank Code |
|---|---|
| GDP Growth | NY.GDP.MKTP.KD.ZG |
| Inflation | FP.CPI.TOTL.ZG |
| Unemployment | SL.UEM.TOTL.ZS |
| Population Growth | SP.POP.GROW |
| Internet Users | IT.NET.USER.ZS |
| Exports (% GDP) | NE.EXP.GNFS.ZS |
| School Enrollment | SE.SEC.ENRR |
The GDP growth indicator is officially available through the World Bank database. (World Bank Open Data)
Step 4: Download Data from the World Bank
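# Map World Bank indicator codes to readable column names
indicators = {
    'NY.GDP.MKTP.KD.ZG': 'gdp_growth',
    'FP.CPI.TOTL.ZG': 'inflation',
    'SL.UEM.TOTL.ZS': 'unemployment',
    'SP.POP.GROW': 'population_growth',
    'IT.NET.USER.ZS': 'internet_users',
    'NE.EXP.GNFS.ZS': 'exports',
    'SE.SEC.ENRR': 'school_enrollment'
}
# Download every indicator for all countries and years (requires network access)
data = wbdata.get_dataframe(indicators)
# The result is indexed by country and date; turn both into regular columns
df = data.reset_index()
print(df.head())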
This creates a dataset containing multiple countries and years.
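A download for all countries typically also contains World Bank aggregates such as "World" or income groups, and the date column usually arrives as text. The sketch below shows one way to handle both on a small hand-made frame that mimics the layout after reset_index(). The values are made up, and AGGREGATES is a hypothetical, partial list that you would need to extend for real data.

```python
import pandas as pd

# Toy frame mimicking the layout after reset_index(): one row per country and year.
# All values are made up for illustration.
df = pd.DataFrame({
    "country": ["World", "World", "Country A", "Country A", "Country B", "Country B"],
    "date": ["2021", "2020", "2021", "2020", "2021", "2020"],
    "gdp_growth": [6.0, -3.0, 7.5, -0.5, 11.0, -6.0],
    "inflation": [3.5, 2.0, 6.0, 5.5, 4.5, 3.0],
})

# Hypothetical, partial list of aggregate names; extend it for real data.
AGGREGATES = {"World", "Euro area", "High income", "Low income"}

# Keep individual countries only.
countries = df[~df["country"].isin(AGGREGATES)].copy()

# Store the year as an integer and order rows chronologically within each country.
countries["year"] = countries["date"].astype(int)
countries = (
    countries.drop(columns="date")
    .sort_values(["country", "year"])
    .reset_index(drop=True)
)

print(countries)
```

If you keep year as an integer column, a numeric-only feature selection will treat it as a feature, so drop it from the feature matrix unless that is what you want.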
Step 5: Clean the Dataset
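World Bank datasets often contain missing values. The simplest fix is to drop every row that has at least one missing value:
df = df.dropna()
Alternatively, you can impute instead of dropping. Run this in place of the line above, because filling has no effect once the incomplete rows are gone:
df = df.fillna(df.mean(numeric_only=True))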
The World Bank regularly updates datasets and methodologies, so re-check coverage and repeat the cleaning step whenever you refresh the data. (Data Topics)
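Filling every gap with a global mean mixes very different economies, and it also fills the target. A gentler option is sketched below on made-up data: measure how much is missing, drop rows where gdp_growth itself is missing, then fill each feature with the median of the same country. The column names follow the tutorial, and the choice of a per-country median is an assumption rather than a recommendation from the World Bank.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in the target and in two features.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "gdp_growth": [2.0, np.nan, 3.0, 5.0, 4.0, 6.0],
    "inflation": [1.0, 2.0, np.nan, 10.0, np.nan, 12.0],
    "unemployment": [5.0, 5.5, 6.0, np.nan, 8.0, 9.0],
})

# Share of missing values per column, to decide what is worth keeping.
missing_share = df.isna().mean()
print(missing_share)

# Do not impute the target: drop rows where it is missing.
clean = df.dropna(subset=["gdp_growth"]).copy()

# Fill each feature with the median of the same country.
feature_cols = ["inflation", "unemployment"]
clean[feature_cols] = clean.groupby("country")[feature_cols].transform(
    lambda s: s.fillna(s.median())
)

# A country with no observations for a feature still has gaps; drop those rows.
clean = clean.dropna(subset=feature_cols)
print(clean)
```

In a real pipeline, compute the medians on the training portion only, so that information from the test rows does not leak into the features.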
Step 6: Define Features and Target
Our target variable is GDP growth.
X = df.drop(columns=['gdp_growth'])
# Remove non-numeric columns
X = X.select_dtypes(include=np.number)
y = df['gdp_growth']
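It helps to check what the numeric filter actually keeps. The sketch below repeats the selection on a tiny made-up frame, assuming the country and date columns hold text, and then prints the pairwise correlations between the remaining features as a quick look at how strongly they overlap.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same kinds of columns as the downloaded data.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "date": ["2021", "2020", "2021", "2020"],
    "gdp_growth": [2.0, -1.0, 4.0, 0.5],
    "inflation": [3.0, 2.0, 8.0, 6.0],
    "exports": [30.0, 28.0, 15.0, 14.0],
})

# Drop the target, then keep numeric columns only.
X = df.drop(columns=["gdp_growth"]).select_dtypes(include=np.number)
y = df["gdp_growth"]

print(list(X.columns))   # ['inflation', 'exports']
print(X.shape, y.shape)  # (4, 2) (4,)

# Pairwise correlations between features; values near +1 or -1 signal overlap.
print(X.corr().round(2))
```

The text columns country and date drop out, so the model sees only the indicators.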
Step 7: Split the Data
We separate training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
A typical split is:
80% training
20% testing
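A random split mixes years, so the model can be tested on 2015 after training on 2016. If the goal is forecasting, a chronological split is a common alternative. The sketch below assumes an integer year column and uses a hypothetical CUTOFF_YEAR; both are illustrative and not part of the code above.

```python
import pandas as pd

# Made-up panel: two countries observed over five years.
df = pd.DataFrame({
    "country": ["A"] * 5 + ["B"] * 5,
    "year": [2015, 2016, 2017, 2018, 2019] * 2,
    "inflation": [2.0, 2.5, 3.0, 2.8, 2.2, 8.0, 7.5, 9.0, 6.5, 6.0],
    "gdp_growth": [3.0, 2.5, 2.0, 2.2, 2.8, 5.0, 5.5, 4.0, 6.0, 6.5],
})

CUTOFF_YEAR = 2018  # hypothetical: train on earlier years, test on this year onward

train = df[df["year"] < CUTOFF_YEAR]
test = df[df["year"] >= CUTOFF_YEAR]

feature_cols = ["inflation"]
X_train, y_train = train[feature_cols], train["gdp_growth"]
X_test, y_test = test[feature_cols], test["gdp_growth"]

print(len(train), "training rows,", len(test), "test rows")
```

With this split every test row is later than every training row, which is closer to how a forecast is used in practice.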
Step 8: Train the Model
We will use a Random Forest Regressor.
model = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)
model.fit(X_train, y_train)
A random forest is a reasonable starting point because it can capture nonlinear relationships and interactions between indicators without feature scaling.
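To see what the nonlinearity argument means in practice, the sketch below fits a random forest and a plain linear regression to synthetic data whose target is a nonlinear function of two features. The data is generated, not economic, so it only illustrates the mechanism and says nothing about how either model performs on GDP growth.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features and a target that depends on them nonlinearly.
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 0] ** 2 - 2 * np.abs(X[:, 1]) + rng.normal(scale=0.3, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same forest settings as in the tutorial.
forest = RandomForestRegressor(n_estimators=200, random_state=42)
forest.fit(X_train, y_train)

# Linear baseline for comparison.
linear = LinearRegression()
linear.fit(X_train, y_train)

forest_r2 = r2_score(y_test, forest.predict(X_test))
linear_r2 = r2_score(y_test, linear.predict(X_test))

print("Random forest R²:", round(forest_r2, 3))
print("Linear regression R²:", round(linear_r2, 3))
```

On real indicators the gap may be much smaller, so it is worth fitting a simple baseline next to the forest before trusting the more complex model.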
Step 9: Make Predictions
predictions = model.predict(X_test)
Step 10: Evaluate the Model
We use:
MAE (Mean Absolute Error)
R² Score
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("MAE:", mae)
print("R²:", r2)
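R² measures the proportion of variance in the test-set target that the model explains.
An R² score closer to 1 indicates stronger predictive performance, while a score at or below 0 means the model does no better than always predicting the mean. MAE is in the same units as the target, so here it is measured in percentage points of GDP growth.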
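Both metrics are easy to compute by hand, which makes it clearer what the numbers mean. The sketch below uses four made-up values, follows the standard definitions (MAE as the mean absolute difference, R² as one minus the ratio of residual to total sum of squares), and compares the result with a baseline that always predicts the mean.

```python
import numpy as np

# Made-up actual and predicted growth rates, in percent.
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([3.0, 4.0, 5.0, 8.0])

# MAE: average absolute error, (1 + 0 + 1 + 0) / 4 = 0.5
mae = np.mean(np.abs(y_true - y_pred))

# R²: 1 - residual sum of squares / total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)         # 1 + 0 + 1 + 0 = 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot                        # 1 - 2/20 = 0.9

# Baseline: always predict the mean of the actual values.
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_mae = np.mean(np.abs(y_true - baseline_pred))  # (3 + 1 + 1 + 3) / 4 = 2.0

print("MAE:", round(mae, 3))
print("R²:", round(r2, 3))
print("Baseline MAE:", round(baseline_mae, 3))
```

A model is only useful if its MAE is clearly below the baseline MAE on data it has not seen.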
Step 11: Analyze Feature Importance
Feature importance shows which socioeconomic indicators the model relies on most. It reflects predictive association within this dataset, not a causal effect on GDP growth.
importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
})
importance = importance.sort_values(
    by='Importance',
    ascending=False
)
print(importance)
You may discover that:
Inflation strongly impacts growth
Internet penetration correlates with productivity
Exports influence developing economies
Education levels improve long-term GDP performance
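The feature_importances_ attribute is computed from the training data. As a cross-check, scikit-learn also offers permutation importance, which measures how much the test score drops when one feature is shuffled. The sketch below compares the two on synthetic data in which the target depends strongly on one feature, weakly on another, and not at all on a third. The feature names are borrowed from the tutorial for readability, but the values are random and the coefficients are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic features; "noise" has no relationship with the target.
X = pd.DataFrame({
    "inflation": rng.normal(size=n),
    "exports": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# Arbitrary synthetic target: strong effect of inflation, weak effect of exports.
y = -2.0 * X["inflation"] + 0.5 * X["exports"] + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature on the test set and record the average drop in R².
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

comparison = pd.DataFrame({
    "Feature": X.columns,
    "Impurity": model.feature_importances_,
    "Permutation": result.importances_mean,
}).sort_values(by="Permutation", ascending=False)

print(comparison)
```

When the two rankings disagree on real data, correlated features are a frequent cause, so treat either ranking as a hint rather than a verdict.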
Example Insights
A model trained on World Bank data might produce importances like the following (illustrative values, not actual results):
| Feature | Importance |
|---|---|
| Inflation | 0.29 |
| Exports | 0.22 |
| Internet Users | 0.18 |
| Population Growth | 0.14 |
| Unemployment | 0.10 |
| School Enrollment | 0.07 |
These relationships vary by country and time period.
Common Challenges
1. Missing Data
Many countries have incomplete records.
2. Multicollinearity
Some indicators are highly correlated.
3. Time Dependency
GDP growth depends heavily on historical trends.
4. Economic Shocks
Pandemics, wars, and inflation crises can reduce prediction accuracy.
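Of the challenges above, time dependency is the most direct to address in code: give the model last year's growth as an extra feature. The sketch below builds a one-year lag per country on made-up data. The column name gdp_growth_lag1 and the integer year column are assumptions introduced for this example.

```python
import pandas as pd

# Made-up panel: two countries, four years each.
df = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "year": [2019, 2020, 2021, 2022] * 2,
    "gdp_growth": [2.0, -3.0, 5.0, 2.5, 6.0, 1.0, 7.0, 6.5],
})

# Rows must be in chronological order within each country before shifting.
df = df.sort_values(["country", "year"])

# Previous year's growth for the same country; grouping prevents values
# from one country spilling into the first year of the next.
df["gdp_growth_lag1"] = df.groupby("country")["gdp_growth"].shift(1)

# The first year of each country has no previous value, so drop it.
lagged = df.dropna(subset=["gdp_growth_lag1"])

print(lagged)
```

The same pattern extends to lags of the other indicators, at the cost of losing one early year per country for each additional lag.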
Research shows that deep learning and recursive forecasting approaches are increasingly being used for long-term GDP forecasting. (arXiv)
Predicting GDP growth from socioeconomic indicators combines:
Economics
Data engineering
Machine learning
Time-series analysis
The World Bank dataset is one of the best sources for global economic modeling because it provides standardized indicators across decades and countries.
As you advance, you can improve your model using:
XGBoost
LightGBM
LSTMs
Panel data modeling
Time-series forecasting
Feature lagging
Country clustering
GDP forecasting is not just an academic exercise. It powers investment strategies, national policy planning, risk analysis, and global development forecasting.