How to Predict GDP Growth from Socioeconomic Indicators Using World Bank Data

May 24, 2026

Economic growth forecasting is one of the most valuable applications of machine learning in economics, finance, and public policy.

Governments, investors, development organizations, and businesses all rely on GDP growth forecasts to make strategic decisions.

In this tutorial, you will learn how to build a machine learning model that predicts GDP growth using socioeconomic indicators from the World Bank Open Data Platform.

We will use indicators such as:

Inflation
Unemployment
Population growth
Exports
Education enrollment
Foreign direct investment (FDI)
Internet penetration

The target variable is annual GDP growth, expressed as a percentage.

The World Bank provides over 16,000 indicators across hundreds of countries through its data platform and API. (World Bank Data Help Desk)

Why GDP Growth Prediction Matters

GDP growth measures how fast an economy expands or contracts over time.

The World Bank defines GDP growth as the annual percentage growth rate of GDP at market prices based on constant local currency. (World Bank Open Data)
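
As a quick illustration of that definition, the growth rate for a given year is (GDP this year / GDP last year - 1) × 100, computed on constant-price GDP. The sketch below uses invented numbers, and the `gdp_constant` series is a made-up name, not World Bank data:

```python
import pandas as pd

# Hypothetical constant-price GDP levels (local currency units), invented for illustration
gdp_constant = pd.Series(
    [100.0, 103.0, 101.97, 105.0291],
    index=[2019, 2020, 2021, 2022],
    name="gdp_constant",
)

# Annual percentage growth: (GDP_t / GDP_{t-1} - 1) * 100
# The first year has no predecessor, so its growth rate is NaN
gdp_growth = gdp_constant.pct_change() * 100

print(gdp_growth.round(2))
```

The economy in this toy series grows by 3%, contracts by 1%, then grows by 3% again.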

Economists use GDP growth forecasts to:

Evaluate economic stability
Predict recessions
Analyze investment opportunities
Compare country performance
Study the impact of policies

Machine learning helps identify nonlinear relationships between socioeconomic variables and economic performance.

Step 1: Install Required Libraries

pip install pandas numpy scikit-learn matplotlib seaborn wbdata

We will use:

pandas for data manipulation
scikit-learn for machine learning
wbdata to pull World Bank indicators

Step 2: Import Libraries

import pandas as pd
import numpy as np
import wbdata
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

Step 3: Select World Bank Indicators

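We will predict GDP growth using the indicators below. Foreign direct investment is left out of the code to keep the example short; if you want to include it, the World Bank code for net FDI inflows as a share of GDP is BX.KLT.DINV.WD.GD.ZS.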

Indicator	World Bank Code
GDP Growth	NY.GDP.MKTP.KD.ZG
Inflation	FP.CPI.TOTL.ZG
Unemployment	SL.UEM.TOTL.ZS
Population Growth	SP.POP.GROW
Internet Users	IT.NET.USER.ZS
Exports (% GDP)	NE.EXP.GNFS.ZS
School Enrollment	SE.SEC.ENRR

The GDP growth indicator is officially available through the World Bank database. (World Bank Open Data)

Step 4: Download Data from the World Bank

indicators = {
    'NY.GDP.MKTP.KD.ZG': 'gdp_growth',
    'FP.CPI.TOTL.ZG': 'inflation',
    'SL.UEM.TOTL.ZS': 'unemployment',
    'SP.POP.GROW': 'population_growth',
    'IT.NET.USER.ZS': 'internet_users',
    'NE.EXP.GNFS.ZS': 'exports',
    'SE.SEC.ENRR': 'school_enrollment'
}

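# Download every indicator for all economies and years (requires network access)
data = wbdata.get_dataframe(indicators)

# Move the country/date index into regular columns
df = data.reset_index()

print(df.head())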

This creates a dataset with one row per country and year and one column per indicator.
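
A request for all economies can also return regional and income-group aggregates (for example "World") alongside individual countries, and the year may arrive as text. The sketch below shows one way to tidy this on a small hand-made frame that mimics the layout of `df`. The values are invented, the `AGGREGATES` set is a hypothetical and incomplete list, and the assumption that `date` holds years as strings should be checked against your installed wbdata version:

```python
import pandas as pd

# Small stand-in for the frame produced by reset_index(): same layout, invented values
sample = pd.DataFrame({
    "country": ["Country A", "Country A", "World", "Country B", "Country B"],
    "date": ["2021", "2020", "2021", "2021", "2020"],
    "gdp_growth": [3.0, -1.0, 2.0, 4.0, -2.0],
    "inflation": [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Hypothetical, incomplete list of aggregate names to exclude; extend as needed
AGGREGATES = {"World", "High income", "Low income"}

# Keep individual countries only
tidy = sample[~sample["country"].isin(AGGREGATES)].copy()

# Convert the year from text to an integer so rows can be sorted and split by time
tidy["year"] = tidy["date"].astype(int)
tidy = (
    tidy.drop(columns="date")
    .sort_values(["country", "year"])
    .reset_index(drop=True)
)

print(tidy)
```

Note that a numeric `year` column will be picked up by `select_dtypes(include=np.number)` in Step 6, so drop it from the features if you do not want the model to use it.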

Step 5: Clean the Dataset

World Bank datasets often contain missing values.

df = df.dropna()

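Alternatively, skip dropna() and impute the missing values instead. Calling fillna after dropna has no effect, because no missing values remain. Be aware that this line also fills gaps in the target column, which you will usually want to drop rather than impute:

df = df.fillna(df.mean(numeric_only=True))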

The World Bank regularly updates datasets and methodologies, so cleaning is essential before modeling. (Data Topics)
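
Before choosing between dropping and imputing, it can help to look at how much of each column is missing. The sketch below does that on synthetic data and then imputes inside a scikit-learn pipeline, so the fill values are learned from the training rows only. This is a swapped-in alternative to the fillna line above, not the only way to do it; `demo`, `missing_share`, and `pipeline` are illustrative names, and median imputation is an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with gaps; values are random, not World Bank figures
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["inflation", "unemployment", "exports"],
)
demo["gdp_growth"] = 2.0 - 0.5 * demo["inflation"] + rng.normal(scale=0.5, size=200)
demo.loc[rng.random(200) < 0.2, "exports"] = np.nan  # blank out roughly 20% of one feature

# Share of missing values per column, largest first
missing_share = demo.isna().mean().sort_values(ascending=False)
print(missing_share)

# Do not impute the target: drop rows where it is missing, then split
demo = demo.dropna(subset=["gdp_growth"])
X_demo = demo.drop(columns="gdp_growth")
y_demo = demo["gdp_growth"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The imputer learns medians from the training rows and reuses them on the test rows
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestRegressor(n_estimators=50, random_state=42),
)
pipeline.fit(X_tr, y_tr)
print("Test R²:", pipeline.score(X_te, y_te))
```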

Step 6: Define Features and Target

Our target variable is GDP growth.

X = df.drop(columns=['gdp_growth'])

# Remove non-numeric columns
X = X.select_dtypes(include=np.number)

y = df['gdp_growth']

Step 7: Split the Data

We separate training and testing data.

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

A typical split is:

80% training
20% testing
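
A random split mixes years, so the model can be tested on years earlier than some it was trained on. If the goal is forecasting, one alternative is to hold out the most recent years. A minimal sketch, assuming a numeric `year` column exists; the `panel` frame and `cutoff_year` are hypothetical:

```python
import pandas as pd

# Hypothetical panel: 3 countries x 10 years, with a numeric year column
panel = pd.DataFrame(
    [(country, year) for country in ["A", "B", "C"] for year in range(2010, 2020)],
    columns=["country", "year"],
)
panel["gdp_growth"] = 0.0  # placeholder target; real values would come from the World Bank

# Hold out the most recent years instead of a random 20% of rows
cutoff_year = 2017  # assumption: chosen so that about 20% of the years land in the test set
train = panel[panel["year"] <= cutoff_year]
test = panel[panel["year"] > cutoff_year]

print(len(train), len(test))
```

With this toy panel, 24 rows (2010 to 2017) go to training and 6 rows (2018 and 2019) go to testing.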

Step 8: Train the Model

We will use a Random Forest Regressor.

model = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

model.fit(X_train, y_train)

Random Forest is a reasonable starting point because it can capture nonlinear relationships and interactions between indicators without requiring feature scaling.

Step 9: Make Predictions

predictions = model.predict(X_test)

Step 10: Evaluate the Model

We use:

MAE (Mean Absolute Error)
R² Score

mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print("MAE:", mae)
print("R²:", r2)

The R² metric measures how much variance the model explains.

An R² score closer to 1 indicates stronger predictive performance.
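
To make the two metrics concrete, here is a small worked example that computes them by hand with NumPy. The numbers are made up. R² is 1 minus the ratio of the residual sum of squares to the total sum of squares, so a model that always predicts the mean scores 0, and a model worse than that scores below 0:

```python
import numpy as np

# Made-up actual and predicted growth rates, in percentage points
y_true = np.array([2.0, 4.0, -1.0, 3.0])
y_pred = np.array([2.5, 3.0, 0.0, 3.5])

# MAE: average absolute gap between prediction and actual value
mae_manual = np.mean(np.abs(y_true - y_pred))

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Baseline that always predicts the mean of the actual values
baseline_pred = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline_pred) ** 2) / ss_tot

print("MAE:", mae_manual)                  # 0.75
print("R²:", round(r2_manual, 3))          # 0.821
print("Baseline R²:", r2_baseline)         # 0.0
```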

Step 11: Analyze Feature Importance

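Feature importance shows which indicators the model relies on most when predicting GDP growth. It reflects predictive association within the model, not proof that an indicator causes growth.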

importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
})

importance = importance.sort_values(
    by='Importance',
    ascending=False
)

print(importance)

You may find, for example, that:

Inflation ranks among the most important predictors of growth
Internet penetration is associated with higher productivity
Exports matter more for developing economies
Education levels are linked to long-term GDP performance
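
The impurity-based scores in `feature_importances_` are computed on the training data. One way to cross-check them is permutation importance on held-out data, which scikit-learn provides as `permutation_importance`. The sketch below uses synthetic data in which "inflation" is built to matter most and "noise" not at all; the column names and coefficients are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends strongly on inflation, weakly on exports, not on noise
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(300, 3)),
    columns=["inflation", "exports", "noise"],
)
y_demo = (
    -1.0 * X_demo["inflation"]
    + 0.3 * X_demo["exports"]
    + rng.normal(scale=0.1, size=300)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure how much the score drops
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
perm = pd.Series(result.importances_mean, index=X_demo.columns)
perm = perm.sort_values(ascending=False)

print(perm)
```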

Example Insights

A model trained on World Bank data might produce a ranking like the following (illustrative values, not results from an actual run):

Feature	Importance
Inflation	0.29
Exports	0.22
Internet Users	0.18
Population Growth	0.14
Unemployment	0.10
School Enrollment	0.07

These relationships vary by country and time period.

Common Challenges

1. Missing Data

Many countries have incomplete records.

2. Multicollinearity

Some indicators are highly correlated.
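
A quick way to spot such pairs is to scan the correlation matrix of the features. The sketch below uses synthetic columns, with "school_enrollment" deliberately built to track "internet_users"; the 0.8 threshold is an arbitrary assumption:

```python
import numpy as np
import pandas as pd

# Synthetic features; two of them are constructed to be strongly correlated
rng = np.random.default_rng(1)
base = rng.normal(size=500)
features = pd.DataFrame({
    "internet_users": base,
    "school_enrollment": base + rng.normal(scale=0.2, size=500),
    "inflation": rng.normal(size=500),
})

# Absolute pairwise correlations
corr = features.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Flag pairs above an arbitrary threshold
high = pairs[pairs > 0.8]
print(high)
```

Highly correlated features do not usually hurt a Random Forest's accuracy, but they can split importance between them and make the ranking harder to interpret.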

3. Time Dependency

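GDP growth depends heavily on historical trends, so rows from the same country in adjacent years are not independent, and a random train/test split can overstate accuracy.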

4. Economic Shocks

Pandemics, wars, and inflation crises can reduce prediction accuracy.

Research shows that deep learning and recursive forecasting approaches are increasingly being used for long-term GDP forecasting. (arXiv)

Predicting GDP growth from socioeconomic indicators combines:

Economics
Data engineering
Machine learning
Time-series analysis

The World Bank dataset is one of the best sources for global economic modeling because it provides standardized indicators across decades and countries.

As you advance, you can improve your model using:

XGBoost
LightGBM
LSTMs
Panel data modeling
Time-series forecasting
Feature lagging
Country clustering
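
As one example from this list, feature lagging adds the previous year's value of an indicator as a new column, computed separately for each country. A minimal sketch on a hypothetical panel with invented values; the `gdp_growth_lag1` column name is made up:

```python
import pandas as pd

# Hypothetical two-country panel with invented growth rates
panel = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021],
    "gdp_growth": [2.0, -3.0, 5.0, 1.0, -1.0, 4.0],
})

# Sort by time within each country, then shift by one year per country
panel = panel.sort_values(["country", "year"])
panel["gdp_growth_lag1"] = panel.groupby("country")["gdp_growth"].shift(1)

# The first year of each country has no previous value, so drop those rows
lagged = panel.dropna(subset=["gdp_growth_lag1"])

print(lagged)
```

Grouping by country before shifting keeps one country's last year from leaking into the next country's first year.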

GDP forecasting is not just an academic exercise. It powers investment strategies, national policy planning, risk analysis, and global development forecasting.
