How to Split Data Into Train and Test Sets — and Why It Matters Using World Bank GDP Data

Machine learning models do not become useful because they memorize data. They become useful because they can make accurate predictions on data they have never seen before



That is why splitting a dataset into training and testing sets is one of the most important steps in any machine learning workflow.

In this tutorial, we will use World Bank GDP data to learn how train-test splitting works and why it prevents misleading model results.

Why Train-Test Splitting Matters

When you train a model on a dataset, the model learns patterns from that data

If you evaluate the model using the exact same dataset, the performance metrics may look excellent even when the model performs poorly in the real world.

A train-test split solves this problem.

  • The training set teaches the model

  • The test set evaluates the model on unseen data

This helps measure whether the model can generalize.

You can think of it like studying for an exam:

  • Reading textbooks = training

  • Taking the final exam = testing

Without a test set, you are only measuring memory.

The core idea is:

Good ML Models Predict Unseen Data Well


Step 1: Import the Required Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error



Step 2: Upload the World Bank GDP Dataset

Download GDP data from:

World Bank Open Data

Upload the CSV file into Google Colab.

from google.colab import files

uploaded = files.upload()

Now load the dataset.

df = pd.read_csv('API_NY.GDP.PCAP.CD_DS2_en_csv_v2_121663.csv', header=4)

print(df.head())



Step 3: Select Features and Target Variable & Split the Dataset

Suppose we want to predict GDP using population and life expectancy.

Here:

  • X contains the input features

  • y contains the target variable we want to predict


We now divide the data into training and testing groups.

# Drop irrelevant columns
df_cleaned = df.drop(columns=['Indicator Name', 'Indicator Code', '2025', 'Unnamed: 70'], errors='ignore')

# Select feature (X) and target (y)
X = df_cleaned[['2023']]
y = df_cleaned['2024']

# Drop rows with NaN values in X or y for a clean split
data = pd.concat([X, y], axis=1).dropna()
X = data[['2023']]
y = data['2024']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")




This means:
  • 80% of the data is used for training

  • 20% is reserved for testing

The split is conceptually:

Dataset = 80% Training + 20% Testing

The random_state=42 ensures reproducibility. Every time you run the notebook, you get the same split.


Step 5: Train the Model

model = LinearRegression()

model.fit(X_train, y_train)

The model now learns relationships between GDP, population, and life expectancy.


Step 6: Make Predictions

predictions = model.predict(X_test)

print(predictions[:5])

Notice that predictions are made only on the test data the model has never seen before.



Step 7: Evaluate the Model

mae = mean_absolute_error(y_test, predictions)

print("Mean Absolute Error:", mae)

The Mean Absolute Error (MAE) measures how far predictions are from actual GDP values.

Smaller errors generally indicate better model performance.




What Happens If You Skip the Split?

If you train and test on the same GDP data:

  • The model may memorize country patterns

  • Accuracy scores may become misleading

  • Real-world performance may collapse

This problem is called overfitting.

An overfit model performs well on training data but poorly on unseen data.


The relationship looks like this:

High Training Accuracy ≠  High Real-World Accuracy



Best Practices for Train-Test Splitting

Use 80/20 for Most Projects

A common standard is:

  • 80% training

  • 20% testing

Always Shuffle the Data

Random splitting reduces bias.

Never Train on Test Data

The test set must remain unseen until evaluation.

Use random_state

This ensures reproducible experiments.


Final Thoughts

Train-test splitting is one of the foundations of reliable machine learning. Without it, models can appear accurate while failing completely in production environments.

Using World Bank GDP data makes this concept practical because it reflects real-world economic patterns across countries and regions.

As your machine learning projects become more advanced, proper evaluation practices become even more important than the model itself.


Advance Your Career With 16 Python Projects in Data & ML — All for $288.


Comments

Popular posts from this blog

How to Filter Rows Using Boolean Indexing in Pandas (Afrobarometer Kenya Dataset)

How to Decide Whether to Drop or Fill Missing Value

How to create your first line chart with World Bank Kenya GDP data