How to Split Data Into Train and Test Sets — and Why It Matters Using World Bank GDP Data
Machine learning models do not become useful because they memorize data. They become useful because they can make accurate predictions on data they have never seen before.
That is why splitting a dataset into training and testing sets is one of the most important steps in any machine learning workflow.
In this tutorial, we will use World Bank GDP data to learn how train-test splitting works and why it prevents misleading model results.
Why Train-Test Splitting Matters
When you train a model on a dataset, the model learns patterns from that data.
If you evaluate the model using the exact same dataset, the performance metrics may look excellent even when the model performs poorly in the real world.
A train-test split solves this problem.
The training set teaches the model
The test set evaluates the model on unseen data
This helps measure whether the model can generalize.
You can think of it like studying for an exam:
Reading textbooks = training
Taking the final exam = testing
Without a test set, you are only measuring memory.
The core idea is:
Good ML Models Predict Unseen Data Well
Step 1: Import the Required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
Step 2: Upload the World Bank GDP Dataset
Download GDP data from:
Upload the CSV file into Google Colab.
from google.colab import files
uploaded = files.upload()
Now load the dataset.
Step 3: Select Features and Target Variable & Split the Dataset
Suppose we want to predict GDP using population and life expectancy.
Here:
Xcontains the input featuresycontains the target variable we want to predict
We now divide the data into training and testing groups.
# Drop irrelevant columnsdf_cleaned = df.drop(columns=['Indicator Name', 'Indicator Code', '2025', 'Unnamed: 70'], errors='ignore')# Select feature (X) and target (y)X = df_cleaned[['2023']]y = df_cleaned['2024']# Drop rows with NaN values in X or y for a clean splitdata = pd.concat([X, y], axis=1).dropna()X = data[['2023']]y = data['2024']# Split the data into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)print(f"Shape of X_train: {X_train.shape}")print(f"Shape of X_test: {X_test.shape}")print(f"Shape of y_train: {y_train.shape}")print(f"Shape of y_test: {y_test.shape}")
This means:
80% of the data is used for training
20% is reserved for testing
The split is conceptually:
Dataset = 80% Training + 20% Testing
The random_state=42 ensures reproducibility. Every time you run the notebook, you get the same split.
Step 5: Train the Model
model = LinearRegression()
model.fit(X_train, y_train)
The model now learns relationships between GDP, population, and life expectancy.
Step 6: Make Predictions
predictions = model.predict(X_test)
print(predictions[:5])
Notice that predictions are made only on the test data the model has never seen before.
Step 7: Evaluate the Model
mae = mean_absolute_error(y_test, predictions)
print("Mean Absolute Error:", mae)
The Mean Absolute Error (MAE) measures how far predictions are from actual GDP values.
Smaller errors generally indicate better model performance.
What Happens If You Skip the Split?
If you train and test on the same GDP data:
The model may memorize country patterns
Accuracy scores may become misleading
Real-world performance may collapse
This problem is called overfitting.
An overfit model performs well on training data but poorly on unseen data.
The relationship looks like this:
High Training Accuracy ≠ High Real-World Accuracy
Best Practices for Train-Test Splitting
Use 80/20 for Most Projects
A common standard is:
80% training
20% testing
Always Shuffle the Data
Random splitting reduces bias.
Never Train on Test Data
The test set must remain unseen until evaluation.
Use random_state
This ensures reproducible experiments.
Final Thoughts
Train-test splitting is one of the foundations of reliable machine learning. Without it, models can appear accurate while failing completely in production environments.
Using World Bank GDP data makes this concept practical because it reflects real-world economic patterns across countries and regions.
As your machine learning projects become more advanced, proper evaluation practices become even more important than the model itself.
Advance Your Career With 16 Python Projects in Data & ML — All for $288.
Comments
Post a Comment