How to Plot Residuals to Diagnose a Regression Model



Many beginners stop after training a regression model and checking metrics like:

  • RMSE

  • MAE

But strong regression analysis goes further.

A model can have:

  • Good R²

  • Low MAE

  • Acceptable RMSE

and still be fundamentally broken.

This is why professional data scientists inspect residuals.


Residual plots help you diagnose:

  • Poor model fit

  • Non-linearity

  • Heteroscedasticity

  • Outliers

  • Overfitting (when training and test residuals are compared)

In this guide, you will learn how to plot and interpret residuals using Python, scikit-learn, and matplotlib.


What Are Residuals?

Residuals are the differences between:

  • Actual values

  • Predicted values

The formula is:

Residual = y − ŷ

Where:

  • y = actual value

  • ŷ = predicted value

Residuals tell you how wrong your model is for each prediction.
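
As a small worked example of the formula above, the snippet below computes residuals by hand. The three actual and predicted values are made up for illustration and do not come from a fitted model.

```python
# Hypothetical actual and predicted exam scores (illustrative values only).
actual = [30, 35, 45]
predicted = [32.0, 34.0, 41.5]

# Residual = actual - predicted, computed per observation.
residuals = [a - p for a, p in zip(actual, predicted)]

for a, p, r in zip(actual, predicted, residuals):
    print(f"actual={a}, predicted={p}, residual={r}")
```

A negative residual means the model predicted too high for that row; a positive residual means it predicted too low.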


Why Residuals Matter

Metrics summarize model performance into one number.

Residuals show where the model fails.

That distinction is critical.

Example:

  • RMSE may look acceptable

  • But residuals may reveal systematic errors


A model that consistently underpredicts high values can still have a decent R² score.

Residual analysis exposes these hidden issues.
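
To make this concrete, here is a minimal sketch using synthetic data (an assumption for illustration, not a real dataset). A straight line is fitted to a quadratic relationship with NumPy's polyfit: R² stays high, yet the residuals follow a clear pattern.

```python
import numpy as np

# Synthetic data (assumed for illustration): y grows quadratically with x.
x = np.arange(1, 21, dtype=float)
y = x ** 2

# Fit a straight line with ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"R^2 = {r2:.3f}")
print(f"Residual at lowest x:  {residuals[0]:.1f}")
print(f"Residual near middle:  {residuals[10]:.1f}")
print(f"Residual at highest x: {residuals[-1]:.1f}")
```

R² comes out above 0.9, but the residuals are positive at both ends and negative in the middle. The line underpredicts the highest values, which a single summary metric would not reveal.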


Goal of Residual Diagnostics

In a strong linear regression model, residuals should look:

  • Random

  • Patternless

  • Evenly distributed around zero

If patterns appear, your model assumptions are likely violated.


Step 1: Import Libraries

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression



Step 2: Create Sample Data

We will predict exam scores from study hours.

data = {
    "Hours_Studied": [1,2,3,4,5,6,7,8,9,10],
    "Exam_Score": [30,35,45,50,60,65,70,78,85,90]
}

df = pd.DataFrame(data)




Step 3: Define Features and Target

X = df[["Hours_Studied"]]
y = df["Exam_Score"]




Step 4: Split the Data

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)



Step 5: Train the Regression Model

model = LinearRegression()

model.fit(X_train, y_train)



Step 6: Make Predictions

predictions = model.predict(X_test)

Step 7: Calculate Residuals

Residuals are:

Actual − Predicted

residuals = y_test - predictions



Step 8: Plot Residuals

Now create the residual plot.

plt.scatter(predictions, residuals)

plt.axhline(y=0)

plt.xlabel("Predicted Values")
plt.ylabel("Residuals")

plt.title("Residual Plot")

plt.show()


This is the most common residual diagnostic plot.

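N.B. This dataset is very small: with 10 rows and test_size=0.2, the test set holds only two points, so the plot above shows just two residuals. Use a dataset with more observations to see residual patterns clearly.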
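
As a rough illustration of the same workflow on a larger sample, the sketch below generates 200 synthetic rows. The intercept (25), slope (6.5), and noise level (5) are assumptions chosen to resemble the study-hours example; they are not estimated from real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumed for illustration): 200 students,
# score = 25 + 6.5 * hours + random noise.
rng = np.random.default_rng(42)
hours = rng.uniform(1, 10, size=200)
scores = 25 + 6.5 * hours + rng.normal(0, 5, size=200)

X = hours.reshape(-1, 1)  # scikit-learn expects a 2D feature array
y = scores

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
residuals = y_test - predictions

print("Test points:", len(residuals))
print("Mean residual:", round(float(residuals.mean()), 2))
```

Passing these predictions and residuals to the same plt.scatter call from Step 8 should give a plot with 40 points instead of 2, which makes random scatter (or a pattern) much easier to judge.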


How to Interpret the Residual Plot

Good Residual Plot

A healthy regression model shows:

  • Random scatter

  • No visible patterns

  • Equal spread around zero

Example interpretation:

“The model errors appear random, suggesting the linear relationship is appropriate.”


Bad Residual Plot Patterns

Residual plots become powerful when diagnosing problems.



Problem 1: Curved Pattern

If residuals form a curve:

  • The relationship may not be linear

  • Linear regression is underfitting

Example:

  • Sales growth accelerating exponentially

  • Population growth trends

  • Compound growth systems

The model may need:

  • Polynomial features

  • Log transformation

  • Non-linear algorithms
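
The polynomial-features option can be sketched as follows. The data is synthetic (y = 0.5x² plus noise, an assumption for illustration), and the degree of 2 is chosen because the curve is known here; on real data the right degree has to be found by inspection or validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data (assumed for illustration): y = 0.5 * x^2 + noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 0.5 * x ** 2 + rng.normal(0, 1, size=x.size)
X = x.reshape(-1, 1)

# Straight-line fit: residuals should show a U-shaped curve.
linear = LinearRegression().fit(X, y)
linear_residuals = y - linear.predict(X)

# Same model with an added squared term.
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)
quadratic_residuals = y - quadratic.predict(X)

print("Residual spread, linear:   ", round(float(linear_residuals.std()), 2))
print("Residual spread, quadratic:", round(float(quadratic_residuals.std()), 2))
```

If the squared term captures the curvature, the quadratic residuals should be noticeably tighter, and a residual plot of them should no longer show a curve.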


Problem 2: Funnel Shape

If residual spread grows wider at higher predictions:

  • Variance is inconsistent

  • This is called heteroscedasticity

Example:

  • Predicting luxury house prices

  • Predicting startup revenue

  • Financial forecasting

Large-value observations may contain larger errors.
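
One rough way to put a number on a funnel shape is to compare the residual spread for low and high predictions. The sketch below uses synthetic data whose noise grows with x (an assumption for illustration). It is an informal check, not a formal statistical test.

```python
import numpy as np

# Synthetic data (assumed for illustration): noise grows with x.
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=400)
y = 3 * x + rng.normal(0, 0.5 * x)  # noise std is proportional to x

slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted

# Split observations into the lower and upper half of predicted values.
order = np.argsort(predicted)
half = len(order) // 2
spread_low = residuals[order[:half]].std()
spread_high = residuals[order[half:]].std()

print(f"Residual std, low predictions:  {spread_low:.2f}")
print(f"Residual std, high predictions: {spread_high:.2f}")
print(f"Ratio: {spread_high / spread_low:.2f}")
```

A ratio well above 1 suggests the errors widen as predictions grow, which matches the funnel seen in the plot.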


Problem 3: Clusters

If residuals appear grouped:

  • Important variables may be missing

  • Hidden categories may exist

Example:

  • Different customer types

  • Different countries

  • Different economic conditions
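
A minimal sketch of how a hidden category shows up in residuals, using synthetic data with two made-up customer types (all names and values here are assumptions for illustration). The model is fitted on x alone, then residuals are averaged per group.

```python
import numpy as np

# Synthetic data (assumed for illustration): two hidden customer types,
# where type 1 has a baseline that is 10 units higher than type 0.
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 10, size=n)
customer_type = rng.integers(0, 2, size=n)  # 0 or 1, hidden from the model
y = 3 * x + 10 * customer_type + rng.normal(0, 1, size=n)

# Fit on x only, ignoring the category.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Average residual per hidden group.
mean_type0 = residuals[customer_type == 0].mean()
mean_type1 = residuals[customer_type == 1].mean()

print(f"Mean residual, type 0: {mean_type0:.2f}")
print(f"Mean residual, type 1: {mean_type1:.2f}")
```

The two groups land on opposite sides of zero, which is what produces two separate bands in the residual plot. Adding the category as a feature would be the natural next step.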


Problem 4: Large Outliers

Single points far away from the rest indicate:

  • Outliers

  • Data quality problems

  • Rare events

These can heavily distort regression models.
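
Outliers can also be flagged numerically. The sketch below injects one corrupted value into synthetic data (an assumption for illustration) and marks residuals more than 3 standard deviations from the mean. The cutoff of 3 is a common rule of thumb, not a fixed standard.

```python
import numpy as np

# Synthetic data (assumed for illustration) with one injected outlier.
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2 * x + rng.normal(0, 1, size=x.size)
y[10] += 50  # corrupt a single observation

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Standardize residuals and flag those beyond 3 standard deviations.
# The cutoff of 3 is a rule of thumb, not a fixed standard.
z_scores = (residuals - residuals.mean()) / residuals.std()
flagged = np.where(np.abs(z_scores) > 3)[0]

print("Flagged row indices:", flagged.tolist())
```

Flagged rows are candidates for a closer look, not automatic deletions: they may be data errors or genuine rare events.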


Why Zero Matters

The horizontal zero line is critical.

plt.axhline(y=0)

If residuals are centered around zero:

  • Predictions are unbiased on average

If most residuals stay above zero:

  • The model systematically underpredicts (actual values exceed predictions)

If most residuals stay below zero:

  • The model systematically overpredicts
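
Because residuals are computed as actual minus predicted, the direction of bias can be read from their sign. The sketch below checks this numerically with made-up values (illustrative only, not from a fitted model).

```python
import numpy as np

# Hypothetical values (illustrative only): every prediction is too low.
actual = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
predicted = np.array([48.0, 57.0, 66.0, 75.0, 84.0])

residuals = actual - predicted  # same sign convention as Step 7

mean_residual = residuals.mean()
share_above_zero = (residuals > 0).mean()

print("Mean residual:", mean_residual)
print("Share of residuals above zero:", share_above_zero)

if mean_residual > 0:
    print("On average the model underpredicts.")
elif mean_residual < 0:
    print("On average the model overpredicts.")
```

Here every residual is positive, so the points would all sit above the zero line in the plot.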


Residuals vs Predicted Values

The most common plot uses:

  • X-axis = predicted values

  • Y-axis = residuals

This helps detect:

  • Error growth

  • Systematic bias

  • Model instability


Common Beginner Mistakes

1. Looking Only at R²

A high R² does not guarantee a valid regression model.

Residuals may still reveal serious issues.


2. Ignoring Outliers

Extreme observations can dominate linear regression behavior.

Always inspect residual plots for anomalies.


3. Assuming Linear Regression Always Fits

Many real-world relationships are non-linear.

Residual plots help you detect this quickly.


When Residual Analysis Becomes Essential

Residual diagnostics are especially important in:

  • Finance

  • Healthcare

  • Economic forecasting

  • Supply chain optimization

  • Real estate valuation

In high-stakes environments, metrics alone are not enough.



Residual plots are one of the most powerful tools in regression analysis.

They help you move beyond:

  • “The model has good accuracy”


to deeper questions like:

  • Is the model biased?

  • Is the relationship truly linear?

  • Are errors stable?

  • Are outliers distorting predictions?


Professional machine learning workflows routinely include residual diagnostics, because understanding why a model fails is just as important as measuring how much it fails.


