How to Plot Residuals to Diagnose a Regression Model

May 21, 2026

Many beginners stop after training a regression model and checking metrics like:

RMSE
MAE
R²

But strong regression analysis goes further.

A model can have:

Good R²
Low MAE
Acceptable RMSE

and still be fundamentally broken.

This is why professional data scientists inspect residuals.

Residual plots help you diagnose:

Poor model fit
Non-linearity
Heteroscedasticity
Outliers
Overfitting problems

In this guide, you will learn how to plot and interpret residuals using Python, scikit-learn, and matplotlib.

What Are Residuals?

Residuals are the differences between:

Actual values
Predicted values

The formula is:

\[ \text{Residual} = y - \hat{y} \]

Where:

\(y\) = actual value
\(\hat{y}\) = predicted value

Residuals tell you how wrong your model is for each prediction.
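
As a quick illustration of the formula, the sketch below computes residuals for three made-up actual/predicted pairs. The numbers are hypothetical and are not taken from the dataset used later in this guide.

```python
import numpy as np

# Hypothetical actual and predicted values, chosen only for illustration
y_actual = np.array([50.0, 60.0, 70.0])
y_predicted = np.array([48.0, 63.0, 70.0])

# Residual = actual - predicted
residuals = y_actual - y_predicted

# Positive residual: the model underpredicted.
# Negative residual: the model overpredicted.
print(residuals)
```

With actual minus predicted as the convention, the sign of each residual tells you the direction of the error, not just its size.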

Why Residuals Matter

Metrics summarize model performance into one number.

Residuals show where the model fails.

That distinction is critical.

Example:

RMSE may look acceptable
But residuals may reveal systematic errors

A model that consistently underpredicts high values can still have a decent R² score.

Residual analysis exposes these hidden issues.
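
To make this concrete, here is a minimal sketch on synthetic data (an assumption for illustration, not a real dataset). A straight line is fitted to a perfectly quadratic relationship, and for simplicity the residuals are computed on the same data the model was fitted on. The R² comes out above 0.9, yet the residuals are negative in the middle of the range and positive at the top.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic, noise-free quadratic relationship (assumption for illustration)
x_demo = np.linspace(0, 10, 50)
y_demo = x_demo ** 2
X_demo = x_demo.reshape(-1, 1)

# Fit a straight line to curved data
demo_model = LinearRegression().fit(X_demo, y_demo)
demo_predictions = demo_model.predict(X_demo)
demo_residuals = y_demo - demo_predictions

# The single-number summary looks fine...
demo_r2 = r2_score(y_demo, demo_predictions)
print(f"R²: {demo_r2:.3f}")

# ...but the errors are systematic, not random
middle_mean = demo_residuals[20:30].mean()  # middle of the x range
top_mean = demo_residuals[-5:].mean()       # highest x values
print(f"Mean residual, middle of range: {middle_mean:.2f}")
print(f"Mean residual, top of range: {top_mean:.2f}")
```

The positive residuals at the top of the range mean the line underpredicts the highest values, which is exactly the kind of failure a single metric hides.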

Goal of Residual Diagnostics

In a strong linear regression model, residuals should look:

Random
Patternless
Evenly distributed around zero

If patterns appear, your model assumptions are likely violated.

Step 1: Import Libraries

import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Step 2: Create Sample Data

We will predict exam scores from study hours.

data = {
    "Hours_Studied": [1,2,3,4,5,6,7,8,9,10],
    "Exam_Score": [30,35,45,50,60,65,70,78,85,90]
}

df = pd.DataFrame(data)

Step 3: Define Features and Target

X = df[["Hours_Studied"]]
y = df["Exam_Score"]

Step 4: Split the Data

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

Step 5: Train the Regression Model

model = LinearRegression()

model.fit(X_train, y_train)

Step 6: Make Predictions

predictions = model.predict(X_test)

Step 7: Calculate Residuals

Residuals are:

Actual − Predicted

residuals = y_test - predictions

Step 8: Plot Residuals

Now create the residual plot.

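# Scatter each test-set residual against its predicted value
plt.scatter(predictions, residuals)

# Reference line at zero: points above it were underpredicted, points below were overpredicted
plt.axhline(y=0)

plt.xlabel("Predicted Values")
plt.ylabel("Residuals")

plt.title("Residual Plot")

plt.show()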

This is the most common residual diagnostic plot.

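N.B. The sample dataset is very small. With 10 rows and test_size=0.2, the test set holds only 2 observations, so this plot shows just two points. Use a dataset with many more observations to see meaningful residual patterns.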
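
For a plot with enough points to read, one option is to repeat the same steps on a larger synthetic dataset. The sketch below is an assumption for illustration: the 200 rows, the slope of 6.5 points per hour, and the noise level are invented, and the `_large` variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data (assumption): 200 students, linear trend plus random noise
hours = rng.uniform(1, 10, size=200)
scores = 25 + 6.5 * hours + rng.normal(0, 4, size=200)
df_large = pd.DataFrame({"Hours_Studied": hours, "Exam_Score": scores})

# Same split settings as in the guide; 20% of 200 rows gives 40 test rows
X_train_large, X_test_large, y_train_large, y_test_large = train_test_split(
    df_large[["Hours_Studied"]],
    df_large["Exam_Score"],
    test_size=0.2,
    random_state=42,
)

model_large = LinearRegression().fit(X_train_large, y_train_large)
predictions_large = model_large.predict(X_test_large)
residuals_large = y_test_large - predictions_large

# Residuals vs predicted values, with a dashed reference line at zero
fig, ax = plt.subplots()
ax.scatter(predictions_large, residuals_large)
ax.axhline(y=0, color="gray", linestyle="--")
ax.set_xlabel("Predicted Values")
ax.set_ylabel("Residuals")
ax.set_title("Residual Plot (synthetic data, 40 test points)")
```

Call plt.show() afterwards to display the figure. Because the synthetic relationship really is linear with constant noise, the points should scatter randomly around the zero line.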

How to Interpret the Residual Plot

Good Residual Plot

A healthy regression model shows:

Random scatter
No visible patterns
Equal spread around zero

Example interpretation:

“The model errors appear random, suggesting the linear relationship is appropriate.”

Bad Residual Plot Patterns

Residual plots become powerful when diagnosing problems.

Problem 1: Curved Pattern

If residuals form a curve:

The relationship may not be linear
Linear regression is underfitting

Example:

Sales growth accelerating exponentially
Population growth trends
Compound growth systems

The model may need:

Polynomial features
Log transformation
Non-linear algorithms
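
A minimal sketch of the polynomial-features option, on synthetic quadratic data (an assumption for illustration). It fits a straight line and a degree-2 polynomial to the same data, then measures how strongly each set of residuals still follows a curve. The curvature check used here (correlating residuals with the squared distance from the centre of x) is one simple choice, not a standard named test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic curved relationship with noise (assumption for illustration)
x_curve = np.linspace(0, 10, 100)
X_curve = x_curve.reshape(-1, 1)
y_curve = 2 + 0.5 * x_curve**2 + rng.normal(0, 1, size=x_curve.size)

# Straight-line fit
linear_model = LinearRegression().fit(X_curve, y_curve)
linear_residuals = y_curve - linear_model.predict(X_curve)

# Same estimator, but with x and x^2 as features
quadratic_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
quadratic_model.fit(X_curve, y_curve)
quadratic_residuals = y_curve - quadratic_model.predict(X_curve)

# Curvature check: do the residuals rise toward both ends of the x range?
curvature = (x_curve - x_curve.mean()) ** 2
linear_curve_corr = np.corrcoef(linear_residuals, curvature)[0, 1]
quadratic_curve_corr = np.corrcoef(quadratic_residuals, curvature)[0, 1]

print(f"Linear fit, residual curvature correlation: {linear_curve_corr:.2f}")
print(f"Quadratic fit, residual curvature correlation: {quadratic_curve_corr:.2f}")
```

A correlation near 1 for the linear fit reflects the U-shaped residuals; after adding the squared term the correlation drops to essentially zero.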

Problem 2: Funnel Shape

If residual spread grows wider at higher predictions:

Variance is inconsistent
This is called heteroscedasticity

Example:

Predicting luxury house prices
Predicting startup revenue
Financial forecasting

Large-value observations may contain larger errors.
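
One simple way to check for a funnel numerically is to compare the residual spread for low and high predictions. The sketch below uses synthetic data in which the noise is deliberately made proportional to x (an assumption for illustration); splitting at the median prediction is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data (assumption): noise standard deviation grows with x
x_funnel = rng.uniform(1, 10, size=500)
y_funnel = 10 * x_funnel + rng.normal(0, 1, size=500) * x_funnel
X_funnel = x_funnel.reshape(-1, 1)

funnel_model = LinearRegression().fit(X_funnel, y_funnel)
funnel_predictions = funnel_model.predict(X_funnel)
funnel_residuals = y_funnel - funnel_predictions

# Compare residual spread below and above the median prediction
median_prediction = np.median(funnel_predictions)
low_spread = funnel_residuals[funnel_predictions <= median_prediction].std()
high_spread = funnel_residuals[funnel_predictions > median_prediction].std()
spread_ratio = high_spread / low_spread

print(f"Residual std, low predictions:  {low_spread:.2f}")
print(f"Residual std, high predictions: {high_spread:.2f}")
print(f"Ratio (high / low): {spread_ratio:.2f}")
```

A ratio well above 1 is the numeric counterpart of the funnel shape: errors are larger where predictions are larger.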

Problem 3: Clusters

If residuals appear grouped:

Important variables may be missing
Hidden categories may exist

Example:

Different customer types
Different countries
Different economic conditions
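
If you suspect a hidden category, averaging the residuals per group is a quick check. In this sketch the data and the two customer types "A" and "B" are hypothetical: type B scores 10 units higher on average, but the model only sees x.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n_rows = 200

# Hypothetical hidden category: customer types A and B, alternating rows
customer_type = np.tile(["A", "B"], n_rows // 2)
x_group = rng.uniform(0, 10, size=n_rows)

# Type B gets a +10 offset that the model is never told about
offset = np.where(customer_type == "B", 10.0, 0.0)
y_group = 3 * x_group + offset + rng.normal(0, 1, size=n_rows)

# Fit on x only, ignoring the category
X_group = x_group.reshape(-1, 1)
group_model = LinearRegression().fit(X_group, y_group)
group_residuals = y_group - group_model.predict(X_group)

# Mean residual per category reveals the two clusters
residuals_by_type = (
    pd.DataFrame({"customer_type": customer_type, "residual": group_residuals})
    .groupby("customer_type")["residual"]
    .mean()
)
print(residuals_by_type)
```

The model overpredicts type A (negative mean residual) and underpredicts type B (positive mean residual), which suggests adding the category as a feature.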

Problem 4: Large Outliers

Single points far away from the rest indicate:

Outliers
Data quality problems
Rare events

These can heavily distort regression models.
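
A rough way to flag such points is to scale the residuals by their standard deviation and mark anything beyond a threshold. The sketch below uses synthetic data with one injected outlier; the threshold of 3 standard deviations is a common rule of thumb assumed here, not a rule from this guide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Synthetic linear data (assumption) with one artificially corrupted point
x_out = np.arange(50, dtype=float)
y_out = 2 * x_out + 1 + rng.normal(0, 1, size=50)
y_out[25] += 30  # inject a single large outlier at index 25

X_out = x_out.reshape(-1, 1)
outlier_model = LinearRegression().fit(X_out, y_out)
outlier_residuals = y_out - outlier_model.predict(X_out)

# Scale residuals by their standard deviation (a simple approximation,
# not full studentized residuals)
scaled_residuals = outlier_residuals / outlier_residuals.std()

# Flag points more than 3 standard deviations from zero
outlier_idx = np.flatnonzero(np.abs(scaled_residuals) > 3)
print("Flagged indices:", outlier_idx)
```

Flagged points deserve a closer look before you decide whether to correct, keep, or remove them.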

Why Zero Matters

The horizontal zero line is critical.

plt.axhline(y=0)

If residuals are centered around zero:

Predictions are unbiased on average

If most residuals stay above or below zero:

The model systematically underpredicts (residuals mostly above zero) or overpredicts (residuals mostly below zero)
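
You can back up the visual check with two numbers: the mean residual and the share of residuals above zero. The sketch below simulates a hypothetical model whose predictions run about 5 points too low; the values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical actual values and predictions that are about 5 points too low
y_true = rng.uniform(30, 90, size=100)
y_pred = y_true - 5 + rng.normal(0, 2, size=100)

# Residual = actual - predicted
bias_residuals = y_true - y_pred

mean_residual = bias_residuals.mean()
share_positive = (bias_residuals > 0).mean()

print(f"Mean residual: {mean_residual:.2f}")
print(f"Share of residuals above zero: {share_positive:.0%}")
```

A mean residual far from zero, with almost all residuals on one side of the line, points to systematic underprediction in this case.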

Residuals vs Predicted Values

The most common plot uses:

X-axis = predicted values
Y-axis = residuals

This helps detect:

Error growth
Systematic bias
Model instability

Common Beginner Mistakes

1. Looking Only at R²

A high R² does not guarantee a valid regression model.

Residuals may still reveal serious issues.

2. Ignoring Outliers

Extreme observations can dominate linear regression behavior.

Always inspect residual plots for anomalies.

3. Assuming Linear Regression Always Fits

Many real-world relationships are non-linear.

Residual plots help you detect this quickly.

When Residual Analysis Becomes Essential

Residual diagnostics are especially important in:

Finance
Healthcare
Economic forecasting
Supply chain optimization
Real estate valuation

In high-stakes environments, metrics alone are not enough.

Residual plots are one of the most powerful tools in regression analysis.

They help you move beyond:

“The model has good accuracy”

to deeper questions like:

Is the model biased?
Is the relationship truly linear?
Are errors stable?
Are outliers distorting predictions?

Professional machine learning workflows routinely include residual diagnostics, because understanding why a model fails is just as important as measuring how much it fails.
