How to Plot Residuals to Diagnose a Regression Model
Many beginners stop after training a regression model and checking metrics like:
RMSE
MAE
R²
But strong regression analysis goes further.
A model can have:
Good R²
Low MAE
Acceptable RMSE
and still be fundamentally broken.
This is why professional data scientists inspect residuals.
Residual plots help you diagnose:
Poor model fit
Non-linearity
Heteroscedasticity
Outliers
Overfitting (when training and test residuals are compared)
In this guide, you will learn how to plot and interpret residuals using Python, scikit-learn, and matplotlib.
What Are Residuals?
Residuals are the differences between:
Actual values
Predicted values
The formula is:
\[ \text{Residual} = y - \hat{y} \]
Where:
\(y\) = actual value
\(\hat{y}\) = predicted value
Residuals tell you how wrong your model is for each prediction.
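As a quick illustration of the formula, the sketch below computes residuals for three made-up exam scores. The numbers are hypothetical and chosen only to show a positive, a negative, and a zero residual.

```python
import numpy as np

# Hypothetical actual and predicted exam scores (illustrative values only)
actual = np.array([50.0, 60.0, 70.0])
predicted = np.array([48.0, 63.0, 70.0])

# Residual = actual - predicted
residuals = actual - predicted

# Positive: the model predicted too low. Negative: it predicted too high.
print(residuals.tolist())  # [2.0, -3.0, 0.0]
```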
Why Residuals Matter
Metrics summarize model performance into one number.
Residuals show where the model fails.
That distinction is critical.
Example:
RMSE may look acceptable
But residuals may reveal systematic errors
A model that consistently underpredicts high values can still have a decent R² score.
Residual analysis exposes these hidden issues.
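To make this concrete, here is a minimal sketch that fits a straight line to deliberately curved data. The data (y = x² on the integers 0 to 10) is synthetic and not part of the exam-score example used later; it is only meant to show a high R² sitting alongside a clear residual pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic, noise-free curved data (an assumption for illustration)
x = np.arange(0, 11, dtype=float)
y = x ** 2
X = x.reshape(-1, 1)

# Fit a straight line to a relationship that is not linear
model = LinearRegression().fit(X, y)
predicted = model.predict(X)
residuals = y - predicted

r2 = r2_score(y, predicted)
print(round(r2, 2))  # roughly 0.93, which looks fine on its own

# The residuals are positive at both ends and negative in the middle:
# a U shape that the single R² number hides.
print(np.round(residuals, 1))
```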
Goal of Residual Diagnostics
In a strong linear regression model, residuals should look:
Random
Patternless
Evenly distributed around zero
If patterns appear, your model assumptions are likely violated.
Step 1: Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Step 2: Create Sample Data
We will predict exam scores from study hours.
data = {
"Hours_Studied": [1,2,3,4,5,6,7,8,9,10],
"Exam_Score": [30,35,45,50,60,65,70,78,85,90]
}
df = pd.DataFrame(data)
Step 3: Define Features and Target
X = df[["Hours_Studied"]]
y = df["Exam_Score"]
Step 4: Split the Data
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42
)
Step 5: Train the Regression Model
model = LinearRegression()
model.fit(X_train, y_train)
Step 6: Make Predictions
predictions = model.predict(X_test)
Step 7: Calculate Residuals
Residuals are:
Actual − Predicted
residuals = y_test - predictions
Step 8: Plot Residuals
Now create the residual plot.
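# Predicted values on the x-axis, residuals on the y-axis
plt.scatter(predictions, residuals)
# Horizontal reference line at zero error
plt.axhline(y=0)
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()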
This is the most common residual diagnostic plot.
Note: This sample has only 10 rows, so a 20% test split leaves just two points in the plot. Use a larger dataset to see the patterns described below.
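One way to get a more informative plot is to repeat the same steps on a larger sample. The sketch below generates 200 synthetic rows; the intercept, slope, and noise level are assumptions picked to loosely resemble the exam-score data, not values taken from it.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: an assumed linear relation plus random noise
rng = np.random.default_rng(42)
hours = rng.uniform(1, 10, size=200)
scores = 25 + 6.5 * hours + rng.normal(0, 4, size=200)

X = hours.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, scores, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)
residuals = y_test - predictions  # 40 test residuals instead of 2

plt.scatter(predictions, residuals)
plt.axhline(y=0)
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot (200 synthetic rows)")
plt.show()
```

Because the synthetic relation really is linear with constant noise, the points should scatter evenly around the zero line.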
How to Interpret the Residual Plot
Good Residual Plot
A healthy regression model shows:
Random scatter
No visible patterns
Equal spread around zero
Example interpretation:
“The model errors appear random, suggesting the linear relationship is appropriate.”
Bad Residual Plot Patterns
Residual plots become powerful when diagnosing problems.
Problem 1: Curved Pattern
If residuals form a curve:
The relationship may not be linear
Linear regression is underfitting
Example:
Sales growth accelerating exponentially
Population growth trends
Compound growth systems
The model may need:
Polynomial features
Log transformation
Non-linear algorithms
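As a sketch of the first option, the code below compares a plain linear fit with a degree-2 polynomial fit on synthetic curved data. The data-generating formula and the choice of degree 2 are assumptions for illustration; real data may call for a different transformation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data (assumed quadratic relation plus noise)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3 + 0.5 * x ** 2 + rng.normal(0, 1, size=x.size)
X = x.reshape(-1, 1)

# Straight-line model versus a model with an added squared term
linear = LinearRegression().fit(X, y)
quadratic = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression()
).fit(X, y)

linear_resid = y - linear.predict(X)
quad_resid = y - quadratic.predict(X)

# The quadratic residuals should be much tighter and show no curve
print(np.std(linear_resid), np.std(quad_resid))
```

Plotting `quad_resid` against the quadratic model's predictions should show the curve replaced by random scatter.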
Problem 2: Funnel Shape
If residual spread grows wider at higher predictions:
Variance is inconsistent
This is called heteroscedasticity
Example:
Predicting luxury house prices
Predicting startup revenue
Financial forecasting
Large-value observations may contain larger errors.
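A funnel can also be checked numerically. The sketch below simulates data whose noise grows with the input, then compares the residual spread in the lower and upper halves of the predictions. Splitting at the median is an informal heuristic chosen for this example, not a formal statistical test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data where the noise standard deviation grows with x (assumption)
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=500)
y = 10 * x + rng.normal(0, 1, size=500) * x
X = x.reshape(-1, 1)

model = LinearRegression().fit(X, y)
predictions = model.predict(X)
residuals = y - predictions

# Compare residual spread below and above the median prediction
median_pred = np.median(predictions)
low_spread = residuals[predictions <= median_pred].std()
high_spread = residuals[predictions > median_pred].std()

print(low_spread, high_spread, high_spread / low_spread)
```

A ratio well above 1 is a numeric hint of the funnel shape; the scatter plot remains the primary check.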
Problem 3: Clusters
If residuals appear grouped:
Important variables may be missing
Hidden categories may exist
Example:
Different customer types
Different countries
Different economic conditions
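The sketch below shows how a hidden category can produce grouped residuals. It simulates two groups with different baselines, fits a model that ignores the group, and then averages the residuals per group. The `group` variable and the size of the baseline gap are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a hidden two-level category (assumption)
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 10, size=n)
group = rng.integers(0, 2, size=n)  # 0 or 1, e.g. two customer types
y = 5 + 2 * x + 8 * group + rng.normal(0, 1, size=n)

# The model only sees x, not the group
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residuals separate into two bands, one per group
mean_0 = residuals[group == 0].mean()
mean_1 = residuals[group == 1].mean()
print(mean_0, mean_1)
```

When group means of the residuals differ this clearly, adding the category as a feature is a natural next step.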
Problem 4: Large Outliers
Single points far away from the rest indicate:
Outliers
Data quality problems
Rare events
These can heavily distort regression models.
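A simple way to flag such points is to standardize the residuals and look for large values. The sketch below injects one artificial outlier into synthetic data and flags anything more than 3 standard deviations from the mean residual. The threshold of 3 is a common rule of thumb, used here as an assumption rather than a fixed standard.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic linear data with one injected outlier (assumption)
rng = np.random.default_rng(3)
x = np.linspace(1, 10, 50)
y = 20 + 7 * x + rng.normal(0, 2, size=x.size)
y[25] += 40  # artificial outlier at index 25

X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Standardize the residuals and flag the extreme ones
z_scores = (residuals - residuals.mean()) / residuals.std()
outlier_idx = np.flatnonzero(np.abs(z_scores) > 3)

print(outlier_idx.tolist())  # [25]
```

Flagged points deserve a closer look before any decision to drop them, since they may be data errors or genuine rare events.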
Why Zero Matters
The horizontal zero line is critical.
plt.axhline(y=0)
If residuals are centered around zero:
Predictions are unbiased on average
If most residuals sit above zero (with residual = actual − predicted):
The model systematically underpredicts
If most residuals sit below zero:
The model systematically overpredicts
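The direction of the bias can be read from the sign of the average residual. The helper below is a hypothetical sketch: `describe_bias` and its `tolerance` parameter are names invented for this example, and the tolerance should be set relative to the scale of your target. It assumes residual = actual − predicted.

```python
import numpy as np

def describe_bias(residuals, tolerance=0.5):
    """Summarize the average error direction (residual = actual - predicted)."""
    mean_residual = float(np.mean(residuals))
    if mean_residual > tolerance:
        # Actual values tend to exceed predictions
        return "underpredicting"
    if mean_residual < -tolerance:
        # Predictions tend to exceed actual values
        return "overpredicting"
    return "roughly centered"

print(describe_bias([2.0, 3.5, 1.0, 4.0]))    # underpredicting
print(describe_bias([-3.0, -1.5, -2.0]))      # overpredicting
print(describe_bias([1.0, -1.2, 0.3, -0.2]))  # roughly centered
```

This check is most useful on test data: an ordinary least squares fit with an intercept has training residuals that average to zero by construction.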
Residuals vs Predicted Values
The most common plot uses:
X-axis = predicted values
Y-axis = residuals
This helps detect:
Error growth
Systematic bias
Model instability
Common Beginner Mistakes
1. Looking Only at R²
A high R² does not guarantee a valid regression model.
Residuals may still reveal serious issues.
2. Ignoring Outliers
Extreme observations can dominate linear regression behavior.
Always inspect residual plots for anomalies.
3. Assuming Linear Regression Always Fits
Many real-world relationships are non-linear.
Residual plots help you detect this quickly.
When Residual Analysis Becomes Essential
Residual diagnostics are especially important in:
Finance
Healthcare
Economic forecasting
Supply chain optimization
Real estate valuation
In high-stakes environments, metrics alone are not enough.
Residual plots are one of the most powerful tools in regression analysis.
They help you move beyond:
“The model has good accuracy”
to deeper questions like:
Is the model biased?
Is the relationship truly linear?
Are errors stable?
Are outliers distorting predictions?
Careful regression workflows include residual diagnostics because understanding where and why a model fails is just as important as measuring how much it fails.