How to Build Your First Linear Regression Model With scikit-learn

May 21, 2026

Linear regression is one of the simplest and most widely used algorithms in machine learning.

It helps you predict continuous values such as house prices, sales revenue, temperatures, stock demand, or customer spending.

If you are learning machine learning with Python, building a linear regression model using scikit-learn is one of the best places to start because the workflow teaches you the foundations of predictive modeling.

In this guide, you will learn how to:

Load a dataset
Prepare features and target variables
Split data into training and testing sets
Train a linear regression model
Make predictions
Evaluate model performance

What Is Linear Regression?

Linear regression predicts a numeric value by finding a relationship between input variables and an output variable.

For example:

Advertising spend → sales revenue
Years of experience → salary
House size → house price

The algorithm fits the straight line that minimizes the sum of squared differences between the predicted and actual values (ordinary least squares).

The equation looks like this:

y = mx + b

Where:

y = predicted value
x = input feature
m = slope (how much y changes when x increases by 1)
b = intercept (the predicted value when x is 0)
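
To make the equation concrete, here is a minimal sketch that computes the least-squares slope and intercept by hand in plain Python. The four data points are made up for illustration and lie exactly on the line y = 2x + 1, so the result is easy to check.

```python
# Toy data (made up for illustration): the points lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Least-squares slope: co-variation of x and y divided by the variation of x.
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
denominator = sum((x - mean_x) ** 2 for x in xs)
m = numerator / denominator

# The fitted line always passes through the point (mean_x, mean_y).
b = mean_y - m * mean_x

print(m, b)  # 2.0 1.0
```

scikit-learn does this calculation for you and extends it to many input features, but the idea is the same.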

Step 1: Install Required Libraries

You need:

Python
pandas
scikit-learn
matplotlib

Install them with:

pip install pandas scikit-learn matplotlib

Step 2: Import Libraries

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

import matplotlib.pyplot as plt

Step 3: Create a Simple Dataset

In this example, we predict student exam scores based on study hours.

data = {
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85]
}

df = pd.DataFrame(data)

print(df)

Output:

   Hours_Studied  Exam_Score
0              1          35
1              2          40
2              3          50
3              4          55
4              5          65
5              6          70
6              7          80
7              8          85

Step 4: Define Features and Target

Machine learning models learn from features to predict a target.

X = df[["Hours_Studied"]]
y = df["Exam_Score"]

Here:

X = feature column
y = target column
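
The double brackets around "Hours_Studied" matter: scikit-learn expects the features as a two-dimensional table and the target as a one-dimensional column. The short sketch below rebuilds the same dataset and prints the shapes so you can see the difference.

```python
import pandas as pd

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})

X = df[["Hours_Studied"]]  # double brackets -> DataFrame (2-D: rows x columns)
y = df["Exam_Score"]       # single brackets -> Series (1-D)

print(type(X).__name__, X.shape)  # DataFrame (8, 1)
print(type(y).__name__, y.shape)  # Series (8,)
```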

Step 5: Split Data Into Training and Testing Sets

Never evaluate a model on the same data it was trained on. The score would reward memorization instead of showing how well the model handles unseen data.

Use train_test_split():

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

Explanation:

test_size=0.2 means 20% of the data is held out as test data (here, 2 of the 8 rows)
random_state=42 fixes the random seed used for shuffling, so you get the same split on every run
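
As a quick check, the sketch below repeats the split on the same dataset, prints the size of each part, and confirms that reusing the seed reproduces the same test rows. The variable names ending in _again are introduced here only for the comparison.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})
X = df[["Hours_Studied"]]
y = df["Exam_Score"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 6 2

# Splitting again with the same seed selects the same rows.
X_train_again, X_test_again, y_train_again, y_test_again = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_test.index.equals(X_test_again.index))  # True
```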

Step 6: Train the Linear Regression Model

Create the model:

model = LinearRegression()

Train it:

model.fit(X_train, y_train)

This step teaches the model the relationship between study hours and exam scores.
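
After fitting, the learned line is stored on the model: coef_ holds the slope for each feature and intercept_ holds the intercept. The sketch below repeats the earlier steps so it runs on its own, then prints the fitted equation. The exact numbers depend on which rows end up in the training set.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})
X = df[["Hours_Studied"]]
y = df["Exam_Score"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

slope = model.coef_[0]        # one entry per feature; there is only one here
intercept = model.intercept_  # predicted score at zero study hours

print(f"Exam_Score = {slope:.2f} * Hours_Studied + {intercept:.2f}")
```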

Step 7: Make Predictions

Now predict exam scores using test data.

predictions = model.predict(X_test)

print(predictions)

You can also predict a completely new value:

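# Use a DataFrame with the training column name to avoid a feature-names warning
new_hours = pd.DataFrame({"Hours_Studied": [9]})
new_prediction = model.predict(new_hours)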

print(new_prediction)

The model estimates the exam score for a student who studies 9 hours.
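
You can predict several new values in one call by passing a DataFrame with one row per student. The sketch below is a simplified, self-contained version: for brevity it fits the model on all eight rows rather than on X_train, so its numbers will differ slightly from the model trained above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})

# Simplification for this sketch: fit on every row instead of a training split.
model = LinearRegression().fit(df[["Hours_Studied"]], df["Exam_Score"])

new_hours = pd.DataFrame({"Hours_Studied": [9, 10, 12]})
new_scores = model.predict(new_hours)

for hours, score in zip(new_hours["Hours_Studied"], new_scores):
    print(f"{hours} hours -> {score:.1f}")
# 9 hours -> 93.2
# 10 hours -> 100.6
# 12 hours -> 115.4
```

Notice that the line keeps rising beyond the range of the training data. If the exam were scored out of 100 (an assumption, since the dataset does not say), the last two predictions would be impossible, which is a reminder to be careful when predicting far outside the values the model has seen.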

Step 8: Evaluate the Model

Two common evaluation metrics are:

Mean Absolute Error (MAE)

Measures the average absolute difference between predicted and actual values, in the same units as the target (here, exam points).

mae = mean_absolute_error(y_test, predictions)

print(mae)

Lower MAE is better.
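
To see what the function computes, the sketch below calculates MAE by hand with NumPy and compares it with scikit-learn's result. The two actual and predicted scores are made-up numbers chosen to keep the arithmetic simple.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values for illustration only.
y_true = np.array([50, 70])
y_pred = np.array([48, 73])

# Absolute errors are 2 and 3, so the mean is 2.5.
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae)   # 2.5
print(sklearn_mae)  # 2.5
```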

R² Score

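Measures the proportion of the variance in the target that the model explains. With only two test rows in this example, treat the result as a rough indication rather than a reliable estimate.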

r2 = r2_score(y_test, predictions)

print(r2)

Interpretation:

1.0 = perfect fit
0.0 = no better than always predicting the mean of the target
Negative values = worse than always predicting the mean
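
The sketch below computes R² by hand as 1 minus the ratio of the model's squared error to the squared error of a mean-only baseline, and compares it with r2_score. The helper manual_r2 and all the numbers are made up for illustration; the three prediction sets were chosen to produce a high, a zero, and a negative score.

```python
import numpy as np
from sklearn.metrics import r2_score

def manual_r2(y_true, y_pred):
    """R^2 = 1 - (squared error of model) / (squared error of mean baseline)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up actual scores and three sets of predictions.
y_true = np.array([50.0, 70.0])
candidates = {
    "close": np.array([48.0, 73.0]),      # small errors
    "mean_only": np.array([60.0, 60.0]),  # always predicts the mean
    "reversed": np.array([80.0, 40.0]),   # gets the trend backwards
}

results = {}
for name, y_pred in candidates.items():
    results[name] = manual_r2(y_true, y_pred)
    print(name, round(results[name], 3), round(r2_score(y_true, y_pred), 3))
# close 0.935 0.935
# mean_only 0.0 0.0
# reversed -8.0 -8.0
```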

Step 9: Visualize the Regression Line

Visualization helps you understand model behavior.

plt.scatter(X, y)

plt.plot(X, model.predict(X))

plt.xlabel("Hours Studied")
plt.ylabel("Exam Score")

plt.show()

The scatter plot shows actual data points, while the line shows model predictions.

Common Beginner Mistakes

1. Training on the Entire Dataset

Always keep separate test data.

2. Using Categorical Data Without Encoding

scikit-learn's LinearRegression expects numeric input. Convert categorical columns to numbers first, for example with one-hot encoding.
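
One common way to encode a text column is one-hot encoding with pandas.get_dummies. The sketch below is only an illustration: the "Study_Method" column and its values are invented and are not part of the dataset used in this guide.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: "Study_Method" is an invented categorical column.
df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6],
    "Study_Method": ["solo", "group", "solo", "group", "solo", "group"],
    "Exam_Score": [35, 42, 50, 58, 65, 72],
})

# Turn each category into a 0/1 column. drop_first=True drops one category
# ("group" here) because it is implied when the remaining column is 0.
X = pd.get_dummies(
    df[["Hours_Studied", "Study_Method"]],
    columns=["Study_Method"],
    drop_first=True,
    dtype=int,
)
y = df["Exam_Score"]

print(X.columns.tolist())  # ['Hours_Studied', 'Study_Method_solo']

# Every column is numeric now, so the model can be fitted.
model = LinearRegression().fit(X, y)
```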

3. Ignoring Feature Scaling in Larger Projects

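Plain linear regression makes the same predictions with or without scaling, but regularized models (such as Ridge and Lasso) and models trained with gradient descent are sensitive to feature scale, so larger projects often standardize or normalize their features.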
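
If you do add scaling, one reasonable pattern is to put the scaler and the model in a scikit-learn pipeline so the scaler is fitted on the training data only. The sketch below assumes a hypothetical second feature, "Minutes_Slept", with invented values on a much larger scale than study hours. It also shows that, for plain linear regression, the scaled and unscaled models predict the same scores.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: "Minutes_Slept" is invented for this illustration.
X = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Minutes_Slept": [420, 390, 480, 450, 400, 510, 470, 430],
})
y = pd.Series([35, 40, 50, 55, 65, 70, 80, 85], name="Exam_Score")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline learns the scaling from X_train and reuses it at predict time.
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_train, y_train)

plain_model = LinearRegression().fit(X_train, y_train)

same = np.allclose(scaled_model.predict(X_test), plain_model.predict(X_test))
print(same)  # True
```

The pipeline becomes more useful once you swap LinearRegression for a model whose results do change with feature scale.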

Building your first linear regression model with scikit-learn teaches the core workflow used in machine learning projects:

Load data
Prepare features
Split data
Train a model
Make predictions
Evaluate performance

Once you understand linear regression, you can move into more advanced models such as decision trees, random forests, gradient boosting, and neural networks.
