How to Build Your First Linear Regression Model With scikit-learn

Linear regression is one of the most important algorithms in machine learning. 



It helps you predict continuous values such as house prices, sales revenue, temperatures, stock demand, or customer spending.

If you are learning machine learning with Python, building a linear regression model using scikit-learn is one of the best places to start because the workflow teaches you the foundations of predictive modeling.

In this guide, you will learn how to:

  • Load a dataset

  • Prepare features and target variables

  • Split data into training and testing sets

  • Train a linear regression model

  • Make predictions

  • Evaluate model performance


What Is Linear Regression?

Linear regression predicts a numeric value by finding a relationship between input variables and an output variable.

For example:

  • Advertising spend → sales revenue

  • Years of experience → salary

  • House size → house price

The algorithm fits the straight line that minimizes the sum of squared differences between the predicted and actual values.


With a single input feature, the equation looks like this:

y = mx + b

Where:

  • y = predicted value

  • x = input feature

  • m = slope

  • b = intercept
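To make the slope and intercept concrete, here is a minimal sketch that computes them with the closed-form least-squares formulas. The four data points and the function name fit_line are made up for illustration; scikit-learn does this work for you later in the guide.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line always passes through (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Hypothetical points that lie exactly on y = 2x + 1.
m, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(m, b)  # 2.0 1.0
```

Because these points sit exactly on a line, the fit recovers its slope and intercept. Real data is noisy, so the fitted line is the closest compromise, not an exact match.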


Step 1: Install Required Libraries

You need:

  • Python

  • pandas

  • scikit-learn

  • matplotlib

Install them with:

pip install pandas scikit-learn matplotlib
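If you want to confirm the installation before moving on, a small check like the one below can help. It is only a sketch: the helper name check_installed is hypothetical, and it relies on the standard library's importlib to look for each package without importing it.

```python
import importlib.util


def check_installed(module_names):
    """Map each import name to True if Python can find the module."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
status = check_installed(["pandas", "sklearn", "matplotlib"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```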



Step 2: Import Libraries

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

import matplotlib.pyplot as plt



Step 3: Create a Simple Dataset

In this example, we predict student exam scores based on study hours.

data = {
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85]
}

df = pd.DataFrame(data)

print(df)

Output:

   Hours_Studied  Exam_Score
0              1          35
1              2          40
2              3          50
3              4          55
4              5          65
5              6          70
6              7          80
7              8          85


Step 4: Define Features and Target

Machine learning models learn from features to predict a target.

X = df[["Hours_Studied"]]
y = df["Exam_Score"]

Here:

  • X = feature column

  • y = target column
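The double brackets around the feature column matter: scikit-learn estimators expect the features as a 2D table, even when there is only one column. The sketch below rebuilds the guide's dataset so it runs on its own and prints the shapes of both objects.

```python
import pandas as pd

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})

X = df[["Hours_Studied"]]  # double brackets -> DataFrame (2D)
y = df["Exam_Score"]       # single brackets -> Series (1D)

print(X.shape)  # (8, 1)
print(y.shape)  # (8,)
```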



Step 5: Split Data Into Training and Testing Sets

Evaluate a model on data it did not see during training. Testing on the training data overstates how well the model will predict new cases.

Use train_test_split():

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

Explanation:

  • test_size=0.2 reserves 20% of the rows for testing (rounded up, so 2 of the 8 rows here)

  • random_state=42 fixes the shuffle, so the split is the same on every run
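A quick way to see both parameters at work is to check the sizes of the resulting sets and repeat the split. This sketch rebuilds the guide's dataset so it runs on its own; the variable names ending in _again are only for the comparison.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})
X = df[["Hours_Studied"]]
y = df["Exam_Score"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# 20% of 8 rows is 1.6, which scikit-learn rounds up to 2 test rows.
print(len(X_train), len(X_test))  # 6 2

# Repeating the split with the same random_state selects the same rows.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_test.index.equals(X_test_again.index))  # True
```

With a dataset this small, two test rows give only a rough performance estimate. Larger datasets make the held-out score far more trustworthy.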




Step 6: Train the Linear Regression Model

Create the model:

model = LinearRegression()

Train it:

model.fit(X_train, y_train)

This step teaches the model the relationship between study hours and exam scores.
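After fitting, the learned slope and intercept are available as model.coef_ and model.intercept_. The sketch below rebuilds the guide's data and split so it runs on its own, then prints the fitted line. The exact numbers depend on which rows land in the training set, so no output is shown.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})
X = df[["Hours_Studied"]]
y = df["Exam_Score"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# coef_ holds one slope per feature; intercept_ is a single number.
slope = model.coef_[0]
intercept = model.intercept_
print(f"Predicted Exam_Score = {slope:.2f} * Hours_Studied + {intercept:.2f}")
```

The slope reads as "expected extra exam points per additional hour studied", which is often the most useful thing a linear model tells you.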



Step 7: Make Predictions

Now predict exam scores using test data.

predictions = model.predict(X_test)

print(predictions)

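You can also predict a value the model has not seen. Pass it as a DataFrame with the same column name used in training; a bare list such as [[9]] still works, but scikit-learn 1.0 and later emits a feature-name warning:

new_data = pd.DataFrame({"Hours_Studied": [9]})
new_prediction = model.predict(new_data)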

print(new_prediction)

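The model estimates the exam score for a student who studies 9 hours. Because 9 lies outside the 1 to 8 hour range in the data, this is an extrapolation: the estimate assumes the same linear trend continues.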
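Before computing any metrics, it can help to look at the test predictions next to the actual scores. The sketch below rebuilds the guide's data, split, and model so it runs on its own; the comparison table and its column names are illustrative additions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})
X = df[["Hours_Studied"]]
y = df["Exam_Score"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# One row per test example: input, true score, and model estimate.
comparison = pd.DataFrame({
    "Hours_Studied": X_test["Hours_Studied"],
    "Actual": y_test,
    "Predicted": model.predict(X_test),
})
# Positive error: the model predicted too low; negative: too high.
comparison["Error"] = comparison["Actual"] - comparison["Predicted"]
print(comparison)
```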




Step 8: Evaluate the Model

Two common evaluation metrics are:

Mean Absolute Error (MAE)

Measures the average absolute difference between predicted and actual values, in the same units as the target (exam points here).



mae = mean_absolute_error(y_test, predictions)

print(mae)


Lower MAE is better.
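To see what the metric computes, here is the same calculation written out by hand. The actual and predicted values are invented for illustration, and the function name mae_by_hand is hypothetical; in practice you would keep using mean_absolute_error.

```python
def mae_by_hand(actual, predicted):
    """Average of the absolute differences between paired values."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Hypothetical scores: the model misses by 2.5 and 2.0 points.
actual = [40, 70]
predicted = [42.5, 68.0]
print(mae_by_hand(actual, predicted))  # 2.25
```

An MAE of 2.25 means the predictions are off by 2.25 exam points on average, regardless of direction.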


R² Score

Measures the proportion of the variance in the target that the model explains.




r2 = r2_score(y_test, predictions)

print(r2)



Interpretation:

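  • 1.0 = perfect fit

  • 0.0 = no better than always predicting the mean of the target

  • Negative values = worse than always predicting the mean

With only two test rows, as in this example, R² can swing widely, so treat it as a rough signal.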
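R² compares the model's squared errors with those of a baseline that always predicts the mean of the actual values. The sketch below writes that formula out by hand on invented numbers; the function name r2_by_hand is hypothetical, and r2_score remains the tool to use in practice.

```python
def r2_by_hand(actual, predicted):
    """R^2 = 1 - (squared error of model) / (squared error of mean baseline)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [40, 50, 60, 70]  # hypothetical exam scores, mean is 55

# Close predictions: almost all of the variance is explained.
print(r2_by_hand(actual, [42, 48, 61, 69]))  # 0.98

# Always predicting the mean: exactly the baseline.
print(r2_by_hand(actual, [55, 55, 55, 55]))  # 0.0

# Predictions that run the wrong way: worse than the baseline.
print(r2_by_hand(actual, [70, 60, 50, 40]))  # -3.0
```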


Step 9: Visualize the Regression Line

A plot shows how closely the fitted line follows the data.

plt.scatter(X, y)

plt.plot(X, model.predict(X))

plt.xlabel("Hours Studied")
plt.ylabel("Exam Score")

plt.show()


The scatter plot shows actual data points, while the line shows model predictions.


Common Beginner Mistakes

1. Training on the Entire Dataset

If you fit on every row, nothing is left to measure how the model handles unseen data. Always hold out a test set.

2. Using Categorical Data Without Encoding

scikit-learn's LinearRegression accepts only numeric input. Convert categorical columns to numbers first, for example with one-hot encoding.
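One common way to encode a text column is one-hot encoding with pandas. In the sketch below, the Study_Method column and its values are invented for illustration; they are not part of this guide's dataset.

```python
import pandas as pd

# Hypothetical data with one categorical column.
df = pd.DataFrame({
    "Hours_Studied": [2, 4, 6],
    "Study_Method": ["solo", "group", "solo"],
})

# Replace the text column with one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["Study_Method"], dtype=int)

print(list(encoded.columns))
# ['Hours_Studied', 'Study_Method_group', 'Study_Method_solo']
```

Every column in the encoded table is numeric, so it can be passed to LinearRegression as features.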

3. Ignoring Feature Scaling in Larger Projects

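Ordinary least squares gives the same predictions with or without scaling, so this simple model does not need it. Regularized linear models such as Ridge and Lasso, and models trained with gradient descent, usually do benefit from normalization or standardization.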
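One common way to add scaling, sketched here under the assumption that standardization is the chosen method, is to chain StandardScaler and the model in a pipeline. For plain least squares with an intercept the predictions come out the same either way, which the last line checks; the step matters more for regularized models such as Ridge or Lasso. The sketch fits on all eight rows only to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})
X = df[["Hours_Studied"]]
y = df["Exam_Score"]

# Same model, with and without a standardization step in front.
plain = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Both versions produce the same predictions for this model.
print(np.allclose(plain.predict(X), scaled.predict(X)))  # True
```

Keeping the scaler inside the pipeline means it is fitted together with the model, so calling fit on training data alone never lets test rows influence the scaling.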


Building your first linear regression model with scikit-learn teaches the core workflow used in machine learning projects:

  1. Load data

  2. Prepare features

  3. Split data

  4. Train a model

  5. Make predictions

  6. Evaluate performance


Once you understand linear regression, you can move into more advanced models such as decision trees, random forests, gradient boosting, and neural networks.

