How to Build Your First Linear Regression Model With scikit-learn
Linear regression is one of the most widely used algorithms in machine learning and a common starting point for supervised learning.
It helps you predict continuous values such as house prices, sales revenue, temperatures, stock demand, or customer spending.
If you are learning machine learning with Python, building a linear regression model using scikit-learn is one of the best places to start because the workflow teaches you the foundations of predictive modeling.
In this guide, you will learn how to:
Load a dataset
Prepare features and target variables
Split data into training and testing sets
Train a linear regression model
Make predictions
Evaluate model performance
What Is Linear Regression?
Linear regression predicts a numeric value by modeling a linear relationship between one or more input variables (features) and an output variable (target).
For example:
Advertising spend → sales revenue
Years of experience → salary
House size → house price
The algorithm fits the straight line that minimizes the sum of squared differences between the actual and predicted values (ordinary least squares).
The equation looks like this:
y = mx + b
Where:
y = predicted value
x = input feature
m = slope
b = intercept
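As a minimal sketch of how a slope and intercept can be found, the closed-form least-squares formulas for a single feature are shown below. The four data points are hypothetical and lie exactly on y = 2x + 1, so the recovered values are easy to check by eye.

```python
# Hypothetical points lying exactly on y = 2x + 1 (not the article's dataset)
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Slope: co-variation of x and y divided by the variation of x
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
# Intercept: the fitted line passes through the point (mean_x, mean_y)
b = mean_y - m * mean_x

print(m, b)  # 2.0 1.0
```

scikit-learn performs the equivalent computation for you, including the case with several features.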
Step 1: Install Required Libraries
You need:
Python
pandas
scikit-learn
matplotlib
Install them with:
pip install pandas scikit-learn matplotlib
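To confirm the installation worked, one option is to query the installed versions from Python. This is a small sketch using the standard library's importlib.metadata; the package names are the same ones passed to pip above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used by pip
packages = ["pandas", "scikit-learn", "matplotlib"]

installed = {}
for name in packages:
    try:
        installed[name] = version(name)
    except PackageNotFoundError:
        installed[name] = None  # package is missing from this environment

for name, ver in installed.items():
    print(f"{name}: {ver or 'not installed'}")
```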
Step 2: Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt
Step 3: Create a Simple Dataset
In this example, we predict student exam scores based on study hours.
data = {
"Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
"Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85]
}
df = pd.DataFrame(data)
print(df)
Output:
   Hours_Studied  Exam_Score
0              1          35
1              2          40
2              3          50
3              4          55
4              5          65
5              6          70
6              7          80
7              8          85
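Before modeling, it can help to run a quick sanity check on the data. The sketch below rebuilds the same DataFrame and looks at its shape, missing values, and the correlation between the two columns; the choice of checks is a suggestion rather than a required step.

```python
import pandas as pd

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})

print(df.shape)         # (rows, columns)
print(df.isna().sum())  # number of missing values per column

# Pearson correlation: values near 1 suggest a strong positive linear relationship
corr = df["Hours_Studied"].corr(df["Exam_Score"])
print(round(corr, 3))
```

A correlation close to 1 suggests that a straight line is a reasonable model for this data.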
Step 4: Define Features and Target
Machine learning models learn from features to predict a target.
X = df[["Hours_Studied"]]
y = df["Exam_Score"]
Here:
X = feature column (a two-dimensional DataFrame, hence the double brackets)
y = target column (a one-dimensional Series)
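The difference between single and double brackets matters because scikit-learn estimators expect a two-dimensional feature matrix. A short sketch that prints the resulting types and shapes:

```python
import pandas as pd

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})

X = df[["Hours_Studied"]]  # double brackets -> DataFrame with shape (8, 1)
y = df["Exam_Score"]       # single brackets -> Series with shape (8,)

print(type(X).__name__, X.shape)  # DataFrame (8, 1)
print(type(y).__name__, y.shape)  # Series (8,)
```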
Step 5: Split Data Into Training and Testing Sets
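Never evaluate a model on the same data it was trained on: the score would reflect how well the model memorized those rows rather than how it performs on unseen data.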
Use train_test_split():
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.2,
random_state=42
)
Explanation:
test_size=0.2 means 20% of the data becomes test data
random_state=42 ensures reproducible results
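To see what the split actually produces, the following sketch prints the sizes of both sets, checks that no row appears in both, and repeats the split to show that a fixed random_state gives the same result each time.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})
X = df[["Hours_Studied"]]
y = df["Exam_Score"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 6 2

# No row is shared between the training and test sets
overlap = set(X_train.index) & set(X_test.index)
print(overlap)  # set()

# Repeating the split with the same random_state selects the same test rows
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = list(X_test.index) == list(X_test_again.index)
print(same_split)  # True
```

With only 8 rows, 20% rounds up to 2 test rows, which is very little data to evaluate on; real projects usually have far more.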
Step 6: Train the Linear Regression Model
Create the model:
model = LinearRegression()
Train it:
model.fit(X_train, y_train)
This step teaches the model the relationship between study hours and exam scores.
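What the model learned can be read back from its coef_ and intercept_ attributes, which correspond to m and b in the line equation. A sketch that fits the model on the training split and prints the resulting line:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})
X = df[["Hours_Studied"]]
y = df["Exam_Score"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

slope = model.coef_[0]        # one coefficient per feature column
intercept = model.intercept_  # predicted score at zero study hours

print(f"Exam_Score = {slope:.2f} * Hours_Studied + {intercept:.2f}")
```

The slope can be read as the estimated gain in exam points for each additional hour of study.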
Step 7: Make Predictions
Now predict exam scores using test data.
predictions = model.predict(X_test)
print(predictions)
You can also predict a completely new value:
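# Pass a DataFrame with the training column name; a bare list such as [[9]] triggers a feature-name warning
new_prediction = model.predict(pd.DataFrame({"Hours_Studied": [9]}))
print(new_prediction)
The model estimates the exam score for a student who studies 9 hours. Because 9 lies outside the 1 to 8 hour range of the data, this is an extrapolation and should be treated with caution.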
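To make the predictions less of a black box, the sketch below places actual and predicted scores for the test rows side by side, then reproduces the 9-hour prediction by hand from the slope and intercept. The column names Actual and Predicted are chosen here for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})
X = df[["Hours_Studied"]]
y = df["Exam_Score"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Side-by-side view of actual and predicted scores for the held-out rows
comparison = X_test.assign(Actual=y_test, Predicted=model.predict(X_test))
print(comparison)

# A prediction is just slope * x + intercept
new_hours = pd.DataFrame({"Hours_Studied": [9]})
predicted = model.predict(new_hours)[0]
manual = model.coef_[0] * 9 + model.intercept_
print(predicted, manual)  # the two values should agree
```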
Step 8: Evaluate the Model
Two common evaluation metrics are:
Mean Absolute Error (MAE)
Measures the average absolute difference between predicted and actual values, in the same units as the target (here, exam points).
mae = mean_absolute_error(y_test, predictions)
print(mae)
Lower MAE is better.
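The metric is simple enough to compute by hand. This sketch uses three hypothetical actual and predicted scores (not taken from the model above) and compares the manual result with scikit-learn's mean_absolute_error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted exam scores, for illustration only
y_true = np.array([50, 85, 65])
y_pred = np.array([48, 88, 66])

abs_errors = np.abs(y_true - y_pred)  # [2, 3, 1]
mae_manual = abs_errors.mean()        # (2 + 3 + 1) / 3 = 2.0
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)  # 2.0 2.0
```

An MAE of 2.0 means the predictions are off by two exam points on average.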
R² Score
Measures the proportion of variance in the target that the model explains.
r2 = r2_score(y_test, predictions)
print(r2)
Interpretation:
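1.0 = perfect fit
0.0 = no better than always predicting the mean of the target
Negative values = worse than always predicting the mean
With only two test rows in this example, treat the score as a rough indication rather than a reliable estimate.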
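The score follows from the formula R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations from the mean. The sketch below works through it with hypothetical values and shows that always predicting the mean scores exactly 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # 0.25 + 0.25 + 0 + 0.25 = 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20.0
r2_manual = 1 - ss_res / ss_tot                 # 0.9625
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)

# Baseline that always predicts the mean of the target
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, baseline)
print(r2_baseline)  # 0.0
```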
Step 9: Visualize the Regression Line
Visualization helps you understand model behavior.
plt.scatter(X, y)
plt.plot(X, model.predict(X))
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score")
plt.show()
The scatter plot shows actual data points, while the line shows model predictions.
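A slightly richer variant of the plot distinguishes training points from test points and adds a legend. This is one possible layout, not the only one: the Agg backend and the output file name regression_line.png are assumptions made so that the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to use plt.show() instead
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Exam_Score": [35, 40, 50, 55, 65, 70, 80, 85],
})
X = df[["Hours_Studied"]]
y = df["Exam_Score"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Evenly spaced hours across the observed range, used to draw the fitted line
grid = pd.DataFrame({"Hours_Studied": np.linspace(1, 8, 50)})

fig, ax = plt.subplots()
ax.scatter(X_train["Hours_Studied"], y_train, label="Training data")
ax.scatter(X_test["Hours_Studied"], y_test, marker="x", label="Test data")
ax.plot(grid["Hours_Studied"], model.predict(grid), color="red", label="Fitted line")
ax.set_xlabel("Hours Studied")
ax.set_ylabel("Exam Score")
ax.legend()
fig.savefig("regression_line.png")  # hypothetical output file name
```

Test points that sit close to the line indicate that the model generalizes reasonably to rows it did not see during training.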
Common Beginner Mistakes
1. Training on the Entire Dataset
Always hold out test data that the model never sees during training.
2. Using Categorical Data Without Encoding
scikit-learn's LinearRegression accepts only numeric input, so convert categorical columns to numbers first, for example with one-hot encoding.
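One way to encode a categorical column is pandas' get_dummies. In the sketch below, the Study_Mode column and its values are hypothetical additions made for illustration; they are not part of the dataset used earlier.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Study_Mode is a hypothetical categorical column added for this example
df = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6],
    "Study_Mode": ["solo", "group", "solo", "group", "solo", "group"],
    "Exam_Score": [35, 45, 50, 60, 65, 75],
})

# One-hot encode the text column; drop_first avoids a redundant indicator column
X = pd.get_dummies(
    df[["Hours_Studied", "Study_Mode"]],
    columns=["Study_Mode"],
    drop_first=True,
    dtype=int,
)
y = df["Exam_Score"]

print(X.columns.tolist())  # ['Hours_Studied', 'Study_Mode_solo']

model = LinearRegression().fit(X, y)  # fitting now works: every column is numeric
```

Passing the raw Study_Mode strings to fit() would raise an error, because they cannot be converted to numbers.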
3. Ignoring Feature Scaling in Larger Projects
Ordinary least squares predictions do not depend on feature scale, but regularized models such as Ridge and Lasso, and models trained with gradient descent, often benefit from normalization or standardization.
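When scaling is needed, a common pattern is to put the scaler and the model into a single scikit-learn pipeline. The sketch below does this with StandardScaler and compares the result against an unscaled LinearRegression. The Pages_Read feature is hypothetical, and both models are fitted on all rows only to compare them, not to evaluate them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature dataset with very different feature scales
X = pd.DataFrame({
    "Hours_Studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "Pages_Read": [120, 340, 300, 510, 480, 700, 650, 900],
})
y = pd.Series([35, 40, 50, 55, 65, 70, 80, 85], name="Exam_Score")

plain = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# For plain least squares, the predictions match with or without scaling
same_predictions = np.allclose(plain.predict(X), scaled.predict(X))
print(same_predictions)
```

The pipeline keeps the scaling step attached to the model, so the same transformation is applied automatically at prediction time if you later swap in a scale-sensitive estimator.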
Building your first linear regression model with scikit-learn teaches the core workflow used in machine learning projects:
Load data
Prepare features
Split data
Train a model
Make predictions
Evaluate performance
Once you understand linear regression, you can move into more advanced models such as decision trees, random forests, gradient boosting, and neural networks.