How to Compare Logistic Regression vs Decision Trees on Real Data
Comparing classification models is one of the most important parts of applied machine learning.
A model that performs well on training data may completely fail in production if it generalises poorly.
Two of the most widely used classification algorithms are Logistic Regression and Decision Trees. Both can solve binary classification problems, but they behave very differently on real-world datasets.
In this tutorial, you will learn how to compare Logistic Regression and Decision Trees using a practical dataset, proper evaluation metrics, and clear interpretation methods.
Instead of relying on theory alone, we will use real data and examine how each model behaves under the same conditions.
Why Compare Logistic Regression and Decision Trees?
Both models are popular because they are interpretable and relatively easy to implement.
Logistic Regression
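Logistic Regression is a linear model used for classification. It estimates class probabilities by passing a weighted sum of the features through the sigmoid function, and it works well when the relationship between the features and the log-odds of the outcome is approximately linear.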
Advantages:
Fast training
Easy interpretation
Good baseline model
Outputs probabilities
Limitations:
Struggles with complex nonlinear relationships
Sensitive to feature scaling when regularised, which is the scikit-learn default
Can underperform with highly irregular data
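To make the idea of a linear model that outputs probabilities concrete, here is a minimal sketch of the calculation for a single observation. The weights, bias, and feature values are hypothetical numbers chosen for illustration; they are not fitted to any dataset.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one observation."""
    # Linear part: bias plus the weighted sum of the features
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical coefficients and observation, for illustration only
weights = [0.8, -1.2]
bias = 0.1
observation = [1.5, 0.5]

p = predict_probability(observation, weights, bias)
print(round(p, 3))
```

A positive weight pushes the probability up as its feature grows and a negative weight pushes it down, which is why the coefficients are straightforward to interpret.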
Decision Trees
Decision Trees split data into branches using threshold rules on individual features, choosing at each step the split that best separates the classes.
Advantages:
Handles nonlinear relationships
Easy to visualise
Requires little preprocessing
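Handles mixed data types in principle, although scikit-learn's implementation needs categorical features to be encoded as numbers first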
Limitations:
Easily overfits
Can become unstable with small data changes
Often less generalisable than simpler models
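One common way to score how well a split separates the classes is Gini impurity, which is the default criterion in scikit-learn's DecisionTreeClassifier. The sketch below computes it by hand for a hypothetical parent node and one candidate split; the labels are made up for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure node, 0.5 for a 50/50 binary node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical labels: a balanced parent node and one candidate split
parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1]

parent_impurity = gini(parent)
split_impurity = weighted_gini(left, right)
print(parent_impurity, split_impurity)
```

The tree prefers the candidate split with the lowest weighted impurity, which is the sense in which each rule maximises class separation.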
The Dataset
We will use the Breast Cancer Wisconsin dataset available in scikit-learn. This is a real medical dataset used to classify tumors as malignant or benign.
The dataset contains:
569 observations
30 numerical features
Binary target variable (0 = malignant, 1 = benign)
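Before modelling, it helps to check the class balance and which label is which, since both affect how the metrics later in this tutorial should be read. A short sketch, where the variable name class_counts is our own:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Count how many observations fall into each class
labels, counts = np.unique(data.target, return_counts=True)
class_counts = {
    str(data.target_names[label]): int(count)
    for label, count in zip(labels, counts)
}

print(data.data.shape)
print(class_counts)
```

In this dataset label 0 is malignant and label 1 is benign, so scikit-learn's default positive class (label 1) is the benign class.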
Step 1: Import Libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
)
Step 2: Load the Real Dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
print(X.head())
Step 3: Train-Test Split
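We split the data into training and test sets so that both models are evaluated on observations they did not see during training.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,     # hold out 20% of the rows for testing
    random_state=42    # fixed seed so the split is reproducible
)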
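The split above is purely random. As an optional variation, and not something the original walkthrough uses, train_test_split accepts a stratify argument that keeps the class proportions the same in both parts. A minimal sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y preserves the malignant/benign ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of benign cases (label 1) overall, in train, and in test
print(y.mean(), y_train.mean(), y_test.mean())
```

Stratifying matters more as the test set gets smaller or the classes get more imbalanced.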
Step 4: Scale Features for Logistic Regression
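Logistic Regression benefits from standardised inputs: scikit-learn applies L2 regularisation by default, which penalises coefficients according to their scale, and the solver converges faster when features share a similar range.
# Fit the scaler on the training data only, so no test-set statistics leak into training
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training mean and standard deviation on the test set
X_test_scaled = scaler.transform(X_test)
Decision Trees do not require scaling, because each split compares a single feature against a threshold.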
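An alternative way to organise the scaling step is to wrap the scaler and the model in a scikit-learn pipeline, so the scaler is always fitted on whatever data the model is fitted on. This is a sketch of that option, not the approach the remaining steps rely on; the name log_pipeline is our own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline scales and then classifies; fit() learns both from X_train only
log_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
log_pipeline.fit(X_train, y_train)

# score() applies the stored scaling to X_test before predicting
test_accuracy = log_pipeline.score(X_test, y_test)
print(test_accuracy)
```

A pipeline also makes cross-validation safer, because the scaler is refitted inside each fold rather than once on the full dataset.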
Step 5: Train Logistic Regression
log_model = LogisticRegression(max_iter=500)
log_model.fit(X_train_scaled, y_train)
log_preds = log_model.predict(X_test_scaled)
# Column 1 holds the predicted probability of class 1 (benign)
log_probs = log_model.predict_proba(X_test_scaled)[:, 1]
Step 6: Train Decision Tree
tree_model = DecisionTreeClassifier(
    max_depth=4,       # cap the depth to limit overfitting
    random_state=42
)
tree_model.fit(X_train, y_train)
tree_preds = tree_model.predict(X_test)
tree_probs = tree_model.predict_proba(X_test)[:, 1]
Step 7: Compare Performance Metrics
results = pd.DataFrame({
    "Metric": ["Accuracy", "Precision", "Recall", "F1 Score", "ROC-AUC"],
    "Logistic Regression": [
        accuracy_score(y_test, log_preds),
        precision_score(y_test, log_preds),
        recall_score(y_test, log_preds),
        f1_score(y_test, log_preds),
        roc_auc_score(y_test, log_probs),
    ],
    "Decision Tree": [
        accuracy_score(y_test, tree_preds),
        precision_score(y_test, tree_preds),
        recall_score(y_test, tree_preds),
        f1_score(y_test, tree_preds),
        roc_auc_score(y_test, tree_probs),
    ],
})
print(results)
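A single test set of roughly a hundred rows can give noisy numbers, so one optional extension is to repeat the comparison with cross-validation. The sketch below is an addition to the walkthrough rather than part of it. It assumes 5 stratified folds and ROC-AUC as the score, and the dictionary name cv_auc is our own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Use the same folds for both models so the comparison is like for like
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=500)
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
}

cv_auc = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    cv_auc[name] = (scores.mean(), scores.std())
    print(f"{name}: ROC-AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

The standard deviation across folds gives a rough sense of how much of the gap between the two models could be down to the particular split.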
Understanding the Metrics
Accuracy
Accuracy measures the proportion of all predictions that are correct.
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
Accuracy works well on balanced datasets but can mislead on imbalanced data.
Precision
Precision measures how many positive predictions were correct.
\text{Precision} = \frac{TP}{TP + FP}
Useful when false positives are expensive.
Recall
Recall measures how many real positives were detected.
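\text{Recall} = \frac{TP}{TP + FN}
Important when missed positives are costly, as in healthcare, fraud detection, and risk systems. Note that scikit-learn treats label 1 as the positive class by default, which in this dataset is the benign class.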
F1 Score
F1 balances Precision and Recall.
\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
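The four formulas above can be checked by hand from the confusion-matrix counts. The sketch below uses hypothetical counts (40 true positives, 45 true negatives, 5 false positives, 10 false negatives) that do not come from the models in this tutorial.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case precision is higher than recall, because the model makes fewer false positives (5) than false negatives (10).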
ROC-AUC
ROC-AUC measures how well the model separates classes across thresholds.
A score:
Near 1.0 = excellent
Near 0.5 = random guessing
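One way to read ROC-AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counting as half. The sketch below computes it that way on four hypothetical scores. It is a direct pairwise calculation for illustration, not how scikit-learn implements roc_auc_score.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC-AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for label, s in zip(y_true, scores) if label == 1]
    negatives = [s for label, s in zip(y_true, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_pairwise(y_true, scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the result is 0.75.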
Step 8: Compare Confusion Matrices
print("Logistic Regression")
print(confusion_matrix(y_test, log_preds))
print("Decision Tree")
print(confusion_matrix(y_test, tree_preds))
The confusion matrix helps identify:
True Positives
True Negatives
False Positives
False Negatives
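In scikit-learn the rows of the confusion matrix are the true classes and the columns are the predicted classes, so a binary matrix reads [[TN, FP], [FN, TP]]. The sketch below unpacks the four counts from a small hypothetical set of labels.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

matrix = confusion_matrix(y_true, y_pred)
print(matrix)

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = matrix.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Applying the same unpacking to the matrices printed in this step gives the counts for each model directly.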
What You Will Usually Observe
In many structured datasets:
Logistic Regression produces smoother, more stable predictions
Decision Trees capture complex patterns more easily
Trees may overfit training data
Logistic Regression often generalises better
Typical outcomes:
Logistic Regression achieves slightly higher ROC-AUC
Decision Trees achieve interpretable rule-based predictions
Trees can outperform when relationships are highly nonlinear
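A simple way to see the overfitting tendency described above is to compare training and test accuracy for a tree with no depth limit against the depth-limited tree from Step 6. This is a sketch using the same split settings as the tutorial; the dictionary name accuracies is our own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

accuracies = {}
for label, depth in (("unrestricted", None), ("max_depth=4", 4)):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    accuracies[label] = {
        "train": tree.score(X_train, y_train),
        "test": tree.score(X_test, y_test),
    }
    print(label, accuracies[label])
```

A large gap between training and test accuracy is the usual sign that a tree has memorised the training data rather than learned a pattern that transfers.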
When to Choose Logistic Regression
Use Logistic Regression when:
You need probability estimates
Interpretability matters
Data relationships are approximately linear
You want a strong baseline model
Overfitting risk must remain low
When to Choose Decision Trees
Use Decision Trees when:
Relationships are nonlinear
Feature interactions matter
Business users need rule-based explanations
Data contains mixed variable types
Minimal preprocessing is preferred
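For the rule-based explanations mentioned above, scikit-learn can print a fitted tree as nested if/else rules with export_text. The sketch below uses a deliberately shallow tree (max_depth=2, our choice for readability) fitted on the full dataset purely to show the output format.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# A shallow tree keeps the printed rules short enough to read
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(data.data, data.target)

# feature_names makes the rules refer to real column names
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```

Each indented line is one threshold test, and each leaf ends in a predicted class, which is the form business users tend to find easiest to follow.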
The Most Important Lesson
Model comparison is not about finding a universally “best” algorithm. It is about identifying which model performs best for a specific business problem, dataset structure, and error tolerance.
A healthcare model may prioritise Recall.
A fraud model may prioritise Precision.
A customer churn model may prioritise ROC-AUC.
The correct model depends on operational goals, not just accuracy.
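When a metric such as Recall is the priority, it matters which class counts as positive. In the Breast Cancer Wisconsin data, label 0 is malignant, so recall for malignant cases needs pos_label=0. The sketch below shows the difference on a small set of hypothetical labels rather than on the tutorial's models.

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 0 = malignant, 1 = benign
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

# Default: label 1 (benign) is treated as the positive class
recall_benign = recall_score(y_true, y_pred)

# pos_label=0 measures how many malignant cases were caught
recall_malignant = recall_score(y_true, y_pred, pos_label=0)

print(recall_benign, recall_malignant)
```

Here one of three malignant cases is missed, which the default recall does not show.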
Comparing Logistic Regression and Decision Trees on real data teaches one of the most important machine learning principles: evaluation matters more than algorithm hype.
Many beginners jump directly into complex models without properly benchmarking simpler methods.
In practice, Logistic Regression frequently performs surprisingly well, while Decision Trees provide interpretability and nonlinear flexibility.
The best machine learning engineers do not choose models emotionally. They compare them systematically using:
Fair train/test splits
Multiple evaluation metrics
Real datasets
Business-aware interpretation
That is how reliable machine learning systems are built.