How to Compare Logistic Regression vs Decision Trees on Real Data

Comparing classification models is one of the most important parts of applied machine learning. A model that performs well on training data may completely fail in production if it generalises poorly.

Two of the most widely used classification algorithms are Logistic Regression and Decision Trees. Both can solve binary classification problems, but they behave very differently on real-world datasets.

In this tutorial, you will learn how to compare Logistic Regression and Decision Trees using a practical dataset, proper evaluation metrics, and clear interpretation methods. 

Instead of relying on theory alone, we will use real data and examine how each model behaves under the same conditions.


Why Compare Logistic Regression and Decision Trees?

Both models are popular because they are interpretable and relatively easy to implement.

Logistic Regression

Logistic Regression is a linear model used for classification. It estimates probabilities and works well when relationships between variables are approximately linear.

Advantages:

  • Fast training

  • Easy interpretation

  • Good baseline model

  • Outputs probabilities

Limitations:

  • Struggles with complex nonlinear relationships

  • Sensitive to feature scaling when regularised (scikit-learn applies an L2 penalty by default)

  • Can underperform with highly irregular data



Decision Trees

Decision Trees split data into branches using rules that maximise class separation.

Advantages:

  • Handles nonlinear relationships

  • Easy to visualise

  • Requires little preprocessing

  • Handles mixed data types well in principle (scikit-learn's trees still need categorical features encoded as numbers)

Limitations:

  • Easily overfits

  • Can become unstable with small data changes

  • Often less generalisable than simpler models
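
The contrast between a linear model and a tree on nonlinear data can be sketched with a small synthetic example. The XOR-style dataset below is an assumption made purely for illustration (it is not part of the tutorial's data): the class depends on the sign of the product of two features, which no single straight line can separate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR-style data (illustrative assumption, not the tutorial dataset)
rng = np.random.default_rng(0)
X_xor = rng.uniform(-1, 1, size=(2000, 2))
y_xor = (X_xor[:, 0] * X_xor[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_xor, y_xor, test_size=0.25, random_state=0
)

# A linear decision boundary cannot separate opposite quadrants
lr_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# A tree can carve the plane into axis-aligned regions
tree_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

print(f"Logistic Regression accuracy: {lr_acc:.3f}")
print(f"Decision Tree accuracy:       {tree_acc:.3f}")
```

On data like this, Logistic Regression should land close to random guessing while the tree recovers the pattern. Real datasets are rarely this extreme, which is why the comparison later in the tutorial uses real data.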


The Dataset

We will use the Breast Cancer Wisconsin dataset available in scikit-learn. This is a real medical dataset used to classify tumors as malignant or benign.

The dataset contains:

  • 569 observations

  • 30 numerical features

  • Binary target variable (0 = malignant, 1 = benign; 212 malignant and 357 benign cases)
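
Before modelling, it helps to confirm these facts directly. The sketch below checks the shape, the label encoding, and the class balance; the variable name class_counts is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Shape of the feature matrix: (observations, features)
print(data.data.shape)

# Label encoding: position 0 is the name of class 0, position 1 of class 1
print(list(data.target_names))

# Count observations per class, in label order (0, then 1)
class_counts = np.bincount(data.target)
for name, count in zip(data.target_names, class_counts):
    print(f"{name}: {count}")
```

Note that scikit-learn encodes benign as 1, so metrics such as Precision and Recall treat benign as the positive class unless told otherwise.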


Step 1: Import Libraries

import pandas as pd
import numpy as np

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix
)


Step 2: Load the Real Dataset

data = load_breast_cancer()

X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

print(X.head())


Step 3: Train-Test Split

We hold out 20% of the data as a test set so that both models are evaluated on observations they never saw during training. Fixing random_state makes the split reproducible.

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)


Step 4: Scale Features for Logistic Regression

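Logistic Regression performs better when numerical variables are scaled: the solver converges faster, and the default L2 penalty treats all features evenly. The scaler is fitted on the training set only and then applied to the test set, so no information from the test data leaks into training.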

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

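Decision Trees do not require scaling, because each split compares a single feature against a threshold and the units of that feature do not change which observations fall on each side.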
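
An alternative way to organise the scaling step is to bundle the scaler and the model into a single scikit-learn pipeline. This is a minimal sketch rather than a required change to the tutorial; the names log_pipeline and pipeline_accuracy are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training data only;
# score() and predict() reuse those statistics on new data.
log_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
log_pipeline.fit(X_train, y_train)

pipeline_accuracy = log_pipeline.score(X_test, y_test)
print(f"Test accuracy: {pipeline_accuracy:.3f}")
```

The benefit of this design is that the scaler can never be fitted on test data by accident, which matters most once cross-validation is involved.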



Step 5: Train Logistic Regression

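# max_iter=500 gives the solver more iterations than the default to converge
log_model = LogisticRegression(max_iter=500)

log_model.fit(X_train_scaled, y_train)

# Hard class labels (0 or 1), used for Accuracy, Precision, Recall, and F1
log_preds = log_model.predict(X_test_scaled)
# Predicted probability of class 1 (benign), used for ROC-AUC
log_probs = log_model.predict_proba(X_test_scaled)[:, 1]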
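
The "easy interpretation" advantage can be made concrete by inspecting the fitted coefficients. The sketch below refits the same model so that it runs on its own; the names coefficients and top_features are introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

log_model = LogisticRegression(max_iter=500)
log_model.fit(X_train_scaled, y_train)

# coef_ has shape (1, n_features) for a binary problem: one weight per feature
coefficients = pd.Series(log_model.coef_[0], index=X.columns)

# Rank features by absolute coefficient size and keep the five largest
order = coefficients.abs().sort_values(ascending=False).index
top_features = coefficients.reindex(order).head(5)
print(top_features)
```

Because the inputs are standardised, each coefficient is the change in log-odds of class 1 (benign) per one standard deviation of the feature. A negative sign therefore pushes the prediction toward malignant. Correlated features share weight between them, so the ranking is a guide rather than a causal statement.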


Step 6: Train Decision Tree

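# max_depth=4 caps tree growth to reduce overfitting;
# random_state makes tie-breaking between candidate splits reproducible
tree_model = DecisionTreeClassifier(
    max_depth=4,
    random_state=42
)

# The tree is trained on the unscaled features
tree_model.fit(X_train, y_train)

tree_preds = tree_model.predict(X_test)
# Predicted probability of class 1 (benign): the share of class-1 training samples in the matching leaf
tree_probs = tree_model.predict_proba(X_test)[:, 1]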
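
One way to see why trees are considered easy to explain is to print the learned rules as text. The sketch below refits the same tree so that it runs on its own and uses scikit-learn's export_text helper; the variable name rules is introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

tree_model = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_model.fit(X_train, y_train)

# Render the fitted tree as nested if/else rules with readable feature names
rules = export_text(tree_model, feature_names=list(data.feature_names))
print(rules)

# Summary of tree size: depth actually reached and number of leaf nodes
print("Depth:", tree_model.get_depth(), "Leaves:", tree_model.get_n_leaves())
```

Each indented line is a threshold test on one feature, and each "class:" line is a leaf prediction (0 = malignant, 1 = benign).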


Step 7: Compare Performance Metrics

results = pd.DataFrame({
    "Metric": ["Accuracy", "Precision", "Recall", "F1 Score", "ROC-AUC"],

    "Logistic Regression": [
        accuracy_score(y_test, log_preds),
        precision_score(y_test, log_preds),
        recall_score(y_test, log_preds),
        f1_score(y_test, log_preds),
        roc_auc_score(y_test, log_probs)
    ],

    "Decision Tree": [
        accuracy_score(y_test, tree_preds),
        precision_score(y_test, tree_preds),
        recall_score(y_test, tree_preds),
        f1_score(y_test, tree_preds),
        roc_auc_score(y_test, tree_probs)
    ]
})

print(results)
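
A single train/test split on 569 observations can be noisy, so a small gap between the two models may just reflect which rows landed in the test set. One common check is cross-validation. The sketch below is an optional extension rather than part of the original steps; the choice of 5 stratified folds, the seed, and the names models and cv_scores are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    # The pipeline refits the scaler inside every fold, avoiding leakage
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=500)
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
}

# Stratified folds keep the malignant/benign ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    cv_scores[name] = scores
    print(f"{name}: mean ROC-AUC {scores.mean():.3f} (std {scores.std():.3f})")
```

If the difference between the mean scores is smaller than the spread across folds, the two models are hard to tell apart on this metric.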



Understanding the Metrics

Accuracy

Accuracy measures the proportion of all predictions that are correct.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Accuracy works well on balanced datasets but can mislead on imbalanced data.


Precision

Precision measures the proportion of positive predictions that were correct.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Useful when false positives are expensive.


Recall

Recall measures the proportion of real positives that were detected.

$$\text{Recall} = \frac{TP}{TP + FN}$$

Important when false negatives are expensive, as in healthcare, fraud detection, and risk systems.


F1 Score

F1 is the harmonic mean of Precision and Recall, so it is high only when both are high.

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
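
The four formulas can be verified by hand on a small example. The counts below are hypothetical numbers chosen for illustration, not results from the breast cancer data; the sketch computes each metric from the counts and cross-checks it against scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical confusion-matrix counts (illustrative only)
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Rebuild label arrays that produce exactly these counts
y_true = np.array([1] * tp + [0] * tn + [0] * fp + [1] * fn)
y_pred = np.array([1] * tp + [0] * tn + [1] * fp + [0] * fn)

print(f"Accuracy:  {accuracy:.3f}  sklearn: {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision:.3f}  sklearn: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall:.3f}  sklearn: {recall_score(y_true, y_pred):.3f}")
print(f"F1:        {f1:.3f}  sklearn: {f1_score(y_true, y_pred):.3f}")
```

With these counts the values are 0.850, 0.889, 0.800, and 0.842. Precision is higher than Recall here because there are fewer false positives (5) than false negatives (10).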


ROC-AUC

ROC-AUC measures how well the model separates classes across thresholds.

A score:

  • Near 1.0 = excellent

  • Near 0.5 = random guessing
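
ROC-AUC also has a direct interpretation: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below checks this on six hypothetical scores (illustrative values, not model output) by comparing every positive with every negative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores (illustrative only)
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9])

pos = y_score[y_true == 1]  # scores of the positives
neg = y_score[y_true == 0]  # scores of the negatives

# Compare every positive with every negative; ties count as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"Pairwise estimate: {pairwise_auc:.3f}")
print(f"roc_auc_score:     {roc_auc_score(y_true, y_score):.3f}")
```

Six of the nine positive-negative pairs are ranked correctly, so both values equal 6/9, or about 0.667. This is why ROC-AUC depends only on the ranking of the scores and not on any single threshold.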



Step 8: Compare Confusion Matrices

print("Logistic Regression")
print(confusion_matrix(y_test, log_preds))

print("Decision Tree")
print(confusion_matrix(y_test, tree_preds))

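In scikit-learn, the rows of the confusion matrix are true classes and the columns are predicted classes. With the default label order, class 1 (benign) is treated as the positive class. The matrix helps identify: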

  • True Positives

  • True Negatives

  • False Positives

  • False Negatives
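
The four cells can be unpacked into named counts. The sketch below refits the Logistic Regression model so that it runs on its own, and it spells out what each cell means for this dataset, where benign is encoded as 1. The name malignant_recall is introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
log_model = LogisticRegression(max_iter=500)
log_model.fit(scaler.fit_transform(X_train), y_train)
log_preds = log_model.predict(scaler.transform(X_test))

# Rows are true classes, columns are predicted classes, in label order [0, 1]
cm = confusion_matrix(y_test, log_preds, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Label 1 (benign) is the "positive" class here, so:
#   fp = malignant tumours predicted as benign (missed cancers)
#   fn = benign tumours predicted as malignant (false alarms)
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")

# Recall for the malignant class: pass pos_label=0 explicitly
malignant_recall = recall_score(y_test, log_preds, pos_label=0)
print(f"Recall for malignant: {malignant_recall:.3f}")
```

In a screening context the missed cancers are usually the costlier error, so it is worth reporting Recall with pos_label=0 alongside the default metrics.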



What You Will Usually Observe

In many structured datasets:

  • Logistic Regression produces smoother, more stable predictions

  • Decision Trees capture complex patterns more easily

  • Trees may overfit training data

  • Logistic Regression often generalises better


Typical outcomes:

  • Logistic Regression achieves slightly higher ROC-AUC

  • Decision Trees achieve interpretable rule-based predictions

  • Trees can outperform when relationships are highly nonlinear
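
The overfitting risk can be checked by comparing training and test accuracy as the tree is allowed to grow deeper. This is a minimal sketch on the same split used in the tutorial; the depth values and the name depth_results are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

depth_results = {}
for depth in (2, 4, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    depth_results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}  test={test_acc:.3f}")
```

A growing gap between training and test accuracy as depth increases is the usual sign of overfitting. If test accuracy stops improving while training accuracy keeps climbing, the extra depth is memorising noise.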


When to Choose Logistic Regression

Use Logistic Regression when:

  • You need probability estimates

  • Interpretability matters

  • Data relationships are approximately linear

  • You want a strong baseline model

  • Overfitting risk must remain low


When to Choose Decision Trees

Use Decision Trees when:

  • Relationships are nonlinear

  • Feature interactions matter

  • Business users need rule-based explanations

  • Data contains mixed variable types

  • Minimal preprocessing is preferred


The Most Important Lesson

Model comparison is not about finding a universally “best” algorithm. It is about identifying which model performs best for a specific business problem, dataset structure, and error tolerance.

A healthcare model may prioritise Recall.

A fraud model may prioritise Precision.

A customer churn model may prioritise ROC-AUC.


The correct model depends on operational goals, not just accuracy.
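
Operational goals can also be expressed through the decision threshold rather than the choice of model. The sketch below assumes the goal is to miss as few malignant tumours as possible, and it shows how lowering the threshold on the predicted probability of malignancy trades Precision for Recall. The threshold values and the name threshold_recalls are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
log_model = LogisticRegression(max_iter=500)
log_model.fit(scaler.fit_transform(X_train), y_train)

# classes_ is [0, 1], so column 0 holds the predicted probability of malignant
malignant_probs = log_model.predict_proba(scaler.transform(X_test))[:, 0]
is_malignant = (y_test == 0).astype(int)

threshold_recalls = {}
for threshold in (0.5, 0.3, 0.1):
    # Flag a case as malignant when its probability reaches the threshold
    flagged = (malignant_probs >= threshold).astype(int)
    rec = recall_score(is_malignant, flagged)
    prec = precision_score(is_malignant, flagged, zero_division=0)
    threshold_recalls[threshold] = rec
    print(f"threshold={threshold}: recall={rec:.3f}  precision={prec:.3f}")
```

Lowering the threshold can only keep or increase Recall, because every case flagged at a higher threshold is still flagged at a lower one. The cost is that more benign cases are flagged as well, which is the trade-off the business context has to settle.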


Comparing Logistic Regression and Decision Trees on real data teaches one of the most important machine learning principles: evaluation matters more than algorithm hype.


Many beginners jump directly into complex models without properly benchmarking simpler methods.

In practice, Logistic Regression frequently performs surprisingly well, while Decision Trees provide interpretability and nonlinear flexibility.

The best machine learning engineers do not choose models emotionally. They compare them systematically using:

  • Fair train/test splits

  • Multiple evaluation metrics

  • Real datasets

  • Business-aware interpretation


That is how reliable machine learning systems are built.

