Decision Tree and Random Forest Regression

Introduction

While linear and logistic regression are powerful tools, they assume specific relationships between variables (linear for continuous outcomes, logistic for binary outcomes). What if the relationship is more complex? Enter decision tree regression and random forest regression - flexible methods that can capture non-linear patterns and interactions without explicit feature engineering.

Decision Tree Regression

What is Decision Tree Regression?

The Big Picture

Imagine you’re estimating house prices.

Instead of fitting a straight line (linear regression), you ask a series of yes/no questions:

  • Is the house bigger than 2000 sq ft?
  • Does it have more than 3 bedrooms?
  • Is it in neighborhood A or B?

Each question splits your data into smaller groups, and at the end, you predict the average price of houses in each final group.

How It Works: The Mechanics

A decision tree regression model works by:

  1. Splitting the data: Find the feature and threshold that best divides your data into two groups
  2. Recursing: Repeat the process on each group
  3. Stopping: When groups are “pure enough” or meet stopping criteria
  4. Predicting: For a new data point, follow the tree down to a leaf and return the average value of training points in that leaf

Example 1: Simple House Price Prediction

Let’s say we have this tiny dataset:

Square Feet   Bedrooms   Price ($1000s)
1200          2          200
1400          2          220
1600          3          280
1800          3          300
2000          4          380
2200          4          400


The tree might look like:

           [Square Feet < 1700?]
                /          \
              Yes           No
              /              \
         [Avg: 233K]    [Bedrooms < 4?]
                           /         \
                         Yes          No
                         /             \
                   [Avg: 300K]    [Avg: 390K]

Prediction for a 1500 sq ft, 2 BR house: Follow left → predict ≈$233K
Prediction for a 1900 sq ft, 3 BR house: Follow right → left → predict $300K
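
To make this concrete, here is a minimal sketch (the variable names and the max_depth=2 setting are choices for illustration, not part of the example above) that fits scikit-learn's DecisionTreeRegressor to the six rows and queries the two example houses. The thresholds and leaf averages the library learns may differ slightly from the hand-drawn tree.

Code
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tiny dataset from Example 1: [square feet, bedrooms] -> price in $1000s
X = np.array([[1200, 2], [1400, 2], [1600, 3],
              [1800, 3], [2000, 4], [2200, 4]])
y = np.array([200, 220, 280, 300, 380, 400])

# A shallow tree, comparable to the two-level tree sketched above
model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X, y)

# Predict the two example houses: 1500 sq ft / 2 BR and 1900 sq ft / 3 BR
print(model.predict([[1500, 2], [1900, 3]]))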

Example 2: Visualizing the Split

Consider predicting a person’s salary based on years of experience and education level.

Data points (Years_Experience, Education_Years, Salary):
(1, 12, 35K), (2, 12, 40K), (3, 16, 55K), (4, 16, 60K),
(5, 18, 70K), (6, 18, 75K), (10, 18, 90K), (12, 20, 110K)


The decision tree creates rectangular regions in the feature space:

        [Years_Experience < 4?]
             /            \
           Yes             No
           /                \
    [Edu_Years < 14?]   [Years_Exp < 8?]
       /        \           /         \
     Yes        No        Yes          No
     /          \         /             \
[Avg: 37.5K] [Avg: 55K] [Avg: 68.3K] [Avg: 100K]


Key Insight: Unlike linear regression which fits a single plane, decision trees partition the space into rectangles, each with its own prediction.

How Are Splits Chosen?

The algorithm minimizes residual sum of squares (RSS) at each split:

\[RSS = \sum_{i \in \text{left}} (y_i - \bar{y}_{\text{left}})^2 + \sum_{j \in \text{right}} (y_j - \bar{y}_{\text{right}})^2\]

Where \(\bar{y}_{\text{left}}\) and \(\bar{y}_{\text{right}}\) are the average values in each resulting group.


Intuition: Find the split that makes each side as “homogeneous” as possible - minimize the variance within each group.
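
As a sketch of how such a split could be found (a brute-force illustration, not scikit-learn's actual implementation), the loop below scores every candidate threshold of a single feature with the RSS formula above and keeps the best one:

Code
import numpy as np

def best_split_1d(x, y):
    """Brute-force search for the threshold of one feature that minimizes RSS."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_rss, best_threshold = np.inf, None
    # Candidate thresholds: midpoints between consecutive distinct x values
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue
        threshold = (x_sorted[i] + x_sorted[i - 1]) / 2
        left, right = y_sorted[:i], y_sorted[i:]
        rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if rss < best_rss:
            best_rss, best_threshold = rss, threshold
    return best_threshold, best_rss

# Square footage vs. price ($1000s) from Example 1
sqft = np.array([1200, 1400, 1600, 1800, 2000, 2200])
price = np.array([200, 220, 280, 300, 380, 400])

threshold, rss = best_split_1d(sqft, price)
print(threshold, rss)  # 1900.0 7000.0 -- the best first split on square footage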

Example 3: Temperature Prediction

Suppose we’re predicting daily high temperature based on the month:

Month   Temperature (°F)
1       35
2       38
3       45
4       58
5       70
6       80
7       85
8       83
9       75
10      60
11      48
12      37

A decision tree might split like this:

            [Month < 6.5?]
               /        \
             Yes         No
             /            \
      [Month < 3.5?]  [Month < 9.5?]
         /      \         /       \
       Yes      No      Yes        No
       /         \       /          \
    [Avg: 39°] [Avg: 69°] [Avg: 81°] [Avg: 48°]


This captures the non-linear relationship (winter → spring → summer → fall) that linear regression would struggle with.

Piecewise Constant Predictions

Understanding Discretized Outputs

Critical characteristic of decision tree regression: outputs are always discretized.

When you train a tree, each leaf node stores a single prediction value (the mean of all training samples in that leaf).

No matter how many different inputs you provide, the model can only produce as many distinct values as it has leaf nodes.


Key Insight: If your tree has 10 leaf nodes, it can only produce 10 unique predictions - ever.

Example 4: Sweeping Through Input Values

Let’s see this in action. We’ll take eight target values:

[10, 15, 25, 40, 60, 85, 115, 150]

and train a small decision tree with only 4 leaf nodes.

Code
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Simple quadratic relationship
X_train = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y_train = np.array([10, 15, 25, 40, 60, 85, 115, 150])  # grows quadratically

# Shallow tree with only 4 leaf nodes
tree = DecisionTreeRegressor(max_depth=2, random_state=42)
tree.fit(X_train, y_train)

# Sweep through 100 input values
X_test = np.linspace(0, 10, 100).reshape(-1, 1)
predictions = tree.predict(X_test)

print(f"Number of unique predictions: {len(np.unique(predictions))}")
print(f"Unique values: {np.unique(predictions)}")

Number of unique predictions: 4
Unique values: [ 16.66666667  50.         100.         150.        ]

Visualizing the Step Function

When you plot predictions from a decision tree, you see a step function:

Code
plt.figure(figsize=(8, 3))

# Plot the step function
plt.plot(X_test, predictions, 'r-', linewidth=2, label='Decision Tree', drawstyle='steps-post')

# Plot training data
plt.scatter(X_train, y_train, s=100, alpha=0.6, edgecolors='black', label='Training Data')


plt.xlabel('Input (x)', fontsize=12)
plt.ylabel('Prediction (y)', fontsize=12)
plt.title('Decision Tree Creates Step Functions', fontsize=14)
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.show()

  • The red line jumps at specific thresholds (decision boundaries); between the jumps, the prediction is constant

Why Does This Happen?

The tree partitions the input space into regions:

Region 1: x ≤ 3.5  →  predict 16.7  (avg of x=1,2,3)
Region 2: 3.5 < x ≤ 5.5  →  predict 50.0  (avg of x=4,5)
Region 3: 5.5 < x ≤ 7.5  →  predict 100.0  (avg of x=6,7)
Region 4: x > 7.5  →  predict 150.0  (only x=8)

All inputs falling into Region 1 get the exact same prediction, regardless of whether x = 0.5 or x = 3.4.
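
You can read this structure directly off the fitted tree, continuing the snippet above; get_n_leaves() and the tree_.threshold array are standard scikit-learn attributes (leaf entries hold a threshold of -2):

Code
# Continuing from the tree fitted above
print(tree.get_n_leaves())           # 4 leaves -> only 4 possible outputs

# Internal nodes store their split thresholds; leaf entries are -2
thresholds = tree.tree_.threshold
print(thresholds[thresholds != -2])  # the decision boundaries: 5.5, 3.5 and 7.5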

Critical Implications

This discretization has several important consequences:

  1. No True Extrapolation: The model can’t predict outside the range of training targets

    Code
    print(f"Min training value: {y_train.min()}")  # 10
    print(f"Max training value: {y_train.max()}")  # 150
    print(f"Min prediction: {predictions.min()}")  # 12.5 (close to 10)
    print(f"Max prediction: {predictions.max()}")  # 132.5 (close to 150)
    # Can NEVER predict 200, even if x=20!
    Min training value: 10
    Max training value: 150
    Min prediction: 16.666666666666668
    Max prediction: 150.0
  2. Step Functions Only: Predictions jump discontinuously at decision boundaries

    • Input: x = 3.49 → output: ≈16.7
    • Input: x = 3.51 → output: 50.0
    • A tiny change in input causes a large jump in output (see the sketch after this list)
  3. Limited Resolution: With k leaf nodes, you get exactly k possible outputs

    • Shallow tree (depth=2): ~4 leaves → 4 unique predictions
    • Medium tree (depth=5): ~32 leaves → 32 unique predictions
    • Deep tree (depth=10): ~1024 leaves → up to 1024 unique predictions
  4. Can’t Capture Smooth Trends: Even if the true function is smooth (like sin(x) or x²), the prediction will be jagged
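
The short continuation below, using the same fitted tree as before, demonstrates both the jump at a decision boundary and the clamping far outside the training range:

Code
# A tiny step across the 3.5 boundary changes the prediction by a lot...
print(tree.predict([[3.49], [3.51]]))   # ~[16.67 50.]

# ...while inputs far beyond the training range stay clamped to a leaf mean
print(tree.predict([[20.0]]))           # [150.] -- never 400, even though the trend continues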

Random Forest Regression

Wisdom of the Crowd

The Core Idea

Problem: Single trees are unstable and prone to overfitting.
Solution: Build many trees and average their predictions.

A random forest is an ensemble of decision trees, where each tree is:

  1. Trained on a bootstrap sample of the data (random sampling with replacement)
  2. At each split, considers only a random subset of features
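
The sketch below (illustrative only, not scikit-learn's actual RandomForestRegressor; the helper names fit_forest and predict_forest are made up for this example) shows both ingredients plus the averaging step:

Code
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=100, max_features="sqrt", random_state=0):
    """Minimal bagged-tree regressor (X, y are NumPy arrays)."""
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        # 1. Bootstrap sample: draw n rows at random, with replacement
        idx = rng.integers(0, len(X), size=len(X))
        # 2. Random feature subsets at each split, via max_features
        t = DecisionTreeRegressor(max_features=max_features)
        t.fit(X[idx], y[idx])
        trees.append(t)
    return trees

def predict_forest(trees, X):
    # The ensemble prediction is the average of the individual trees' predictions
    return np.mean([t.predict(X) for t in trees], axis=0)

scikit-learn's RandomForestRegressor implements the same idea, with extras such as parallel training and out-of-bag error estimates.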

Why Does This Work?

Intuition: If you ask 100 people to estimate something, their average is often better than most individuals. Random forests apply this “wisdom of the crowd” principle.

  • Bootstrap sampling creates diversity: each tree sees slightly different data
  • Random feature selection reduces correlation between trees
  • Averaging reduces variance while maintaining low bias

Example 5: Building a Random Forest

Dataset: Predicting miles per gallon (MPG) from horsepower, weight, and year.

Original data (8 cars):
(HP, Weight, Year, MPG)
(100, 2500, 2015, 30)
(150, 3000, 2015, 25)
(200, 3500, 2016, 20)
(120, 2600, 2016, 28)
(180, 3200, 2017, 22)
(110, 2700, 2017, 29)
(160, 3100, 2018, 24)
(140, 2800, 2018, 26)

Tree 1: Bootstrap sample (random with replacement)

  • Sample: rows [1, 1, 3, 4, 5, 7, 8, 8]
  • At root: randomly consider features [HP, Weight]
  • Best split: Weight < 2900 → …

Tree 2: Different bootstrap sample

  • Sample: rows [2, 2, 3, 4, 6, 6, 7, 8]
  • At root: randomly consider features [HP, Year]
  • Best split: HP < 140 → …

Tree 3: Another bootstrap sample

  • Sample: rows [1, 2, 4, 4, 5, 6, 7, 8]
  • At root: randomly consider features [Weight, Year]
  • Best split: Year < 2017 → …

… build 100 or 1000 such trees …

Prediction: For a car with (130, 2750, 2017):

  • Tree 1 predicts: 28.5
  • Tree 2 predicts: 27.0
  • Tree 3 predicts: 28.0
  • … (97 more trees) …
  • Final prediction: average of all 100 trees = 27.8
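
The tree-by-tree numbers above are only for illustration, but the averaging step is exactly how scikit-learn combines trees. The sketch below fits a forest to the 8-car table and checks that predict() equals the mean of the individual trees' predictions (the fitted trees are exposed as estimators_); the exact MPG estimate depends on the random seed.

Code
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# The 8-car table above: [HP, Weight, Year] -> MPG
X = np.array([[100, 2500, 2015], [150, 3000, 2015], [200, 3500, 2016],
              [120, 2600, 2016], [180, 3200, 2017], [110, 2700, 2017],
              [160, 3100, 2018], [140, 2800, 2018]])
y = np.array([30, 25, 20, 28, 22, 29, 24, 26])

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

new_car = np.array([[130, 2750, 2017]])
per_tree = np.array([t.predict(new_car)[0] for t in rf.estimators_])

print(rf.predict(new_car)[0])                                # the forest's MPG estimate
print(np.isclose(per_tree.mean(), rf.predict(new_car)[0]))   # True: it is the per-tree average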

Example 6: Reducing Overfitting

Consider a noisy quadratic example, \(y = x^2 + \text{noise}\):

Single deep tree: Memorizes noise, poor generalization
Random forest:

Tree 1 (bootstrap sample 1): slightly different predictions
Tree 2 (bootstrap sample 2): slightly different predictions
...
Tree 100 (bootstrap sample 100): slightly different predictions

Average prediction: Smooths out the noise, captures the x² trend

Result: Random forests are much more robust to overfitting than individual trees.
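
A small experiment makes this concrete (a sketch on freshly generated noisy quadratic data, not results from the text): a fully grown single tree chases the noise, while the 100-tree average typically generalizes better.

Code
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 300).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 5, size=300)

# Simple hold-out split (the rows are already in random order)
X_train, X_test, y_train, y_test = X[:200], X[200:], y[:200], y[200:]

deep_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)  # fully grown
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# The averaged forest typically has noticeably lower test error than the single deep tree
print("Single tree test MSE:", mean_squared_error(y_test, deep_tree.predict(X_test)))
print("Forest test MSE:     ", mean_squared_error(y_test, forest.predict(X_test)))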

Hyperparameters to Tune

For Decision Trees:

  1. max_depth: Maximum depth of the tree
    • Too shallow: underfitting
    • Too deep: overfitting
    • Example: max_depth=5 often works well
  2. min_samples_split: Minimum samples required to split a node
    • Higher values: simpler trees
    • Example: min_samples_split=20
  3. min_samples_leaf: Minimum samples required in a leaf
    • Higher values: smoother predictions
    • Example: min_samples_leaf=10

For Random Forests:

  1. n_estimators: Number of trees in the forest
    • More trees: better performance but slower
    • Example: n_estimators=100 (common default)
  2. max_features: Number of features to consider for each split
    • Common rules of thumb: sqrt(p) features for classification, p/3 for regression (scikit-learn's RandomForestRegressor defaults to considering all features)
    • Lower values: more diversity between trees
  3. bootstrap: Whether to use bootstrap samples
    • Default: True (strongly recommended)
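
In practice these settings are usually chosen by cross-validation. The sketch below uses scikit-learn's GridSearchCV; the grid values are arbitrary examples, and X_train / y_train are assumed to already exist (for instance from Example 7 below):

Code
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Candidate values for the hyperparameters discussed above
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,                              # 5-fold cross-validation
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)
search.fit(X_train, y_train)           # assumes X_train, y_train are already defined
print(search.best_params_)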

Example 7: Python Code Walkthrough

Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data: y = sin(x) + noise
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 100)).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Single Decision Tree
dt = DecisionTreeRegressor(max_depth=3, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)

# Random Forest
rf = RandomForestRegressor(
    n_estimators=100, 
    max_depth=3, 
    random_state=42
)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluate
print(f"Decision Tree R²: {r2_score(y_test, y_pred_dt):.3f}")
print(f"Random Forest R²: {r2_score(y_test, y_pred_rf):.3f}")

print(f"Decision Tree RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_dt)):.3f}")
print(f"Random Forest RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_rf)):.3f}")

# Visualize predictions
X_plot = np.linspace(0, 10, 500).reshape(-1, 1)
y_dt = dt.predict(X_plot)
y_rf = rf.predict(X_plot)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, alpha=0.5, label='Train')
plt.plot(X_plot, y_dt, 'r-', linewidth=2, label='Decision Tree')
plt.plot(X_plot, np.sin(X_plot), 'g--', alpha=0.5, label='True function')
plt.legend()
plt.title('Decision Tree Regression')

plt.subplot(1, 2, 2)
plt.scatter(X_train, y_train, alpha=0.5, label='Train')
plt.plot(X_plot, y_rf, 'b-', linewidth=2, label='Random Forest')
plt.plot(X_plot, np.sin(X_plot), 'g--', alpha=0.5, label='True function')
plt.legend()
plt.title('Random Forest Regression')

plt.tight_layout()
plt.show()
Decision Tree R²: 0.883
Random Forest R²: 0.897
Decision Tree RMSE: 0.205
Random Forest RMSE: 0.193

Expected Output:

  • Decision Tree: Step-like predictions (piecewise constant)
  • Random Forest: Smoother predictions averaging many step functions

Example 8: Real-World Application - Housing Prices

Code
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Note: No need to scale features for tree-based methods!
# (But we would need to for linear regression)

# Train Random Forest
rf = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")

# Feature importance
importances = rf.feature_importances_
features = housing.feature_names

for name, imp in sorted(zip(features, importances), 
                        key=lambda x: x[1], 
                        reverse=True):
    print(f"{name}: {imp:.3f}")
R² Score: 0.773
RMSE: 0.545
MedInc: 0.600
AveOccup: 0.140
Latitude: 0.076
Longitude: 0.076
HouseAge: 0.047
AveRooms: 0.030
AveBedrms: 0.016
Population: 0.016

Interpretation: Feature importance shows which variables matter most for prediction. It provides a quick ranking of predictive relevance even when relationships are non-linear, complementing linear regression coefficients, which can be hard to compare when features are correlated or measured on different scales.

When to Use Which Method?

Use Decision Trees when:

  • ✓ Interpretability is crucial
  • ✓ You have limited data
  • ✓ You need a quick baseline model
  • ✓ Features are mostly categorical

Use Random Forests when:

  • ✓ Predictive accuracy is the priority
  • ✓ You have enough data (100+ samples)
  • ✓ The relationship is complex/non-linear
  • ✓ You need feature importance rankings
  • ✓ You want robust predictions with less tuning

Use Linear Regression when:

  • ✓ The relationship is truly linear
  • ✓ You need to extrapolate beyond training data
  • ✓ You need smooth predictions
  • ✓ Coefficient interpretation is important

Key Takeaways

  1. Decision trees partition the feature space into rectangles and predict the average within each region
  2. Splitting is done greedily to minimize variance (RSS) in resulting groups
  3. Single trees are interpretable but prone to overfitting and instability
  4. Random forests build many diverse trees through bootstrapping and random feature selection
  5. Averaging predictions reduces variance while maintaining the trees’ ability to capture non-linearity
  6. No feature scaling needed for tree-based methods
  7. Feature importance is automatically computed and very useful
  8. Hyperparameter tuning is important but random forests are fairly robust to default settings

Summary

Decision tree and random forest regression are powerful tools in the data scientist’s toolkit. They handle non-linearity naturally, require minimal preprocessing, and provide excellent predictive performance. While single trees offer interpretability, random forests sacrifice some interpretability for superior accuracy and robustness. Understanding when and how to use these methods will significantly expand your regression modeling capabilities.
