Prerequisites

This post relies on you understanding Decision Trees, so make sure you've got the basics down from that post before continuing here.

What is it

Usually people use decision trees because they're easy to set up and understand. They're practically just if-else statements that get generated automatically.

However, one area where decision trees fall short is predicting on new examples. They tend to fit the data you give them as tightly as possible, which leaves them fragile when new examples deviate from it even slightly.

                 Decision tree          Random forest
Structure        One tree               Many trees voting
Readability      Very interpretable     Harder to interpret
Variance         High                   Lower
Generalization   Can overfit easily     More robust on new data
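
To see that variance gap in action, here's a minimal sketch that pits a single unconstrained tree against a forest on held-out data (make_classification is just a synthetic stand-in for a real dataset):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data, just for the comparison
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# an unconstrained tree tends to memorize the training set and slip on new data;
# the forest usually holds up better on the held-out split
print(f"tree:   train={tree.score(X_train, y_train):.2f}  test={tree.score(X_test, y_test):.2f}")
print(f"forest: train={forest.score(X_train, y_train):.2f}  test={forest.score(X_test, y_test):.2f}")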

How it works

A random forest is a bunch of decision trees that vote on the correct answer. The catch? No two trees are the same.

You randomly debuff (inhibit) your decision trees by:

  • not showing each tree your entire dataset - every tree gets a random sample of the rows, drawn with replacement (this is called bootstrapping)
  • not giving them access to all the variables - each split only considers a random subset of the features

These two tricks introduce variability into your system, so no single quirk of the data and no single dominant feature can steer every tree the same way. There's a hand-rolled sketch of both below.
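
Here's a rough hand-rolled version of those two tricks (scikit-learn does all of this for you internally; the synthetic data and the 100-tree count are just illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
rng = np.random.default_rng(42)

trees = []
for i in range(100):
    # 1. bootstrap: each tree sees a random sample of rows, drawn with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # 2. feature subsampling: each split only considers sqrt(n_features) features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# majority vote: the most common prediction across trees wins
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
majority = np.apply_along_axis(
    lambda col: np.bincount(col.astype(int)).argmax(), 0, votes
)
print("training accuracy:", (majority == y).mean())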

Example

Imagine wanting to ride a rollercoaster that has a strict height threshold: if you're 2 cm too short, you can't ride. Maybe in cases like these, they also measure your grip strength, so if you're below the height minimum but do bouldering as a hobby, maybe they'll let you on anyway. That's the forest idea in miniature: several imperfect checks voting together beat one rigid rule.

Where else can I use them

I saw a StatQuest video on how you can use random forests to:

  • fill in missing data - say something messed up during recording and a sensor reading got corrupted
  • cluster your data - you may want to group samples to see emerging patterns

Both rely on the forest's proximity matrix: two samples count as similar when they keep landing in the same leaves, tree after tree. A sketch of that idea is below.
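
Here's a minimal sketch of the proximity idea (the iris dataset is just a stand-in): model.apply tells you which leaf each sample lands in for every tree, and the fraction of trees where two samples share a leaf is their proximity.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# leaf index of every sample in every tree: shape (n_samples, n_trees)
leaves = model.apply(X)

# proximity[i, j] = fraction of trees where samples i and j share a leaf
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# 1 - proximity behaves like a distance you can feed to a clustering algorithm;
# for imputation, fill a missing value with a proximity-weighted average of
# the other samples' values for that feature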

Anti-overfitting knobs (the important ones)

Since a random forest is made out of multiple decision trees, you get access to the same knobs you had when training individual trees, plus a few forest-level ones.

Common control knobs (a quick tuning sketch follows the list):

  • n_estimators - how many trees get to vote (more trees usually = more stable predictions)
  • max_depth - limits how many levels deep each tree can grow
  • min_samples_split - minimum samples required to create a new split
  • min_samples_leaf - minimum samples allowed in each final leaf
  • max_leaf_nodes - caps the total number of leaves per tree
  • max_features - how many features get considered at each split
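
If you'd rather not tune these by hand, here's a minimal sketch of a cross-validated grid search over a few of them (the grid values and the synthetic dataset are just plausible placeholders, not recommendations):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [4, 8, None],       # None = grow until the stopping rules kick in
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,        # 5-fold cross-validation
    n_jobs=-1,   # use all CPU cores
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))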

Downsides

Random forests are cool and all, but they can still struggle with very complex patterns, while also becoming a bit harder to read compared to decision trees.

In my opinion, if you get to the point where you need 20+ trees that are more than 10 levels deep to predict something and accuracy is still low, you may want to start looking into neural networks.

Quick starter code (scikit-learn)

If one label shows up way more than the others, class_weight="balanced" is worth a shot - otherwise the forest can get away with always voting for the common one.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
 
# X = your feature matrix, y = labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
 
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=4,
    min_samples_leaf=10,
    random_state=42,
    # class_weight="balanced",  # uncomment if classes are imbalanced
)
 
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

Additional Resources