What’s the difference between a classification tree and a regression tree?

A classification tree predicts a category (and often a probability per class). A regression tree predicts a number, typically using the average target value within each leaf.

How does a decision tree decide where to split?

At each node, the tree tests possible features and thresholds and chooses the split that most improves its objective, such as reducing impurity for classification or reducing error for regression.

Why do decision trees overfit so easily?

Trees can keep splitting until they capture very small patterns in the training data. Without constraints, they can memorise noise, leading to strong training results but weaker performance on new data.

Decision Trees in Machine Learning (2026 Guide)

Updated on January 31, 2026. 5-minute read.

Decision trees are a practical machine learning model for both classification and regression. In 2026, they are still widely used for explainable baselines and as building blocks inside stronger ensemble methods.

A decision tree learns a set of simple rules that split data into smaller groups. Each split asks a question about one feature, then routes examples down the branch that matches the answer.

What a decision tree is

You can think of a decision tree as a flowchart. It starts at a root node, tests one condition at a time, and ends at a leaf that produces a prediction.

This structure is valuable when you need something you can inspect, explain, and debug. It is also a strong starting point before moving to ensembles.

Core parts of a tree

Root node: the first node that contains the full training set.
Split rule: a condition on a feature, such as x <= threshold.
Internal node: a node that sends samples to child nodes.
Branch: the path selected by the split rule outcome.
Leaf node: the final output (class label, probability, or number).

How a tree learns from data

Training a decision tree means choosing splits that reduce uncertainty about the target. Most implementations do this greedily by selecting the best split now, then repeating on each resulting subset.

This approach can capture non-linear patterns without heavy feature scaling. It can also overfit if the tree grows too deep without constraints.

Common split criteria

For classification, trees often split to create purer class groups. Two common criteria are Gini impurity and entropy, where the reduction in entropy produced by a split is called information gain.

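For regression, trees often split to reduce prediction error within nodes. This is typically done by minimizing a loss such as mean squared error (MSE), which for leaves that predict the mean is equivalent to reducing the variance of the targets within each node.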
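
As a rough illustration of these criteria, the sketch below computes Gini impurity, entropy, and MSE for the samples in a single node. The function names are hypothetical helpers written for this example, not part of any library.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(
        (count / n) * math.log2(count / n) for count in Counter(labels).values()
    )


def mse(targets):
    """Mean squared error of predicting the node mean (the node variance)."""
    n = len(targets)
    if n == 0:
        return 0.0
    mean = sum(targets) / n
    return sum((t - mean) ** 2 for t in targets) / n


mixed_node = ["a", "a", "b", "b"]   # evenly mixed: maximally impure for 2 classes
pure_node = ["a", "a", "a", "a"]    # one class only: impurity is zero
```

A split is scored by comparing the parent's impurity with the sample-weighted impurity of its children; the larger the drop, the better the split.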

Stopping conditions

A tree usually stops growing when any of the following becomes true:

A maximum depth is reached.
A node has too few samples to split reliably.
A split no longer improves the chosen criterion in a meaningful way.
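
The greedy procedure and the three stopping conditions above can be sketched in plain Python. This is a minimal illustration that assumes numeric features, Gini impurity, and midpoint thresholds; `best_split`, `build`, and `min_gain` are hypothetical names, and real libraries are considerably more optimized.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Greedy search: try every feature and every midpoint threshold."""
    best = None  # (weighted_child_impurity, feature_index, threshold)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build(rows, labels, depth=0, max_depth=3, min_samples_split=2, min_gain=1e-9):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: maximum depth reached, or too few samples to split reliably.
    if depth >= max_depth or len(labels) < min_samples_split:
        return {"leaf": majority}
    split = best_split(rows, labels)
    # Stop: no candidate split, or the impurity improvement is negligible.
    if split is None or gini(labels) - split[0] < min_gain:
        return {"leaf": majority}
    _, feature, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > threshold]
    children = []
    for side in (left, right):
        side_rows = [r for r, _ in side]
        side_labels = [y for _, y in side]
        children.append(
            build(side_rows, side_labels, depth + 1, max_depth, min_samples_split, min_gain)
        )
    return {"feature": feature, "threshold": threshold,
            "left": children[0], "right": children[1]}


# Toy data: feature 0 separates the classes, feature 1 does not.
rows = [[1, 5], [2, 6], [3, 5], [8, 6], [9, 5], [10, 6]]
labels = ["a", "a", "a", "b", "b", "b"]
tree = build(rows, labels)
stump = build(rows, labels, max_depth=0)  # depth limit forces a single leaf
```

On this toy data the builder splits once on feature 0 and then stops, because both children are already pure and no further split improves the criterion.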

How predictions are made

Prediction follows the rules the tree learned during training. A new example starts at the root, applies the split condition, then moves down the matching branch until it reaches a leaf node.

At that leaf, the model returns its final output. The exact output depends on whether you are doing classification or regression.
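
To make the traversal concrete, here is a small sketch that walks a hand-written tree stored as nested dictionaries. The tree, its feature names, and its thresholds are invented for illustration and were not learned from data.

```python
def predict(node, example):
    """Follow split rules from the root until a leaf is reached."""
    while "leaf" not in node:
        go_left = example[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


# Hypothetical tree: root tests age, its right child tests income.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"leaf": "declined"},
    "right": {
        "feature": "income", "threshold": 50_000,
        "left": {"leaf": "declined"},
        "right": {"leaf": "approved"},
    },
}

young_applicant = {"age": 25, "income": 90_000}
older_applicant = {"age": 40, "income": 60_000}
```

The path taken for each example (for instance, "age > 30, then income > 50,000") is exactly the explanation you can hand to a stakeholder.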

Classification output

A classification leaf typically stores the class distribution of the training samples that landed there. The tree can output a label and often a probability per class based on that distribution.

Regression output

A regression leaf typically stores a numeric estimate derived from the targets in that leaf. The most common estimate is the mean target value.
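
The two kinds of leaf output described above can be sketched as follows, assuming a leaf simply keeps the training targets that reached it. Both function names are hypothetical.

```python
from collections import Counter


def classification_leaf(labels):
    """Return (predicted_label, per-class probabilities) from leaf class counts."""
    counts = Counter(labels)
    total = len(labels)
    proba = {cls: n / total for cls, n in counts.items()}
    label = max(proba, key=proba.get)  # most frequent class in the leaf
    return label, proba


def regression_leaf(targets):
    """Return the mean target value of the training samples in the leaf."""
    return sum(targets) / len(targets)


label, proba = classification_leaf(["spam", "spam", "spam", "ham"])
estimate = regression_leaf([10.0, 20.0, 30.0])
```

Because the probabilities are just class fractions within a leaf, leaves that hold few samples produce coarse and often overconfident estimates.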

Working with different feature types

Decision trees can handle a mix of numerical and categorical inputs, but the practical behavior depends on the library you use and how you preprocess data.

Before training, check how your toolkit expects categorical features and what it does with missing values.

Numerical features

For numeric columns, trees search for a threshold that creates the best split. That is why trees can model curved or step-like relationships without manually adding polynomial features.
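
As a small check of this behavior, the sketch below fits a one-split scikit-learn regression tree to a synthetic step function and reads back the learned threshold. The data is made up for the example; `tree_.threshold` is the fitted tree's array of per-node thresholds.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic step function: y jumps from 0 to 10 between x = 4 and x = 5.
x = np.arange(10, dtype=float).reshape(-1, 1)
y = np.where(x.ravel() < 5, 0.0, 10.0)

stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(x, y)
threshold = stump.tree_.threshold[0]  # threshold at the root node

# Rescaling the feature moves the threshold but not the predictions.
scaled_stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(x * 1000, y)
same_predictions = np.array_equal(stump.predict(x), scaled_stump.predict(x * 1000))

print(f"learned threshold: {threshold}")
print(f"predictions unchanged after rescaling: {same_predictions}")
```

The learned threshold falls between the two values where the target jumps, and multiplying the feature by a constant leaves the predictions unchanged.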

Categorical features

Some libraries require encoding categorical variables (for example, one-hot or ordinal encoding). Others can handle categoricals more directly, depending on the implementation.

If your tree produces confusing rules on categoricals, review your encoding and watch for very high-cardinality columns.
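
One possible setup, assuming scikit-learn and pandas, is to one-hot encode the categorical column inside a pipeline so the same encoding is applied at training and prediction time. The column names and values are invented for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical and one numeric feature.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green"],
    "size_cm": [1.0, 5.0, 3.0, 6.0, 1.5, 2.5],
})
y = [0, 1, 0, 1, 0, 0]

preprocess = ColumnTransformer(
    # handle_unknown="ignore" encodes unseen categories as all zeros.
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",  # keep the numeric column unchanged
)
model = make_pipeline(preprocess, DecisionTreeClassifier(max_depth=2, random_state=0))
model.fit(df, y)

# A category that never appeared in training does not raise an error.
new_rows = pd.DataFrame({"color": ["purple"], "size_cm": [5.5]})
pred = model.predict(new_rows)
```

With one-hot encoding, each rule reads as "is this category or not", which tends to be easier to interpret than thresholds on arbitrary ordinal codes.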

Missing values

Some implementations can route missing values through a dedicated path. Others require you to impute missing values before training.

Missingness can carry information, so treat this as a modeling decision. When in doubt, compare a simple imputation baseline to more careful handling.
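
A simple imputation baseline, assuming scikit-learn, might look like the sketch below. It fills gaps with the column median and, via `add_indicator=True`, appends binary "was missing" columns so the tree can still split on missingness. The data is synthetic.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with gaps in both columns.
X = np.array([
    [1.0, 20.0],
    [2.0, np.nan],
    [np.nan, 22.0],
    [8.0, 40.0],
    [9.0, np.nan],
    [10.0, 41.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Standalone view of what the imputer produces: 2 filled columns + 2 indicators.
X_imputed = SimpleImputer(strategy="median", add_indicator=True).fit_transform(X)

# In practice, keep imputation inside the pipeline so it is fit on training data only.
model = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    DecisionTreeClassifier(max_depth=2, random_state=0),
)
model.fit(X, y)
predictions = model.predict(X)
```

Comparing this baseline against your library's native missing-value handling, if it has any, is a cheap way to see whether the extra care pays off.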

Overfitting and how to prevent it

A single deep tree can memorize patterns that do not generalize. You may see excellent training performance and disappointing results on new data.

In practice, you reduce overfitting by limiting complexity and validating with a clean evaluation setup.

High-impact regularization settings

These are the first knobs to try in most toolkits:

max_depth: limits how many decisions the model can stack.
min_samples_split: requires a minimum number of samples to split a node.
min_samples_leaf: ensures leaves represent meaningful groups of samples.
max_features: limits the features considered at each split. This adds randomness to split selection, which helps most when many trees are combined in an ensemble such as a random forest.
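
To see what these settings do, the sketch below compares an unconstrained scikit-learn tree with a constrained one on synthetic noisy data. The dataset parameters and the specific values chosen for each setting are arbitrary assumptions for illustration, not recommended defaults.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.1 randomly reassigns the class of about 10% of samples.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
constrained = DecisionTreeClassifier(
    max_depth=4,            # at most 4 stacked decisions
    min_samples_split=20,   # a node needs 20 samples before it may split
    min_samples_leaf=10,    # every leaf keeps at least 10 samples
    max_features=None,      # consider all features at each split
    random_state=0,
).fit(X_train, y_train)

for name, model in [("unconstrained", unconstrained), ("constrained", constrained)]:
    print(
        f"{name:<14} depth={model.get_depth():<3} leaves={model.get_n_leaves():<4} "
        f"train={model.score(X_train, y_train):.3f} test={model.score(X_test, y_test):.3f}"
    )
```

The unconstrained tree fits the training set perfectly; the gap between its training and test accuracy is the overfitting signal to watch for when tuning.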

Pruning

Pruning removes branches that do not improve generalization. This can be done during training (pre-pruning via constraints) or after training (post-pruning).

If your library offers cost-complexity pruning or similar options, tune it with cross-validation to avoid fitting the validation set by accident.
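
In scikit-learn, one way to do this is to take candidate `ccp_alpha` values from `cost_complexity_pruning_path` and select among them with cross-validation on the training split only. The dataset choice and fold count below are assumptions made for the sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Grow a full tree and list the alphas at which subtrees get pruned away.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# Clip guards against tiny negative values from floating-point error.
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))

# Pick alpha by 5-fold cross-validation; the test split stays untouched.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,
)
search.fit(X_train, y_train)
pruned_tree = search.best_estimator_

print(f"best ccp_alpha: {search.best_params_['ccp_alpha']:.5f}")
print(f"nodes: full={full_tree.tree_.node_count}, pruned={pruned_tree.tree_.node_count}")
print(f"test accuracy (pruned): {pruned_tree.score(X_test, y_test):.3f}")
```

Larger `ccp_alpha` values prune more aggressively, so the search trades a little training fit for a smaller tree.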

Interpretability and feature importance

Decision trees are popular because a single prediction can be explained as a short path of rules. That is useful for stakeholder communication and auditing.

Be careful with feature importance metrics from trees. They can overemphasize features with many possible split points, such as high-cardinality categories.
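
A common cross-check, sketched here with scikit-learn, is to compare the tree's built-in impurity-based importances with permutation importance measured on held-out data. The dataset and `n_repeats` value are arbitrary choices for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0
)
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Impurity-based importance: computed from training-time splits only.
impurity_importance = model.feature_importances_

# Permutation importance: drop in held-out score when one feature is shuffled.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

top = impurity_importance.argsort()[::-1][:5]
for i in top:
    print(
        f"{data.feature_names[i]:<25} "
        f"impurity={impurity_importance[i]:.3f} "
        f"permutation={perm.importances_mean[i]:.3f}"
    )
```

When the two rankings disagree sharply for a feature, treat its impurity-based score with suspicion before reporting it.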

Trees in ensembles: random forests and boosting

Even when you do not deploy a single tree, understanding trees matters because many high-performing tabular models are built from them.

Random forests reduce variance by averaging many trees trained on different samples and feature subsets.
Gradient boosting builds trees sequentially, where each new tree focuses on correcting the errors of the previous model.

Ensembles are often more accurate, but they can be harder to explain than one small tree.
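
To compare a single tree against both ensemble types on your own data, a sketch along these lines may help. It assumes scikit-learn and uses a bundled dataset with mostly default settings, so the numbers it prints are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    # Averages many trees fit on bootstrap samples and feature subsets.
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Adds trees one at a time, each fit to the current model's errors.
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name:<18} mean CV accuracy = {score:.3f}")
```

Using the same folds and metric for every model keeps the comparison fair; whether the accuracy gain justifies the loss of a readable rule set depends on the use case.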

A practical workflow checklist

Use this checklist to keep your modeling work grounded:

Define the task and metric (classification vs regression).
Train a small, shallow tree as a baseline.
Validate with a holdout set or cross-validation.
Constrain depth and leaf size before searching broadly.
Inspect errors and confirm there is no data leakage.
Compare against a random forest or boosting model if you need more accuracy.

Quick scikit-learn example

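    from sklearn.datasets import load_iris
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Example data only: swap in your own feature matrix X and target vector y.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42
    )

    # A shallow tree keeps the baseline readable and limits overfitting.
    model = DecisionTreeClassifier(max_depth=4, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)          # hard class labels
    y_proba = model.predict_proba(X_test)   # per-class probabilities from the leaves

    print(classification_report(y_test, y_pred))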

Learn more with Code Labs Academy

If you want structured practice building and evaluating models on real datasets, explore Code Labs Academy’s Data Science & AI Bootcamp.