## Introduction

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

## Entropy

The goal of the training is to find the best splits in the nodes in order to find the most optimal tree. The splits are done using some criteria such as: **Entropy**.

There exist many definitions of entropy such as:

- Entropy corresponds to the amount of information contained in a source of information.
- Entropy can also be seen as the randomness or the measuring of surprise in a set.
- Entropy is a metric that measures the unpredictability or impurity in the system.

In decision trees, we will consider entropy as the measure of the purity inside of a node. The goal of the decision tree model is to reduce the entropy of the nodes at each split:

Thus, we want to maximize the difference between the entropy of the parent node and the entropy of the child nodes. This difference is called the **Information gain**.

The Entropy $H$ of a set $X$ is mathematically formulated as follow:

$$ H(X) = - \sum\limits_{x \in X} p(x) \log p(x) $$

## Information gain

Information Gain is the **difference** between the **entropy of the parent node** and the **weighted sum** of the **entropies of the chlid nodes**, and thus it can be formulated as follow:

$$IG(Y, X) = H(Y) - \sum_{x \in unique(X)} P(x|X) \times H(Y | X = x)$$ $$= H(Y) - \sum_{x \in unique(X)} \frac{X.count(x)}{len(X)} \times H(Y[X == x])$$ where:

- $H(.)$ is the entropy.
- $Y$ is the population prior to the split, it represents the parent node.
- $X$ is the variable that we want to use for the splitting.
- $x$ is a unique value of X.
- $Y[X==x]$ is a splitted list with only $x$ values.

let's take a proper example:

We are going to calculate the Information Gain when we divide the parent node by using the values of X: $$IG(parent, X) = H(parent) - \sum_{x \in unique(X)} P(x|X) \times H(parent | X = x)$$

First, we calculate the entropy of the parent node:
$$ H(parent) = - P(Y=Blue) \times \log P(Y=Blue) - P(Y=Yellow) \times \log P(Y=Yellow) $$
$$ = - \frac{11}{21} \times \log(\frac{11}{21}) - \frac{10}{21} \times \log(\frac{10}{21}) = 0.3 $$

Then, we are going to calculate the internal probability of each child node after the split by using the unique values of X:
$$ unique(X) = [Circle, Square] $$

$$ \sum_{x \in unique(X)} P(x|X) \times H(Y | X = x) = P(Square|X) \times H(Y | X = Square) $$ $$ + P(Circle|X) \times H(Y | X = Circle) $$ $$ = \frac{9}{21} \times H(Y | X = Square) + \frac{12}{21} \times H(Y | X = Circle) $$

Such as:

- $H(Y | X = Square)$ : represents the entropy of the first child node.
- $H(Y | X = Circle)$ : represents the entropy of the second child node.

We start with the first child node:

$$ H(Y | X = Square) = - P(Y=Blue | X = Square) \times \log P(Y=Blue| X = Square) $$ $$ - P(Y=Yellow| X = Square) \times \log P(Y=Yellow| X = Square) $$ $$ = - \frac{7}{9} \times \log\frac{7}{9} - \frac{2}{9} \times \log\frac{2}{9} = 0.23 $$

And then the second child node:

$$ H(Y | X = Circle) = - P(Y=Blue | X = Circle) \times \log P(Y=Blue| X = Circle) $$ $$ - P(Y=Yellow| X = Circle) \times \log P(Y=Yellow| X = Circle) $$ $$ = - \frac{4}{12} \times \log\frac{4}{12} - \frac{8}{12} \times \log\frac{8}{12} = 0.28 $$

Finally, we substitute the entropies in the Information Gain formula:

$$IG(parent, X) = H(parent) - \sum_{x \in unique(X)} P(x|X) \times H(parent | X = x)$$ $$ = 0.3 - \frac{9}{21} \times 0.23 - \frac{12}{21} \times 0.28 = 0.041 $$

As stated before, the objective of a node split is to maximize the Information Gain, and thus minimize the Entropy in the resulting children node. To do this, we need to try and split the node with different sets of inputs $ X_1, X_2, \ldots, Xn $ and we only keep the split that maximizes the Information Gain:

$$ X^{*} = \underset{X_i}{\operatorname{argmax}} IG(Y, X_i) $$

## When to stop splitting

The node splitting in decision trees is a recursive, so there must be a criteria which we can use in order to stop the splitting. These some of the most implemented criteria:

**When the node is pure:**H(node) = 0. It's pointless to split the node any further.**Maximum number of depth:**We can set a maximum depth that the model can reach, it means that even if the node is not pure the splitting is stopped.**Minimum number of samples per node:**We can also set a minimum number $N$ of samples per node. If the number of samples per node is equal to $N$ then we stop splitting even if the node is not pure.

By the end of the training ( the splitting ), each node that relies on the end of the decision tree is called a "Leaf", because it is not a root to any subtree. Each leaf will represent yield the class with the most samples.

## Conclusion

Decision tree is one of the most famous machine learning algorithms due to its efficiency, intuitive background and simple implementation. This algorithm can further be used with numerical independent variables ( Gaussian Decision Tree ), and it can be extended to solve regression tasks as well.