Understanding Entropy and Information Gain
tags: #ML/supervised/classification/trees
What is entropy?
Entropy is a measure of the disorder in a system.
In information theory, it quantifies how much information is still available in the system to be extracted.
- In the context of classification, where you place the decision boundary corresponds to the amount of information still available to be extracted; i.e., entropy is used as a measure of the impurity or randomness of a set of data.
- This gives us an estimate of how good the decision boundary is and is a commonly used measure in decision trees and random forests.
- When the entropy is zero, no further information can be extracted and there is NO misclassification, i.e., all of the information has been extracted from the data.
How do we measure entropy?
Entropy can be computed as the negative sum, over all possible class outcomes, of the probability of each outcome multiplied by the logarithm (base 2) of that probability:
$$H = -\sum_{i} p_i \log_2 p_i$$
Since the probability of any event is always between 0 and 1, the logarithm of the probability will always be negative. Therefore, the negative sign is used to ensure that the entropy is a positive value.
Example:
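As a minimal sketch of this computation (the `entropy` helper below is illustrative, not from the source):

```python
import math

def entropy(probabilities):
    """Shannon entropy (base 2) of a discrete class distribution.

    Zero-probability classes are skipped, since lim p->0 of p*log2(p) is 0.
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# A pure node (a single class) has zero entropy: nothing left to extract.
print(entropy([1.0]))        # 0.0
# A 50/50 two-class split is maximally impure: 1 bit of entropy.
print(entropy([0.5, 0.5]))   # 1.0
```

Note that the probabilities here are the class proportions within a node, so they must sum to 1.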
To measure the entropy of the system after a split, we compute the weighted entropy:
$$H_{split} = \sum_{i} w_i H_i$$
where,
$H_i$ is the entropy of the $i$-th group and $w_i$ is the proportion of data points that belong to the $i$-th group.
See example: INF2179 Entropy and Decision Tree Lecture
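A short sketch of the weighted entropy after a split, assuming each branch of the split is given as a list of class labels (the helper names are my own, not from the lecture):

```python
import math
from collections import Counter

def entropy(probabilities):
    """Shannon entropy (base 2) of a discrete distribution."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def weighted_entropy(groups):
    """Weighted entropy of a split: sum over groups of w_i * H_i,
    where w_i is the group's share of the data and H_i its entropy."""
    total = sum(len(g) for g in groups)
    h = 0.0
    for g in groups:
        counts = Counter(g)
        probs = [c / len(g) for c in counts.values()]
        h += (len(g) / total) * entropy(probs)
    return h

# A perfect split: each branch is pure, so the weighted entropy is 0.
print(weighted_entropy([["a", "a"], ["b", "b"]]))  # 0.0
```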
Feature importance: What is information gain?
Information gain refers to the change in information BEFORE and AFTER the split.
This measures the reduction in entropy (or increase in information) that results from splitting the dataset based on a given feature.
This quantifies how much a given feature contributes to the overall classification accuracy of a decision tree or other machine learning model.
A higher information gain indicates that a feature is more informative, or in other words, more useful for predicting the target variable.
Example (illustrative values):
- Suppose the entropy before a split is $H_{before} = 3$ and the entropy after the split is $H_{after} = 1$; then the IG is 2.
- This is the difference in entropy before and after the split.
This is a commonly used metric to design decision trees.
Formula:
$$IG(T, \alpha) = H(T) - H(T \mid \alpha)$$
- where $H(T)$ is the entropy of the target variable $T$, and $H(T \mid \alpha)$ is the conditional entropy of the target variable given feature $\alpha$, computed after the split.
- Therefore, the information gain is the difference between the two entropies, indicating how much information is gained by splitting the dataset based on the feature $\alpha$: $$IG = H_{before} - H_{after}$$
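Putting the pieces together, information gain can be sketched as the parent node's entropy minus the weighted entropy of its child branches (function and variable names here are illustrative):

```python
import math
from collections import Counter

def entropy_from_labels(labels):
    """Shannon entropy (base 2) computed from a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) - sum over children of w_i * H(child_i)."""
    total = sum(len(c) for c in children)
    h_after = sum((len(c) / total) * entropy_from_labels(c) for c in children)
    return entropy_from_labels(parent) - h_after

# Splitting a 50/50 node into two pure branches gains 1 bit of information.
parent = ["a", "a", "b", "b"]
print(information_gain(parent, [["a", "a"], ["b", "b"]]))  # 1.0
# A split that leaves both branches 50/50 gains nothing.
print(information_gain(parent, [["a", "b"], ["a", "b"]]))  # 0.0
```

A decision tree learner evaluates candidate features this way and splits on the one with the highest information gain.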