Understanding Entropy and Information Gain
tags: #ML/supervised/classification/trees
What is entropy?
Entropy is a measure of the disorder in a system.
In information theory, it quantifies how much information is still available in the system to be extracted.
- In the context of classification, where you place the decision boundary corresponds to the amount of information still available to be extracted; i.e., entropy is used as a measure of the impurity or randomness of a set of data.
- This gives us an estimate of how good the decision boundary is and is a commonly used measure in decision trees and random forests.
- When the entropy is zero, no further information can be extracted and there is NO misclassification, i.e., all of the information has been extracted from the data.
How do we measure entropy?
Entropy can be computed as the negative sum, over all possible class outcomes, of the probability of each outcome multiplied by the logarithm (base 2) of that probability:
$$H = -\sum_{i} p_i \log_2 p_i$$
Since the probability of any event is always between 0 and 1, the logarithm of the probability will always be negative. Therefore, the negative sign is used to ensure that the entropy is a positive value.
Example:
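As a minimal sketch of this computation (the `entropy` helper below is illustrative, not from the source):

```python
import math

def entropy(probabilities):
    """Shannon entropy (base 2) of a discrete class distribution.

    Zero-probability classes are skipped, since lim p->0 of p*log2(p) is 0.
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# A pure node (a single class) has zero entropy: nothing left to extract.
print(entropy([1.0]))        # 0.0
# A 50/50 two-class split is maximally impure: 1 bit of entropy.
print(entropy([0.5, 0.5]))   # 1.0
```

Note that the probabilities here are the class proportions within a node, so they must sum to 1.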
To measure the entropy of the system after a split, we compute the weighted entropy:
$$H_{split} = \sum_{i} w_i H_i$$
where,
$H_i$ is the entropy of the $i$-th group and $w_i$ is the proportion of data points that belong to the $i$-th group.
See example: INF2179 Entropy and Decision Tree Lecture
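A short sketch of the weighted entropy after a split, assuming each branch of the split is given as a list of class labels (the helper names are my own, not from the lecture):

```python
import math
from collections import Counter

def entropy(probabilities):
    """Shannon entropy (base 2) of a discrete distribution."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def weighted_entropy(groups):
    """Weighted entropy of a split: sum over groups of w_i * H_i,
    where w_i is the group's share of the data and H_i its entropy."""
    total = sum(len(g) for g in groups)
    h = 0.0
    for g in groups:
        counts = Counter(g)
        probs = [c / len(g) for c in counts.values()]
        h += (len(g) / total) * entropy(probs)
    return h

# A perfect split: each branch is pure, so the weighted entropy is 0.
print(weighted_entropy([["a", "a"], ["b", "b"]]))  # 0.0
```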
Feature importance: What is information gain?
Information gain refers to the change in information BEFORE and AFTER the split.
This measures the reduction in entropy (or increase in information) that results from splitting the dataset based on a given feature.
This quantifies how much a given feature contributes to the overall classification accuracy of a decision tree or other machine learning model.
A higher information gain indicates that a feature is more informative, or in other words, more useful for predicting the target variable.
Example (illustrative values):
- Suppose the entropy before a split is $H_{before} = 3$ and the entropy after the split is $H_{after} = 1$; then the IG is 2.
- This is the difference in entropy before and after the split.
This is a commonly used metric to design decision trees.
Formula:
$$IG(T, \alpha) = H(T) - H(T \mid \alpha)$$
- where $H(T)$ is the entropy of the target variable $T$, and $H(T \mid \alpha)$ is the conditional entropy of the target variable given feature $\alpha$, computed after the split.
- Therefore, the information gain is the difference between the two entropies, indicating how much information is gained by splitting the dataset based on the feature $\alpha$: $$IG = H_{before} - H_{after}$$
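Putting the pieces together, information gain can be sketched as the parent node's entropy minus the weighted entropy of its child branches (function and variable names here are illustrative):

```python
import math
from collections import Counter

def entropy_from_labels(labels):
    """Shannon entropy (base 2) computed from a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = H(parent) - sum over children of w_i * H(child_i)."""
    total = sum(len(c) for c in children)
    h_after = sum((len(c) / total) * entropy_from_labels(c) for c in children)
    return entropy_from_labels(parent) - h_after

# Splitting a 50/50 node into two pure branches gains 1 bit of information.
parent = ["a", "a", "b", "b"]
print(information_gain(parent, [["a", "a"], ["b", "b"]]))  # 1.0
# A split that leaves both branches 50/50 gains nothing.
print(information_gain(parent, [["a", "b"], ["a", "b"]]))  # 0.0
```

A decision tree learner evaluates candidate features this way and splits on the one with the highest information gain.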