Understanding Entropy and Information Gain

tags: #ML/supervised/classification/trees

What is entropy?

Entropy is a measure of the disorder in a system.

In information theory, it quantifies how much information is still available in the system to be extracted.

*(Excalidraw figure: three candidate decision boundaries with entropies 0.00, 1.96, and 1.18. The boundary that produces two misclassifications still leaves a high amount of information available in the system.)*

How do we measure entropy?

Entropy is computed as the negative sum, over all possible classes, of the probability of each class multiplied by the base-2 logarithm of that probability:

$$H = -\sum_{i=1}^{n} p(x_i)\log_2 p(x_i)$$

Note the negative sign.

Since the probability of any event is always between 0 and 1, the logarithm of the probability is always negative (or zero). The leading negative sign therefore ensures that entropy is non-negative.

Example:

With a total of 10 data points split across three classes of 2, 3, and 5:

$$H = -\left(\frac{2}{10}\log_2\frac{2}{10} + \frac{3}{10}\log_2\frac{3}{10} + \frac{5}{10}\log_2\frac{5}{10}\right) = 1.485$$
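The entropy computation can be checked in code. A minimal sketch (the class counts 2, 3, and 5 follow the example above):

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution, given raw class counts."""
    total = sum(counts)
    # Skip empty classes: 0 * log2(0) is taken to be 0 by convention.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Three classes with 2, 3, and 5 of the 10 data points:
print(round(entropy([2, 3, 5]), 3))  # 1.485
```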
Weighted Entropy: Measuring Entropy AFTER a Split

To measure the entropy of the system after a split, we compute the weighted entropy:

$$H_{\text{weighted}} = \sum_{i=1}^{k} H_i \, P(\text{group}_i)$$

where,

  • $H_i$ is the entropy of the $i$-th group
  • $P(\text{group}_i)$ is the proportion of data points that belong to the $i$-th group
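The weighted-entropy formula can be sketched as follows; the `entropy` helper is repeated so the snippet is self-contained, and the example split is hypothetical (one mixed group of counts [2, 3] and one pure group of [5]):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_entropy(groups):
    """Entropy after a split: each group's entropy weighted by its share of the data."""
    n = sum(sum(g) for g in groups)
    return sum((sum(g) / n) * entropy(g) for g in groups)

# Hypothetical split of the 10 points: a mixed group [2, 3] and a pure group [5].
print(round(weighted_entropy([[2, 3], [5]]), 3))  # 0.485
```

The pure group contributes zero entropy, so only the mixed half of the data adds disorder.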

See example: INF2179 Entropy and Decision Tree Lecture


Feature Importance: What is information gain?

Information gain refers to the change in information BEFORE and AFTER the split.

This measures the reduction in entropy (or increase in information) that results from splitting the dataset based on a given feature.

This quantifies how much a given feature contributes to the overall classification accuracy of a decision tree or other machine learning model.

A higher information gain indicates that a feature is more informative, or in other words, more useful for predicting the target variable.

This is a commonly used metric for designing decision trees.

Formula:

$$IG(T, \alpha) = H(T) - H(T \mid \alpha)$$

where $T$ is the dataset before the split and $\alpha$ is the feature used for the split, so $H(T \mid \alpha)$ is the weighted entropy after splitting on $\alpha$.
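Putting the pieces together, information gain is just entropy before the split minus weighted entropy after. A self-contained sketch (the split shown is hypothetical, reusing the counts from the earlier example):

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_entropy(groups):
    n = sum(sum(g) for g in groups)
    return sum((sum(g) / n) * entropy(g) for g in groups)

def information_gain(parent_counts, groups):
    """IG(T, a) = H(T) - H(T | a): entropy before minus weighted entropy after."""
    return entropy(parent_counts) - weighted_entropy(groups)

# Hypothetical split that perfectly isolates the class of 5 points:
print(round(information_gain([2, 3, 5], [[2, 3], [5]]), 3))  # 1.0
```

A split that changes nothing (all points in one group) yields zero gain, while a split that separates a class cleanly yields a large gain, which is why decision trees greedily pick the feature with the highest IG.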