Entropy, Information Gain & Gini Impurity

Entropy (H)

Shannon's entropy is defined for a system with $N$ possible states as follows:

$$H = -\sum_{i=1}^{N} p_i \log_2 p_i,$$

where $p_i$ is the probability of finding the system in the $i$-th state. This is a very important concept used in physics, information theory, and other areas. Entropy can be described as the degree of chaos in the system: the higher the entropy, the less ordered the system, and vice versa.
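The definition above can be sketched in a few lines of pure Python (the function name `entropy` is illustrative, not from any particular library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H = -sum(p_i * log2(p_i)) over the class frequencies
    observed in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Two equally likely states: maximal disorder, 1 bit of entropy.
print(entropy([0, 1]))        # 1.0
# A fully ordered system (one state) has zero entropy.
print(entropy([1, 1, 1, 1]))
```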

Information Gain (IG)

The information gain (IG) for a split based on the variable $Q$ is defined as

$$IG(Q) = H_0 - \sum_{i=1}^{q} \frac{N_i}{N} H_i,$$

where
$q$ is the number of groups after the split, $N_i$ is the number of objects from the sample in which the variable $Q$ is equal to the $i$-th value, and $H_i$ is the entropy of the $i$-th group.
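A minimal sketch of this formula, reusing an entropy helper defined inline (function names here are illustrative): the gain is the parent entropy minus the size-weighted entropies of the groups produced by the split.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """IG = H_0 - sum(N_i / N * H_i), where `groups` partition `labels`."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

labels = [0, 0, 1, 1]
# A split that separates the classes perfectly recovers all the entropy:
print(information_gain(labels, [[0, 0], [1, 1]]))  # 1.0
# A split that leaves each group as mixed as the parent gains nothing:
print(information_gain([0, 1, 0, 1], [[0, 1], [0, 1]]))  # 0.0
```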

Entropy allows us to formalize partitions in a decision tree, but it is only one heuristic. Others exist:

Gini Uncertainty (Impurity):

$$G = 1 - \sum_k (p_k)^2$$

Maximizing this criterion can be interpreted as the maximization of the number of pairs of objects of the same class that are in the same subtree (not to be confused with the Gini index).

Misclassification error:

$$E = 1 - \max_k p_k$$

In practice, misclassification error is almost never used; Gini uncertainty and information gain work similarly.
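Both criteria can be sketched directly from their definitions (again, pure Python with illustrative function names):

```python
from collections import Counter

def gini(labels):
    """Gini impurity G = 1 - sum(p_k^2) over the class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def misclassification_error(labels):
    """E = 1 - max_k p_k: the error of always predicting the majority class."""
    n = len(labels)
    return 1.0 - max(Counter(labels).values()) / n

labels = [0, 0, 0, 1]
print(gini(labels))                     # 1 - (0.75^2 + 0.25^2) = 0.375
print(misclassification_error(labels))  # 1 - 0.75 = 0.25
```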

For binary classification, entropy and Gini uncertainty take the following form:

$$H = -p_+ \log_2 p_+ - p_- \log_2 p_-,$$

$$G = 1 - p_+^2 - p_-^2 = 2 p_+ (1 - p_+),$$

where $p_+$ is the probability of an object having the label + (and $p_- = 1 - p_+$).
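The binary forms above depend only on $p_+$, so they can be written as one-argument functions, which makes it easy to check that both peak at $p_+ = 0.5$ (maximal uncertainty); the names `binary_entropy` and `binary_gini` are illustrative:

```python
from math import log2

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), for p strictly in (0, 1)."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def binary_gini(p):
    """G(p) = 1 - p^2 - (1-p)^2, which simplifies to 2p(1-p)."""
    return 2 * p * (1 - p)

# Both criteria are maximized when the classes are perfectly balanced:
print(binary_entropy(0.5))  # 1.0
print(binary_gini(0.5))     # 0.5
```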
