Entropy, Information Gain & Gini Impurity

Entropy (H)

Shannon's entropy is defined for a system with $N$ possible states as follows:

$$H = -\sum_{i=1}^{N} p_i \log_2 p_i,$$

where $p_i$ is the probability of finding the system in the $i$-th state. This is a very important concept used in physics, information theory, and other areas. Entropy can be described as the degree of chaos in the system: the higher the entropy, the less ordered the system, and vice versa.
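The definition above can be sketched in a few lines of pure Python (the function name `entropy` is illustrative, not from any particular library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H = -sum(p_i * log2(p_i)) over the class frequencies
    observed in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Two equally likely states: maximal disorder, 1 bit of entropy.
print(entropy([0, 1]))        # 1.0
# A fully ordered system (one state) has zero entropy.
print(entropy([1, 1, 1, 1]))
```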

Information Gain (IG)

The information gain (IG) for a split based on the variable $Q$ is defined as

$$IG(Q) = H_0 - \sum_{i=1}^{q} \frac{N_i}{N} H_i,$$

where
$q$ is the number of groups after the split, $N_i$ is the number of objects from the sample in which the variable $Q$ is equal to the $i$-th value, and $H_i$ is the entropy of the $i$-th group.
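A minimal sketch of this formula, reusing an entropy helper defined inline (function names here are illustrative): the gain is the parent entropy minus the size-weighted entropies of the groups produced by the split.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """IG = H_0 - sum(N_i / N * H_i), where `groups` partition `labels`."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

labels = [0, 0, 1, 1]
# A split that separates the classes perfectly recovers all the entropy:
print(information_gain(labels, [[0, 0], [1, 1]]))  # 1.0
# A split that leaves each group as mixed as the parent gains nothing:
print(information_gain([0, 1, 0, 1], [[0, 1], [0, 1]]))  # 0.0
```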

Entropy allows us to formalize partitions in a decision tree, but it is only one heuristic. Others exist:

Gini Uncertainty (Impurity):

$$G = 1 - \sum_k (p_k)^2$$

Maximizing this criterion can be interpreted as the maximization of the number of pairs of objects of the same class that are in the same subtree (not to be confused with the Gini index).

Misclassification error:

$$E = 1 - \max_k p_k$$

In practice, misclassification error is almost never used; Gini uncertainty and information gain work similarly.
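Both criteria can be sketched directly from their definitions (again, pure Python with illustrative function names):

```python
from collections import Counter

def gini(labels):
    """Gini impurity G = 1 - sum(p_k^2) over the class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def misclassification_error(labels):
    """E = 1 - max_k p_k: the error of always predicting the majority class."""
    n = len(labels)
    return 1.0 - max(Counter(labels).values()) / n

labels = [0, 0, 0, 1]
print(gini(labels))                     # 1 - (0.75^2 + 0.25^2) = 0.375
print(misclassification_error(labels))  # 1 - 0.75 = 0.25
```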

For binary classification, entropy and Gini uncertainty take the following form:

$$H = -p_+ \log_2 p_+ - p_- \log_2 p_-,$$

$$G = 1 - p_+^2 - p_-^2 = 2 p_+ (1 - p_+),$$

where $p_+$ is the probability of an object having the label + (and $p_- = 1 - p_+$).
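The binary forms above depend only on $p_+$, so they can be written as one-argument functions, which makes it easy to check that both peak at $p_+ = 0.5$ (maximal uncertainty); the names `binary_entropy` and `binary_gini` are illustrative:

```python
from math import log2

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), for p strictly in (0, 1)."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

def binary_gini(p):
    """G(p) = 1 - p^2 - (1-p)^2, which simplifies to 2p(1-p)."""
    return 2 * p * (1 - p)

# Both criteria are maximized when the classes are perfectly balanced:
print(binary_entropy(0.5))  # 1.0
print(binary_gini(0.5))     # 0.5
```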
