Adventures in Software Testing and Stock Market: March 2019

Entropy, Information Gain & Gini Impurity

Entropy (H)

Shannon's entropy is defined for a system with N possible states as follows:

S = - \sum i = 1 N p i log 2 p i,

where

p_{i}

is the probability of finding the system in the

i

-th state. This is a very important concept used in physics, information theory, and other areas. Entropy can be described as the degree of chaos in the system. The higher the entropy, the less ordered the system and vice versa.

Information Gain (IG)

The information gain (IG) for a split based on the variable

Q

is defined as

I G (Q) = S O - \sum i = 1 q N i N S i,

where

q

q

is the number of groups after the split,

N_{i}

is number of objects from the sample in which variable

Q

is equal to the

i

-th value.

Entropy allows us to formalize partitions in a decision tree. But this is only one heuristic. There exists others

Gini Uncertainty (Impurity): $G = 1 - \sum_{k} (p_{k})^{2}$

Maximizing this criterion can be interpreted as the maximization of the number of pairs of objects of the same class that are in the same subtree (not to be confused with the Gini index).

Misclassification error: $E = 1 - max_{k} p_{k}$

$E = 1 - max_{k} p_{k}$

In practice, misclassification error is almost never used, and Gini uncertainty and information gain work similarly

For binary classification, entropy and Gini uncertainty take the following form:
$S = - p_{+} \log_{2} p_{+} - p_{-} \log_{2} p_{-} = - p_{+} \log_{2} p_{+} - (1 - p_{+}) \log_{2} (1 - p_{+});$

G = 1 - p_{+}^{2} - p_{-}^{2} = 1 - p_{+}^{2} - (1 - p_{+})^{2} = 2 p_{+} (1 - p_{+}) .

where (

p_{+}

is the probability of an object having a label +).

$E = 1 - max_{k} p_{k}$ $E = 1 - max_{k} p_{k}$ $E = 1 - max_{k} p_{k}$ $E = 1 - max_{k} p_{k}$

Pyspark - Data Manipulation

Data Manipulation

The main way you manipulate data is using the the map() function.

The map (and mapValues) is one of the main workhorses of Spark. Imagine you had a file that was tab delimited and you wanted to rearrange your data to be column1, column3, column2.

I’m working with the MovieLens 100K dataset

mycomputer:~$ head u.data
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013

Now we have to first load the data into spark.

mydata = sc.textFile("../u.data")

Next we have to map a couple functions to our data.

mydata.map(lambda x: x.split('\t')).\
    map(lambda y: (y[0], y[2], y[1]))

We are doing two things in this one line.

Using a map to split the data wherever it finds a tab (\t).
Taking the results of the split and rearranging the results (Python starts its lists / column with zero).

You’ll notice the “lambda x:” inside of the map function. That’s called an anonymous function (or a lambda function). The “x” part is really every row of your data. You use “x” after the colon like any other python object – which is why we can split it into a list and later rearrange it.

Here’s what the data looks like after these two map functions.

(u'196', u'3', u'242'), 
(u'186', u'3', u'302'), 
(u'22', u'1', u'377'), 
(u'244', u'2', u'51'), 
(u'166', u'1', u'346'), 
(u'298', u'4', u'474'), 
(u'115', u'2', u'265'), 
(u'253', u'5', u'465'), 
(u'305', u'3', u'451'), 
(u'6', u'3', u'86')

Here’s another example of how Spark treats its data. It assumes it’s text (especially coming from a textFile load).

So, if we wanted to make those values numeric, we should have written our map as…

mydata.map(lambda x: x.split('\t')).\
    map(lambda y: (int(y[0]), float(y[2]), int(y[1])))

We now have data that looks like (196, 3.0, 242)

How to get mean and standard deviation in Pyspark 2.0+

How to get mean and standard deviation in Pyspark Dataframe Columns:

from pyspark.sql.functions import mean as _mean, stddev as _stddev, col

df_stats = df.select(
    _mean(col('columnName')).alias('mean'),
    _stddev(col('columnName')).alias('std')
).collect()

mean = df_stats[0]['mean']
std = df_stats[0]['std']

Note that there are three different standard deviation functions. From the docs the one I used (stddev) returns the following:

Aggregate function: returns the unbiased sample standard deviation of the expression in a group

You could use the describe() method as well:

df.describe().show()

Refer to this link for more info: pyspark.sql.functions

UPDATE: This is how you can work through the nested data.

Use explode to extract the values into separate rows, then call mean and stddev as shown above.

Here's a MWE:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import explode, col, udf, mean as _mean, stddev as _stddev

# mock up sample dataframe
df = sqlCtx.createDataFrame(
    [(680, [[691,1], [692,5]]), (685, [[691,2], [692,2]]), (684, [[691,1], [692,3]])],
    ["product_PK", "products"]
)

# udf to get the "score" value - returns the item at index 1
get_score = udf(lambda x: x[1], IntegerType())

# explode column and get stats
df_stats = df.withColumn('exploded', explode(col('products')))\
    .withColumn('score', get_score(col('exploded')))\
    .select(
        _mean(col('score')).alias('mean'),
        _stddev(col('score')).alias('std')
    )\
    .collect()

mean = df_stats[0]['mean']
std = df_stats[0]['std']

print([mean, std])

Which outputs:

[2.3333333333333335, 1.505545305418162]

You can verify that these values are correct using numpy:

vals = [1,5,2,2,1,3]
print([np.mean(vals), np.std(vals, ddof=1)])

Explanation: Your "products" column is a list of lists. Calling explode will make a new row for each element of the outer list. Then grab the "score" value from each of the exploded rows, which you have defined as the second element in a 2-element list. Finally, call the aggregate functions on this new column.