How to handle categorical features with spark-ml?



Categorical features are a type of feature that can take on a limited number of values, such as gender, state, or country. These features can be challenging to handle with machine learning algorithms, as they are not continuous like numerical features.

In this article, we will discuss how to handle categorical features with Spark ML. We will cover the following topics:

  • What are categorical features?
  • Why are categorical features challenging to handle?
  • How to handle categorical features with Spark ML

What are Categorical Features?

Categorical features are features that can take on a limited number of values. These features are often represented as strings or integers. Some examples of categorical features include:

  • Gender: Male or Female
  • State: California, New York, Texas
  • Country: United States, Canada, Mexico

Why are Categorical Features Challenging to Handle?

Machine learning algorithms are designed to work with numerical features. These algorithms work by finding patterns in the data that can be used to make predictions. However, categorical features can be challenging to work with because they are not continuous.

For example, the feature "gender" can only take on two values: male and female. There is no natural ordering of these values, so an algorithm cannot treat them as points on a number line, which makes it harder to find patterns in the data.

How to Handle Categorical Features with Spark ML

There are a number of ways to handle categorical features with Spark ML. One common approach is one-hot encoding, which converts a categorical feature into numerical features by creating a new binary (0/1) feature for each possible value of the categorical feature.

For example, a "gender" feature with the values male, female, and unknown can be converted into three new binary features, one per value; each row gets a 1 in the column for its value and 0 in the others. This allows machine learning algorithms to work with categorical features as if they were numerical features, as sketched below.
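In Spark ML, one-hot encoding is typically a two-step pipeline: StringIndexer turns strings into category indices, then OneHotEncoder expands each index into a 0/1 vector. A minimal sketch, assuming Spark 3.x (where OneHotEncoder is an estimator) and a small made-up DataFrame:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("male",), ("female",), ("female",), ("unknown",)],
    ["gender"],
)

# Step 1: map each string category to a numeric index.
indexed = StringIndexer(inputCol="gender", outputCol="gender_index") \
    .fit(df).transform(df)

# Step 2: expand the index into a sparse 0/1 vector. Spark drops the
# last category by default (dropLast=True).
encoded = OneHotEncoder(inputCols=["gender_index"], outputCols=["gender_vec"]) \
    .fit(indexed).transform(indexed)

encoded.show()

With dropLast=True, a three-value column becomes a length-2 vector; pass dropLast=False if you want one slot per category.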

Another approach to handling categorical features is to use a technique called label encoding. Label encoding converts categorical features into numerical features by assigning a unique integer to each possible value of the categorical feature.

For example, the feature "gender" can be converted into a single numerical feature whose value is 0 for male and 1 for female. This lets machine learning algorithms consume the feature, though it imposes an arbitrary ordering on its values; see the sketch below.
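In Spark ML, label encoding is what StringIndexer does on its own. A short sketch, reusing the hypothetical df with a gender column from the previous example:

from pyspark.ml.feature import StringIndexer

# StringIndexer assigns indices by descending frequency by default:
# the most frequent category becomes 0.0, the next 1.0, and so on.
StringIndexer(inputCol="gender", outputCol="gender_index") \
    .fit(df).transform(df).show()

Note that StringIndexer also attaches nominal metadata to its output column, which Spark's tree-based estimators can use so the indexed values are not misread as a true ordering.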

The best approach depends on the specific machine learning algorithm you are using. Algorithms that interpret numeric inputs as ordered quantities (for example, linear and distance-based models) generally need one-hot encoding, while tree-based algorithms can often work directly with label-encoded values.

Conclusion

Categorical features can be challenging to handle with machine learning algorithms. However, there are a number of techniques that can be used to handle these features. The best approach to handling categorical features will depend on the specific machine learning algorithm that you are using.

I hope this article has been helpful. Please let me know if you have any questions.

Entropy, Information Gain & Gini Impurity

Entropy (H)

Shannon's entropy is defined for a system with $N$ possible states as follows:

$H = -\sum_{i=1}^{N} p_i \log_2 p_i$

where $p_i$ is the probability of finding the system in the $i$-th state. This is a very important concept used in physics, information theory, and other areas. Entropy can be described as the degree of chaos in the system: the higher the entropy, the less ordered the system, and vice versa.
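As a quick illustration (the values are my own, not from the original text), here is the formula evaluated in plain Python for a fair and a heavily biased two-state system:

import math

# Entropy of a two-state system: a fair coin vs. a heavily biased one.
H = lambda ps: -sum(p * math.log2(p) for p in ps if p > 0)

print(H([0.5, 0.5]))  # 1.0 bit: maximal disorder for two states
print(H([0.9, 0.1]))  # ~0.469 bits: the biased system is more ordered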

Information Gain (IG)

The information gain (IG) for a split based on the variable $Q$ is defined as

$IG(Q) = H_0 - \sum_{i=1}^{q} \frac{N_i}{N} H_i$

where $q$ is the number of groups after the split, $N_i$ is the number of objects from the sample in which variable $Q$ is equal to the $i$-th value, $N$ is the total number of objects, and $H_0$ and $H_i$ are the entropies of the whole sample and of the $i$-th group, respectively.
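To make the formula concrete, here is a small plain-Python sketch on a made-up binary sample (the split and labels are illustrative, not from the original text):

import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, children):
    # IG = H_0 - sum(N_i / N * H_i) over the groups produced by the split
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = [1] * 10 + [0] * 10    # 50/50 mix: entropy = 1.0
left = [1] * 8 + [0] * 2        # the split produces two purer groups
right = [1] * 2 + [0] * 8
print(information_gain(parent, [left, right]))  # ~0.278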

Entropy allows us to formalize the choice of partitions in a decision tree, but it is only one heuristic; others exist.

Gini Uncertainty (Impurity):

$G = 1 - \sum_k (p_k)^2$

Maximizing this criterion can be interpreted as the maximization of the number of pairs of objects of the same class that are in the same subtree (not to be confused with the Gini index).

Misclassification error:

$E = 1 - \max_k p_k$

In practice, misclassification error is almost never used; Gini uncertainty and information gain behave similarly.

For binary classification, entropy and Gini uncertainty take the following form:

$H = -p_+ \log_2 p_+ - (1 - p_+) \log_2 (1 - p_+)$

$G = 1 - p_+^2 - (1 - p_+)^2 = 2 p_+ (1 - p_+)$

where $p_+$ is the probability of an object having the label $+$.
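And a plain-Python sketch of the binary-case criteria side by side (the probabilities are chosen for illustration):

import math

def entropy_binary(p_plus):
    # 0 * log(0) is treated as 0 for pure nodes
    terms = [p_plus, 1 - p_plus]
    return -sum(p * math.log2(p) for p in terms if p > 0)

def gini_binary(p_plus):
    return 2 * p_plus * (1 - p_plus)

def misclassification(p_plus):
    return 1 - max(p_plus, 1 - p_plus)

for p in (0.5, 0.9, 1.0):
    print(p, entropy_binary(p), gini_binary(p), misclassification(p))
# 0.5 -> 1.0, 0.5, 0.5   (maximal impurity)
# 0.9 -> ~0.469, 0.18, 0.1
# 1.0 -> 0.0, 0.0, 0.0   (pure node)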

Pyspark - Data Manipulation

Data Manipulation

The main way you manipulate data is with the map() function.

map() (and mapValues()) is one of the main workhorses of Spark. Imagine you had a file that was tab delimited and you wanted to rearrange your data to be column1, column3, column2.

I’m working with the MovieLens 100K dataset:
mycomputer:~$ head u.data
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013

First, we load the data into Spark.
mydata = sc.textFile("../u.data")

Next, we apply a couple of map functions to our data.
mydata.map(lambda x: x.split('\t')).\
    map(lambda y: (y[0], y[2], y[1]))
We are doing two things in this one line.
  • Using a map to split the data wherever it finds a tab (\t).
  • Taking the results of the split and rearranging them (Python indexes lists and columns starting at zero).
You’ll notice the “lambda x:” inside the map function. That’s called an anonymous function (or a lambda function). The “x” is each row of your data; you use “x” after the colon like any other Python object – which is why we can split it into a list and later rearrange it.

Here’s what the data looks like after these two map functions.
(u'196', u'3', u'242'), 
(u'186', u'3', u'302'), 
(u'22', u'1', u'377'), 
(u'244', u'2', u'51'), 
(u'166', u'1', u'346'), 
(u'298', u'4', u'474'), 
(u'115', u'2', u'265'), 
(u'253', u'5', u'465'), 
(u'305', u'3', u'451'), 
(u'6', u'3', u'86')

Here’s another thing to know about how Spark treats its data: it assumes everything is text (especially data coming from a textFile() load).
So, if we wanted to make those values numeric, we should have written our map as…
mydata.map(lambda x: x.split('\t')).\
    map(lambda y: (int(y[0]), float(y[2]), int(y[1])))

We now have data that looks like (196, 3.0, 242).
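Putting the walkthrough together, here is a minimal end-to-end sketch (assuming a local SparkContext and the same ../u.data path as above; map transformations are lazy, so nothing actually runs until an action such as take() is called):

from pyspark import SparkContext

sc = SparkContext("local", "movielens")

# Load, split on tabs, and rearrange/convert the columns in one chain.
parsed = sc.textFile("../u.data") \
    .map(lambda x: x.split('\t')) \
    .map(lambda y: (int(y[0]), float(y[2]), int(y[1])))

print(parsed.take(3))  # [(196, 3.0, 242), (186, 3.0, 302), (22, 1.0, 377)]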

How to get mean and standard deviation in Pyspark 2.0+

How to get mean and standard deviation in Pyspark Dataframe Columns:

from pyspark.sql.functions import mean as _mean, stddev as _stddev, col

df_stats = df.select(
    _mean(col('columnName')).alias('mean'),
    _stddev(col('columnName')).alias('std')
).collect()

mean = df_stats[0]['mean']
std = df_stats[0]['std']

Note that there are three different standard deviation functions (stddev, stddev_samp, and stddev_pop). From the docs, the one I used (stddev) returns the following:
Aggregate function: returns the unbiased sample standard deviation of the expression in a group
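If you need a different flavor, you can call the variants side by side (a sketch assuming the same df and columnName as above):

from pyspark.sql.functions import stddev, stddev_samp, stddev_pop

# stddev is an alias for stddev_samp (sample std, N-1 in the denominator);
# stddev_pop divides by N instead.
df.select(
    stddev('columnName').alias('stddev'),
    stddev_samp('columnName').alias('stddev_samp'),
    stddev_pop('columnName').alias('stddev_pop'),
).show()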
You could use the describe() method as well:
df.describe().show()
Refer to this link for more info: pyspark.sql.functions
UPDATE: This is how you can work through the nested data.
Use explode to extract the values into separate rows, then call mean and stddev as shown above.
Here's an MWE (minimal working example):
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import explode, col, udf, mean as _mean, stddev as _stddev

# mock up sample dataframe
df = sqlCtx.createDataFrame(
    [(680, [[691,1], [692,5]]), (685, [[691,2], [692,2]]), (684, [[691,1], [692,3]])],
    ["product_PK", "products"]
)

# udf to get the "score" value - returns the item at index 1
get_score = udf(lambda x: x[1], IntegerType())

# explode column and get stats
df_stats = df.withColumn('exploded', explode(col('products')))\
    .withColumn('score', get_score(col('exploded')))\
    .select(
        _mean(col('score')).alias('mean'),
        _stddev(col('score')).alias('std')
    )\
    .collect()

mean = df_stats[0]['mean']
std = df_stats[0]['std']

print([mean, std])
Which outputs:
[2.3333333333333335, 1.505545305418162]
You can verify that these values are correct using numpy:
import numpy as np

vals = [1, 5, 2, 2, 1, 3]
print([np.mean(vals), np.std(vals, ddof=1)])
Explanation: Your "products" column is a list of lists. Calling explode will make a new row for each element of the outer list. Then grab the "score" value from each of the exploded rows, which you have defined as the second element in a 2-element list. Finally, call the aggregate functions on this new column.
