Adventures in Software Testing and Stock Market: Pyspark

Data Manipulation

The main way you manipulate data is with the map() function.

map (along with mapValues) is one of the main workhorses of Spark. Imagine you have a tab-delimited file and want to rearrange its columns into the order column1, column3, column2.

I’m working with the MovieLens 100K dataset, whose u.data file is tab-delimited (user id, item id, rating, timestamp):

mycomputer:~$ head u.data
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013

First, we have to load the data into Spark.

# sc is the SparkContext that the pyspark shell creates for you
mydata = sc.textFile("../u.data")

Next, we map a couple of functions over our data.

mydata.map(lambda x: x.split('\t')).\
    map(lambda y: (y[0], y[2], y[1]))

We are doing two things in this one statement.

First, using a map to split each line wherever it finds a tab (\t).
Second, taking the results of the split and rearranging them (Python list indexes start at zero, so y[0] is the first column).
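The two steps can be tried out without Spark by applying the same lambdas to a few lines with Python's built-in map. This is only a local sketch: sample_lines is a hypothetical stand-in for the RDD (three rows copied from u.data, written with explicit tabs), and Spark's own map runs lazily across partitions rather than over a list.

```python
# Hypothetical stand-in for the RDD: three tab-delimited rows from u.data.
sample_lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
]

# Step 1: split each line on tabs, giving one list of fields per row.
split_rows = list(map(lambda x: x.split('\t'), sample_lines))

# Step 2: reorder the fields to column1, column3, column2.
reordered = list(map(lambda y: (y[0], y[2], y[1]), split_rows))

print(split_rows[0])  # ['196', '242', '3', '881250949']
print(reordered[0])   # ('196', '3', '242')
```

The intermediate split_rows list shows why the second lambda can index y[0], y[2] and y[1]: by then each row is already a list of fields.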

You’ll notice the “lambda x:” inside the map function. That’s called an anonymous function (or a lambda function). The “x” stands for one row of your data at a time; Spark calls the function once per row. You use “x” after the colon like any other Python object, which is why we can split it into a list and later rearrange it.
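Since a lambda is just a function without a name, the same work can be done with ordinary def functions. The sketch below uses the hypothetical names split_on_tab and reorder, and checks on a single sample line that they behave like the two lambdas; with a live RDD they could be passed to map in place of the lambdas.

```python
# Named equivalents of the two lambdas (the names are illustrative only).
def split_on_tab(x):
    # x is one line of the file, e.g. "196\t242\t3\t881250949"
    return x.split('\t')

def reorder(y):
    # y is the list of fields produced by split_on_tab
    return (y[0], y[2], y[1])

line = "196\t242\t3\t881250949"

# Apply the lambda versions and the def versions to the same row.
with_lambdas = (lambda y: (y[0], y[2], y[1]))((lambda x: x.split('\t'))(line))
with_defs = reorder(split_on_tab(line))

print(with_defs)                  # ('196', '3', '242')
print(with_lambdas == with_defs)  # True

# With an RDD this would read: mydata.map(split_on_tab).map(reorder)
```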

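Here’s what the first ten rows look like after these two map functions. map itself is lazy, so nothing is computed until an action such as take(10) asks for results. The u prefix marks a Python 2 unicode string.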

(u'196', u'3', u'242'), 
(u'186', u'3', u'302'), 
(u'22', u'1', u'377'), 
(u'244', u'2', u'51'), 
(u'166', u'1', u'346'), 
(u'298', u'4', u'474'), 
(u'115', u'2', u'265'), 
(u'253', u'5', u'465'), 
(u'305', u'3', u'451'), 
(u'6', u'3', u'86')

This output also shows how Spark treats data loaded with textFile: every line is a string, so every field you split out of it is a string too.

So, if we wanted to make those values numeric, we should have written our map as…

mydata.map(lambda x: x.split('\t')).\
    map(lambda y: (int(y[0]), float(y[2]), int(y[1])))

We now have data that looks like (196, 3.0, 242).
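The split and the type conversion can also be folded into one function, so the RDD needs only a single map. A minimal sketch, assuming every line has at least three tab-separated numeric fields; parse_rating is a hypothetical name, not part of Spark.

```python
def parse_rating(line):
    # Split one tab-delimited line into its fields.
    fields = line.split('\t')
    # Convert and reorder to column1, column3, column2, as in the maps above.
    return (int(fields[0]), float(fields[2]), int(fields[1]))

print(parse_rating("196\t242\t3\t881250949"))  # (196, 3.0, 242)

# With an RDD this would read: mydata.map(parse_rating)
```

A named function like this is also easier to test on a single line before running it over the whole file.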
