Data Manipulation
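The main way you manipulate data in Spark is with the map() function.
map (along with mapValues, which applies a function only to the value of each key-value pair) is one of the main workhorses of Spark. Imagine you have a tab-delimited file and you want to rearrange its columns into the order column1, column3, column2.
I’m working with the MovieLens 100K dataset, whose u.data file is tab-delimited with the columns user id, item id, rating, and timestamp.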
mycomputer:~$ head u.data
196  242  3  881250949
186  302  3  891717742
22   377  1  878887116
244  51   2  880606923
166  346  1  886397596
298  474  4  884182806
115  265  2  881171488
253  465  5  891628467
305  451  3  886324817
6    86   3  883603013
First, we load the data into Spark.
mydata = sc.textFile("../u.data")
Next, we chain a couple of map calls over our data.
mydata.map(lambda x: x.split('\t')).\
map(lambda y: (y[0], y[2], y[1]))
We are doing two things in this one statement:
- Using a map to split each line wherever it finds a tab (\t).
- Taking the results of the split and rearranging them (Python list indices start at zero, so column1 is y[0]).
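To make the two steps concrete, here is a minimal sketch in plain Python (no Spark needed) that applies them by hand to the first line of u.data. The variable names are only for illustration; inside Spark the same logic runs once per line of the file.

```python
# One raw line from u.data (fields separated by tabs).
line = "196\t242\t3\t881250949"

# Step 1: split on tabs to get a list of string fields.
fields = line.split("\t")
print(fields)      # ['196', '242', '3', '881250949']

# Step 2: pick columns 1, 3, 2 (list indices 0, 2, 1) into a tuple.
reordered = (fields[0], fields[2], fields[1])
print(reordered)   # ('196', '3', '242')
```

Under Python 3 the strings print without the u prefix that Python 2 shows for unicode strings.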
You’ll notice the “lambda x:” inside the map function. That’s an anonymous function (also called a lambda function). The “x” stands for a single row of your data; the function is called once for every row. After the colon you use “x” like any other Python object, which is why we can split it into a list and later rearrange it.
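If lambdas are unfamiliar, it may help to see one next to an ordinary named function. The sketch below uses Python's built-in map on a small list instead of a Spark RDD, and the names split_lambda and split_row are hypothetical, chosen only for this example.

```python
# A few raw lines shaped like u.data (tab-separated fields).
sample_lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
]

# Anonymous version, as used in the post.
split_lambda = lambda x: x.split("\t")

# Equivalent named version.
def split_row(x):
    return x.split("\t")

# Both are called once per element, with x bound to that element.
lambda_result = list(map(split_lambda, sample_lines))
named_result = list(map(split_row, sample_lines))
print(lambda_result == named_result)  # True
```

Spark's map accepts any function of one argument, so a named function like split_row could be passed to mydata.map in place of the lambda.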
Here’s what the data looks like after these two map functions.
(u'196', u'3', u'242'),
(u'186', u'3', u'302'),
(u'22', u'1', u'377'),
(u'244', u'2', u'51'),
(u'166', u'1', u'346'),
(u'298', u'4', u'474'),
(u'115', u'2', u'265'),
(u'253', u'5', u'465'),
(u'305', u'3', u'451'),
(u'6', u'3', u'86')
This output also shows how Spark treats data loaded with textFile: every line is read as text, so every field is still a string after the split.
So, if we want those values to be numeric, we should write our map as:
mydata.map(lambda x: x.split('\t')).\
map(lambda y: (int(y[0]), float(y[2]), int(y[1])))
We now have data that looks like this:
(196, 3.0, 242)
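As the per-row logic grows, it can be easier to read as one named function. This is a minimal sketch, checked here in plain Python on two sample lines rather than in Spark; parse_rating is a hypothetical name, not something from the post or from Spark.

```python
def parse_rating(line):
    """Split one tab-delimited line and return columns 1, 3, 2 as (int, float, int)."""
    fields = line.split("\t")
    return (int(fields[0]), float(fields[2]), int(fields[1]))

# Two raw lines shaped like u.data.
sample_lines = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
]

parsed = [parse_rating(line) for line in sample_lines]
print(parsed[0])  # (196, 3.0, 242)
```

With Spark, the same function could be passed directly as mydata.map(parse_rating), replacing the two chained lambdas with a single map.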