- Let's start by loading some data into the Spark context `sc` (a predefined object in the Spark shell that serves as the entry point for creating RDDs)
- Then do a map & reduce. This is pretty simple with Scala's functional notation: we map each page `s` to its length, then reduce with a lambda function that sums two elements. That's it: we have the total length of all the pages. Spark allocates a suitable number of mappers and reducers on your behalf; however, this can also be tuned manually if you prefer.
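The two steps above can be sketched as follows in the Spark shell (the file name `pages.txt` is a hypothetical example, assuming one page per line):

```scala
// sc (SparkContext) is predefined in the Spark shell.
// Load the data: each line of the (hypothetical) file becomes one element of the RDD.
val pages = sc.textFile("pages.txt")

// Map each page to its length, then reduce with a lambda that sums two elements.
val totalLength = pages.map(s => s.length).reduce((a, b) => a + b)
```

Note that `textFile` is lazy: no data is actually read until the `reduce` action triggers the computation.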
- If we instead use `map(s => 1)`, we can compute the total number of pages
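As a sketch (again assuming a hypothetical `pages.txt` loaded in the shell), mapping every page to `1` and summing counts the pages:

```scala
val pages = sc.textFile("pages.txt")

// Each page contributes 1; the reduce sums them up, giving the page count.
val numPages = pages.map(s => 1).reduce((a, b) => a + b)
// Equivalent to the built-in action: pages.count()
```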
- If we want to sort the pages alphabetically, we can use `sortByKey()` (which operates on key-value RDDs) and `collect()` the results on the master as an `Array`
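A minimal sketch of the sort, assuming the same hypothetical input file: since `sortByKey()` is defined on key-value RDDs, we first pair each page with a dummy value.

```scala
val pages = sc.textFile("pages.txt")

// Pair each page with a dummy value so sortByKey() can use the page text as the key,
// then collect the sorted results back to the driver as an Array.
val sorted: Array[(String, Int)] = pages.map(p => (p, 1)).sortByKey().collect()
```

Be careful with `collect()` on large datasets: it pulls the entire RDD into the driver's memory.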
As you can see, this is much easier than plain vanilla Hadoop. We will play with other transformations in the next lessons.