- Let's start by loading some data into the Spark context `sc` (a predefined object in the Spark shell that serves as the entry point for creating RDDs)
- Then do a map & reduce. This is pretty simple with Scala's functional notation: we map each page `s` to its length, then reduce with a lambda function that sums two elements. That's it: we have the total length of all the pages. Spark allocates a suitable number of mappers and reducers on your behalf; however, this can also be tuned manually if you prefer.
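The two steps above can be sketched as follows in the Spark shell (the file name `pages.txt` is a hypothetical example, assuming one page per line):

```scala
// sc (SparkContext) is predefined in the Spark shell.
// Load the data: each line of the (hypothetical) file becomes one element of the RDD.
val pages = sc.textFile("pages.txt")

// Map each page to its length, then reduce with a lambda that sums two elements.
val totalLength = pages.map(s => s.length).reduce((a, b) => a + b)
```

Note that `textFile` is lazy: no data is actually read until the `reduce` action triggers the computation.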
- If we instead use `map(s => 1)`, we can compute the total number of pages
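As a sketch (again assuming a hypothetical `pages.txt` loaded in the shell), mapping every page to `1` and summing counts the pages:

```scala
val pages = sc.textFile("pages.txt")

// Each page contributes 1; the reduce sums them up, giving the page count.
val numPages = pages.map(s => 1).reduce((a, b) => a + b)
// Equivalent to the built-in action: pages.count()
```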
- If we want to sort the pages alphabetically, we can use `sortByKey()` (which operates on key-value RDDs) and `collect()` the results on the master as an `Array`
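A minimal sketch of the sort, assuming the same hypothetical input file: since `sortByKey()` is defined on key-value RDDs, we first pair each page with a dummy value.

```scala
val pages = sc.textFile("pages.txt")

// Pair each page with a dummy value so sortByKey() can use the page text as the key,
// then collect the sorted results back to the driver as an Array.
val sorted: Array[(String, Int)] = pages.map(p => (p, 1)).sortByKey().collect()
```

Be careful with `collect()` on large datasets: it pulls the entire RDD into the driver's memory.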
As you can see, this is much easier than plain vanilla Hadoop. We will play with other transformations in the next lessons.