Monday, September 15, 2014

Hands on big data - Crash Course on Spark - Optimizing Transformations - lesson 4

1. Take a list of names
2. Create buckets by initials
3. Group by keys
4. Map each initial to the set of names, and take its size
5. Collect the results into the master as an array
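The Spark code itself is not shown here, so as a rough sketch of the same five steps, here is a plain-Python version (the name list is made up for illustration; the grouping loop plays the role of Spark's groupByKey):

```python
from collections import defaultdict

names = ["alice", "bob", "anna", "bruce", "carol", "anna"]

# 1-2. Bucket each name by its initial: (initial, name) pairs
pairs = [(name[0], name) for name in names]

# 3. Group by key, as Spark's groupByKey would
buckets = defaultdict(list)
for initial, name in pairs:
    buckets[initial].append(name)

# 4. Map each initial to the number of names (duplicates included)
counts = {initial: len(ns) for initial, ns in buckets.items()}

# 5. "Collect" the results on the master
print(sorted(counts.items()))  # -> [('a', 3), ('b', 2), ('c', 1)]
```

Note that "anna" appears twice and is counted twice; the optimizations below address exactly this.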

This code is not particularly optimized. For instance, we might want to force the number of partitions explicitly; by default Spark creates one partition per 64 MB chunk of input (the HDFS block size).
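In Spark you would force the partition count through the API (for instance the optional partition-count argument of `sc.parallelize` or `textFile`). As a rough illustration of what fixing the number of partitions means, here is a plain-Python chunking sketch (the helper name `partition` is made up):

```python
def partition(data, num_partitions):
    """Split data into num_partitions roughly equal chunks,
    similar in spirit to forcing the slice count in sc.parallelize."""
    size = len(data)
    return [data[i * size // num_partitions:(i + 1) * size // num_partitions]
            for i in range(num_partitions)]

chunks = partition(list(range(10)), 4)
print([len(c) for c in chunks])  # -> [2, 3, 2, 3]
```

More partitions mean more parallelism, but each one carries scheduling and memory overhead.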

Then we can avoid counting duplicate names,

or, better, count the unique names with a reduceByKey whose reduce function is the sum.
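Continuing the plain-Python sketch above (same made-up name list), dropping duplicates first mimics Spark's distinct, and the accumulation loop mimics reduceByKey with addition:

```python
from collections import defaultdict

names = ["alice", "bob", "anna", "bruce", "carol", "anna"]

# Drop duplicate names first (what Spark's distinct does),
# then emit (initial, 1) pairs
pairs = [(name[0], 1) for name in set(names)]

# reduceByKey with the sum as the reduce function:
# combine the counts per key
counts = defaultdict(int)
for initial, one in pairs:
    counts[initial] += one

print(sorted(counts.items()))  # -> [('a', 2), ('b', 2), ('c', 1)]
```

Unlike groupByKey, reduceByKey can combine values locally on each partition, so only one partial sum per key crosses the network instead of every name.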

These optimizations reduce the amount of information sent over the wire during the shuffle.

I found this example very instructive [Deep Dive : Spark internals]; however, you should add some sanity checks on the strings

and make sure that you don't run into heap problems when you fix the number of partitions.

You will find more transformations here.
