1. Take a list of names
2. Create buckets by initials
3. Group by keys
4. Map each initial to its set of names and take the size of that set
5. Collect the results on the master as an array
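Below is a minimal sketch of these five steps using Spark's Scala API. The SparkContext setup, the local master, and the tiny in-memory sample of names are my own assumptions for illustration, not part of the original example.

import org.apache.spark.{SparkConf, SparkContext}

object InitialsCount {
  def main(args: Array[String]): Unit = {
    // Assumed setup: local master and a small hard-coded sample of names
    val sc = new SparkContext(new SparkConf().setAppName("InitialsCount").setMaster("local[*]"))

    // 1. Take a list of names
    val names = sc.parallelize(Seq("Alice", "Anna", "Bob", "Bob", "Charlie"))

    // 2. Create buckets by initials: (initial, name) pairs
    val byInitial = names.map(name => (name.charAt(0), name))

    // 3. Group by keys
    val grouped = byInitial.groupByKey()

    // 4. Map each initial to its set of names and take the size of that set
    val counts = grouped.map { case (initial, ns) => (initial, ns.toSet.size) }

    // 5. Collect the results on the master as an array
    val result: Array[(Char, Int)] = counts.collect()
    result.foreach(println)

    sc.stop()
  }
}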
This code is not super-optimized. For instance, we might want to force the number of partitions explicitly (by default the partitioning follows the input's 64 MB chunks),
or, better, count the unique names with a reduceByKey where the reduce function is a sum, as sketched below.
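Here is a sketch of that alternative, reusing the names RDD from above: distinct drops duplicate names, and reduceByKey then sums one 1 per unique name and initial. The partition count of 8 is an arbitrary placeholder, not a recommendation.

// Alternative to groupByKey: count unique names per initial with reduceByKey,
// forcing the number of partitions explicitly (8 is just an example value)
val uniqueCounts = names
  .distinct()                        // drop duplicate names first
  .map(name => (name.charAt(0), 1))  // one (initial, 1) pair per unique name
  .reduceByKey((a, b) => a + b, 8)   // sum the 1s per initial, across 8 partitions

uniqueCounts.collect().foreach(println)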
I found this example very instructive [Deep Dive : Spark internals]; however, you should add some string sanity checks,
and make sure that you don't run into heap problems if you fix the number of partitions.
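As a simple example of such a sanity check (again assuming names is the RDD of raw strings), you can filter out null or blank entries before calling charAt(0), so malformed input does not blow up the map step:

// Drop null / blank strings before extracting the initial
val cleanNames = names
  .filter(n => n != null && n.trim.nonEmpty)
  .map(_.trim)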
You will find more transformations here.