Antonio Gulli's coding playground: Hands on big data - Crash Course on Spark - PageRank

Friday, September 19, 2014

Hands on big data - Crash Course on Spark - PageRank - lesson 8

Let's compute PageRank. Below you will find the definition.

Assume we have two pair RDDs: links, holding (url, neighbors), and ranks, holding (url, rank), where every rank is initialized to 1 or to a uniformly random value. Then, for a fixed number of iterations (or until the ranks no longer change significantly between two consecutive iterations):

we join links and ranks, forming (url, (links, rank)), and pass the result to a flatMap that emits one contribution per destination (dest), with each url's rank split evenly among its links. Then we reduceByKey on the destination, using + as the reduction. The new rank of each url is computed as 0.15 + 0.85 * (sum of the contributions it received).
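The loop above can be sketched in plain Python, without Spark, to make the data flow concrete. This is only an illustration under stated assumptions: the function name `pagerank`, the dict-based inputs, and the list comprehensions standing in for join, flatMap and reduceByKey are hypothetical stand-ins, not Spark's API. It also assumes each url splits its rank evenly among its neighbors.

```python
from collections import defaultdict


def pagerank(links, iterations=10, damping=0.85):
    """Simulate the join / flatMap / reduceByKey loop on plain Python data.

    links: dict mapping url -> list of neighbor urls (hypothetical input shape).
    Returns a dict mapping url -> rank.
    """
    # (url, rank) pairs, every rank initialized to 1.0
    ranks = {url: 1.0 for url in links}

    for _ in range(iterations):
        # "join": keep urls present in both datasets -> (url, (neighbors, rank))
        joined = [(url, (links[url], ranks[url])) for url in links if url in ranks]

        # "flatMap": emit one (dest, contribution) pair per outgoing link
        contribs = [
            (dest, rank / len(neighbors))
            for _, (neighbors, rank) in joined
            for dest in neighbors
        ]

        # "reduceByKey" with + as the reduction
        summed = defaultdict(float)
        for dest, contribution in contribs:
            summed[dest] += contribution

        # new rank = 0.15 + 0.85 * summed contributions (when damping is 0.85)
        ranks = {url: (1 - damping) + damping * total for url, total in summed.items()}

    return ranks
```

As with an inner join, a url that receives no contributions drops out of `ranks` after the first iteration in this sketch. A call such as `pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}, iterations=1)` shows one round of contributions being summed per destination.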

The problem is that each join requires a full shuffle over the network. One way to reduce this overhead is to partition the data with a HashPartitioner, in this case into 8 partitions.
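To see why hash partitioning helps, here is a small plain-Python simulation, not Spark code. It assumes that both datasets are split with the same hash function into 8 partitions, so equal keys always land in the same partition index and the join can run partition by partition. The helpers `partition_of`, `partition_by` and `local_join` are hypothetical names, and `zlib.crc32` merely stands in for the key hash Spark would use. In Spark's Scala API the corresponding step would be along the lines of `links.partitionBy(new HashPartitioner(8))`, typically followed by caching the result.

```python
import zlib

NUM_PARTITIONS = 8  # same number of partitions as in the text


def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Deterministic hash of the key modulo the partition count
    # (crc32 is a stand-in chosen for this sketch, not Spark's hash).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_by(pairs, num_partitions=NUM_PARTITIONS):
    # Distribute (key, value) pairs into num_partitions buckets by key hash.
    parts = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def local_join(left_parts, right_parts):
    # Because both sides use the same partitioning, matching keys sit in the
    # same partition index, so each partition is joined independently.
    joined = []
    for left, right in zip(left_parts, right_parts):
        right_index = dict(right)
        joined.extend((k, (v, right_index[k])) for k, v in left if k in right_index)
    return joined
```

For example, partitioning a list of (url, neighbors) pairs and a list of (url, rank) pairs with `partition_by` and then calling `local_join` yields the same (url, (neighbors, rank)) pairs as a global join, without ever comparing keys across partitions.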