Assume we have a pair RDD of (url, neighbors) and one of (url, rank), where every rank is initialized to 1.0 (or uniformly at random). Then, for a fixed number of iterations (or until the ranks stop changing significantly between two consecutive iterations),
we join links and ranks to form (url, (links, rank)) and flatMap over the destinations, emitting a contribution of rank / numLinks for each destination URL. Then we reduceByKey on the destination using + as the reduction, and compute the new rank as 0.15 + 0.85 * the summed contributions.
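A minimal sketch of this loop in Scala, following the pattern of Spark's bundled PageRank example; the input file, iteration count, and variable names (links, ranks, contribs) are illustrative, not from the post:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("PageRankSketch").setMaster("local[*]")
val sc = new SparkContext(conf)

// Each input line is assumed to be "url neighbor".
val links = sc.textFile("links.txt")
  .map { line =>
    val parts = line.split("\\s+")
    (parts(0), parts(1))
  }
  .groupByKey()                       // (url, Iterable[neighbor])
  .cache()

// Initialize every rank to 1.0.
var ranks = links.mapValues(_ => 1.0)

for (_ <- 1 to 10) {                  // fixed number of iterations
  // (url, (neighbors, rank)) -> one contribution per destination
  val contribs = links.join(ranks).values.flatMap {
    case (neighbors, rank) =>
      val n = neighbors.size
      neighbors.map(dest => (dest, rank / n))
  }
  // Sum the contributions per destination and apply the damping formula.
  ranks = contribs.reduceByKey(_ + _).mapValues(sum => 0.15 + 0.85 * sum)
}
```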
The problem is that each join needs a full shuffle over the network. A way to reduce this overhead is to pre-partition the links with a HashPartitioner (in this case on 8 partitions) and persist them, so that the join is co-partitioned and the shuffle of the links is avoided.
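A sketch of that optimization, under the same illustrative assumptions as above:

```scala
import org.apache.spark.HashPartitioner

// Pre-partition the links once and keep them in memory: every subsequent
// join(ranks) is then co-partitioned, so the links are never reshuffled.
val links = sc.textFile("links.txt")
  .map { line =>
    val parts = line.split("\\s+")
    (parts(0), parts(1))
  }
  .groupByKey()
  .partitionBy(new HashPartitioner(8))  // 8 partitions, as in the post
  .persist()

// mapValues preserves the partitioner, so ranks shares links' partitioning
// and links.join(ranks) needs no network shuffle of the links RDD.
var ranks = links.mapValues(_ => 1.0)
```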
However, one can directly use the SparkPageRank example code distributed with Spark.
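If you have a Spark distribution at hand, that example can be launched with the bundled run-example helper, e.g. `./bin/run-example SparkPageRank <input-file> <num-iterations>` (the exact paths depend on your installation).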
Here I take a toy dataset and compute the PageRank on it; a sketch follows below.
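A hypothetical toy graph (the four pages and their edges are made up for illustration):

```scala
// Tiny made-up link graph: each pair is (source, destination).
val edges = Seq(
  ("a", "b"), ("a", "c"),
  ("b", "c"),
  ("c", "a"),
  ("d", "c")
)
val links = sc.parallelize(edges).groupByKey().cache()
var ranks = links.mapValues(_ => 1.0)

// ... run the iteration loop shown earlier, then inspect the result:
ranks.collect().sortBy(-_._2).foreach { case (url, rank) =>
  println(f"$url has rank $rank%.4f")
}
```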
Other datasets are available here:
https://snap.stanford.edu/data/web-Google.html