Saturday, April 7, 2012

How many servers do you need to index the web?

1 comment:

  1. With techniques such as single-pass in-memory inversion and sort-based inversion, one server can index the entire Web, but this would take years of processing.

    Distributed indexing and storage systems such as MapReduce (or Hadoop) and GFS (or HDFS) scale very well to the size of the Web. With two machines you roughly halve the time; with thousands of machines indexing is done in hours, and with tens of thousands it takes minutes.

    So the real question is: how quickly do you need to index the Web?
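
To put rough numbers on that trade-off, here is a minimal Python sketch. It assumes perfectly linear scaling (no coordination, network, or straggler overhead, which real clusters do have) and a purely hypothetical single-machine time of two years. The function name `indexing_hours` and the figures are illustrative assumptions, not measurements.

```python
HOURS_PER_YEAR = 365 * 24


def indexing_hours(single_machine_hours: float, machines: int) -> float:
    """Idealized indexing time, assuming the work splits evenly across machines."""
    if machines < 1:
        raise ValueError("need at least one machine")
    return single_machine_hours / machines


# Hypothetical assumption: one server needs two years to index the Web.
single_machine_hours = 2 * HOURS_PER_YEAR

for machines in (1, 2, 1_000, 50_000):
    hours = indexing_hours(single_machine_hours, machines)
    print(f"{machines:>6} machines: {hours:10.2f} hours ({hours * 60:12.1f} minutes)")
```

Under these assumptions, a thousand machines bring the job down to well under a day and fifty thousand bring it down to tens of minutes. In practice the speedup is less than linear, so treat these as best-case figures.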