Tuesday, March 10, 2009

Hashing, Shingling and HashTrees

Hashing is my favorite methodology for data mining, machine learning and information retrieval. For computing “similarities” between a set of N objects (such as text, images, videos, web pages, etc), you need two steps:
  1. define a distance metric among objects;
  2. compute the matrix of distances in O(N^2);
In many situations, this is too expensive. Here it comes the magic of hashing. Each object is hashed to a bucket. The hash becomes the fingerprint of the object. Then, you can safely compare only to objects belonging to the same bucket. In many practical application you have a resulting complexity which is almost linear. The process becomes:
  1. define the fingerprint of the object;
  2. map the object to buckets; one different bucket for each different fingerprint;
  3. compare directly the objects in the same bucket.
Shingling is one of my favorite hashing technique. Here a fragment of text (a shingle) is hashed with multiple hash functions. Intuitively, if you take two different fragments the more similar they are, the more frequently multiple hash functions will produce identical results (when they are applied to the fragments). Shingling has a rock solid mathematical foundation. In a previous post, I enclosed a C++ code for shingling.

Another interesting hash technique is the HashTree construction. Here an order set of items (itemset) is hashed with an hash function. The hash of item in position i-th is used to select a specific children of the node currently visited. In short, hashing is used to drive the visit of a path from the root to a leaf. Each leave of the HashTree contains a bucket, where similar itemsets are stored. The main difference with shingling is that order is preserved and that to items are considered as candidate for similarity if their hash collides in the first k positions. It turns our that this property if very useful for computing frequent itemsets in the Apriori algorithm. Here you have the C++ code for HashTree.


  1. Very useful blog. Thank you ..

  2. the c++ code is corrupted. Can you please check that.