Saturday, October 2, 2010

Large Scale Tree Data Structures: Reading a fascinating paper on succinct data structure

Giuseppe pushed me to investigate one area where I had rather old knowledge. During the past 3 years, we had an amazing progress in representing succinct data structure. Paolo and Roberto, please forgive my ignorance about this specific area. I am just a new apprentice.

We are all familiar with Trees, and we assume that pointers are an required evil needed for encoding the structure of a tree. Anyway, why do you need to spend 64bit pointers, when you can represent the structure with just few bits? Those ideas are not just theory, they are used in real implementation of Large Scale Tree data Structures that needs to fit in memory.

Let's start from this example taken from the paper Representing Trees of Higher Degree (E. Demain is one of the authors and his studies about Khipu are fascinating, but this is another story...)

An ordinal tree T can be represented as a bit array. The degree of each tree (number of sons) is encoded with a sequence of 1 terminated by a 0. All the nodes are represented with a level degree visit of T (commas are used just for helping the reader but we don't need to represent them). This representation is known as Level Ordered Unary Degree Sequence (LOUDS). Using auxiliary data structures for Rank and Select operations, LOUDS supports in constant time, finding the parent, the ith child, and the degree of a node.
• The number of 0 in the binary array is the number of node in the three.
• Each node is son of another node and therefore has a 1 representing it (we need to find which 1 is the correct one)
• As a consequence, we need at least a space of 2n, plus something for navigation

Now, we need to introduce two operations which can be computed in constant time with additional 0(n) space (see the paper).
• Rank1(j) returns the # of 1 up and including position j. Rank0(j) is a similar operation for 0
• Select1(j) returns the position of of the j-th 1. Select0(j) is a similar operation for 0
Given the two above operations we can navigate the encoded representation of the tree. For instance:
1. Compute the degree of a node in position p. This is the number of 1 up to the next 0 and can be computed as Select0(rank0(p)+1) - p
2. Compute the parent of a three as in Select1(rank0(p)+1)
3. Compute the ith children of a tree;
Et voila. No more costly pointers, just bit-vectors and some additional space. In total no more than 2n + o (n). Looks like magic, but it is not. Think about an heap where you represents a binary three with no pointers. LOUDS are using a similar idea but they are more generic.

Another magic representation is based on balanced parenthesis and it is due to Munro and Raman. Suppose you visit the three depth first and open a parenthesis on the way down, and close the parenthesis on the way up. The ordinal tree is encoded again ad a bitvector but now we have the additional benefit that the nodes of a subtree are very close in the bitvector. This is a consequence of the depth first visit. It is possible to navigate the three with Rank & Select operations.

Benoit, Demain, Munro, Raman proposed a variant of LOUDS based on depth first named Depth First Unary Degree Sequence (DFUDS), which preserves locality and still allows tree navigation in terms of Rank & Select. In addition, they proposed further optimization for finding the ith children of a node, if the tree is a trie with k outdegree.