Sunday, March 29, 2009

Linear Suffix Array

One article out of three thousands is beautiful. "Simple Linear Work Suffix Array Construction" (Juha Karkkainen and Peter Sanders, 2003) is a simple and elegant linear algorithm for building suffix array in linear time. Back in 2003, When I saw the article the very first time back, I thought: "This is a recursive diamond".

Suffix Arrays (SA) are a powerfull tool used in many different fields such as genetics, string matching, compression, and many others. A suffix array A of T is just a sorted array of suffixes of T. To avoid quadratic space we store only the indices of the suffixes, instead of full suffixes of T. Given SA(T), one can search for occurrences of P directly on the suffix array using binary search in O(|P| lg |T|) time. This can be improved to O(|P| + lg |T|).

SA were originally developed by Gene Myers (ex-Celera) and Udi Manber (now at Google). For more than a decade, the direct construction of suffix array was O(N^2). SA were built in linear time indirectly by building an intermediate suffix tree (which are expensive from the point of view of memory). SA is equivalent to an in-order traversal of the leaves of the suffix tree.

Here a succint description of the algorithm:
1. Sort Σ, the alphabet of the T in linear time by using radix sort;
2. Replace each letter in the text with its rank among the letters in the text;
3. Divide the text T into 3 parts:
• T0 = < (T[3i ], T[3i + 1], T[3i + 2]) , i>=0 . . . >
• T1 = < (T[3i + 1], T[3i + 2], T[3i + 3]) , i=0>
• T2 = < (T[3i + 2], T[3i + 3], T[3i + 4]) , i >= 0 >
1. Recurse on the concatenation of T0 and T1. When this recursive call returns, we get T0 and T1 sorted in a suffix array. After that, we need is to sort the suffixes of T2, and merge them with the old suffixes to get suffixes of T. This is the point where the algorithm become a diamond: the construction of the suffixes of T2 and the linear merge.
2. Note that by construction T2[i:] = T[3i+2:]= (T[3i+2], T[3i+3:]) = (T[3i+2], T0[i+1:]). Therefore we can have T2[i:] from the already built To[i+1]
3. Merge the sorted suffixes of T0, T1, and T2. Remember that suffixes of T0 and T1 are already sorted, so comparing a suffix from T0 and a suffix from T1 takes constant time. To compare against a suffix from T2, we decompose it again to get a suffix from either T0 or T1;
Since this sorting and merging is done in linear time, we get a recursion formula T(n)=T(2/3n)+O(n), which gives linear time. You can prove it by substitution -- intuitively note that 2/3 C++ code for suffix array contained therein. It is neat, simple and instructive.

1. 2. 