Thursday, February 12, 2009

Longest common substring, subsequences, increasing subsequences

String gurus get excited about long things. It turns out that this kind of theoretical research has a lot of practical applications.
  1. Longest Common Substring (LCS): it has applications in biology (gene analysis), Web IR (clustering, feature extraction) and many other fields. Given two strings S and T with |S| = n and |T| = m, finding the longest common substring takes O(nm) time with dynamic programming. Alternatively, a generalized suffix tree (GST) for the two strings can be built in linear time, O(n + m), if the size of the alphabet is constant. Once the GST is built, you get the LCS by visiting it bottom-up: annotate each internal node with the strings that have a suffix in its subtree, then pick the deepest node (by string depth) annotated with both S and T. The visit is linear in the total length of the strings.

  2. Longest Common Subsequence (LCSseq): substrings are made up of consecutive symbols, while subsequences need not be. For instance, in web clustering you may want to "skip" some symbols; in that case you have to use subsequences instead of substrings. Likewise, if you have two documents, you may want to see where they differ (have you ever used the diff program?). Two interesting properties lead to a dynamic programming solution with complexity O(nm), where Xi and Yj denote the prefixes of X and Y of length i and j:

    • LCSseq(Xn, Ym) = LCSseq(Xn-1, Ym-1) followed by xn, if Xn and Ym both end with the same symbol (xn = ym)
    • LCSseq(Xn, Ym) = the longer of LCSseq(Xn-1, Ym) and LCSseq(Xn, Ym-1), if Xn and Ym do not end with the same symbol
    • LCSseq(Xi, {}) = LCSseq({}, Yj) = {} (the empty sequence, of length 0)

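    Memoization avoids recomputing subproblems, which brings the naive recursion down from exponential time to O(nm). Space can be reduced by storing only the subproblems actually visited, for example in a hash table.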

  3. Shortest Common Supersequence (SCS): a common supersequence of minimal length. Think about merging two documents. For two input sequences, an SCS can easily be formed from a longest common subsequence by inserting the symbols that are not part of it while preserving the symbol order of each input. Its length is n + m minus the length of the longest common subsequence.

  4. Longest Increasing Subsequence (LIS): here symbols have an order (lexicographical, numerical, etc.) and you want to find the longest subsequence that respects this order. The longest increasing subsequence of a sequence S is the longest common subsequence of S and T, where T is the result of sorting S (with duplicates removed, if the subsequence must be strictly increasing). This reduction is quadratic, but it is possible to give an O(n log n) algorithm.
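
As a rough illustration of items 1 to 3, here is a minimal Python sketch of the O(nm) dynamic programs. The function names are made up for this example and do not come from any library, the suffix-tree approach is not shown, and the full table is kept for simplicity rather than for space efficiency.

```python
def longest_common_substring(s, t):
    """Longest run of consecutive symbols shared by s and t, in O(len(s) * len(t))."""
    best_len, best_end = 0, 0
    prev_row = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        row = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Extend the common suffix ending at s[i-1], t[j-1].
                row[j] = prev_row[j - 1] + 1
                if row[j] > best_len:
                    best_len, best_end = row[j], i
        prev_row = row
    return s[best_end - best_len:best_end]


def lcs_table(x, y):
    """table[i][j] = length of a longest common subsequence of x[:i] and y[:j]."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


def longest_common_subsequence(x, y):
    """Recover one longest common subsequence by walking the table backwards."""
    table = lcs_table(x, y)
    i, j, out = len(x), len(y), []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def shortest_common_supersequence(x, y):
    """Same walk as above, but also emit the symbols that are not in the LCS."""
    table = lcs_table(x, y)
    i, j, out = len(x), len(y), []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            out.append(x[i - 1])
            i -= 1
        else:
            out.append(y[j - 1])
            j -= 1
    # At most one of the two prefixes is still non-empty here.
    out.extend(reversed(x[:i]))
    out.extend(reversed(y[:j]))
    return "".join(reversed(out))
```

For example, `longest_common_substring("xabcdy", "zabcdw")` gives `"abcd"`, and `longest_common_subsequence("AGGTAB", "GXTXAYB")` gives `"GTAB"`. Several supersequences of minimal length may exist; the sketch returns only one of them.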

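For item 4, one common way to reach O(n log n) is to keep, for each length, the smallest possible tail of an increasing subsequence of that length and to update it by binary search. The sketch below assumes a strictly increasing subsequence is wanted; the function name is invented for this example.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq in O(n log n)."""
    tail_values = []         # tail_values[k]: smallest tail of an increasing run of length k + 1
    tail_index = []          # position in seq of that tail
    prev = [-1] * len(seq)   # predecessor links used to rebuild the answer
    for i, x in enumerate(seq):
        # First position whose tail is >= x; x extends the run that ends just before it.
        k = bisect_left(tail_values, x)
        if k == len(tail_values):
            tail_values.append(x)
            tail_index.append(i)
        else:
            tail_values[k] = x
            tail_index[k] = i
        prev[i] = tail_index[k - 1] if k > 0 else -1

    # Follow the predecessor links back from the tail of the longest run.
    result = []
    i = tail_index[-1] if tail_index else -1
    while i != -1:
        result.append(seq[i])
        i = prev[i]
    return result[::-1]
```

On `[3, 1, 4, 1, 5, 9, 2, 6]` this returns `[1, 4, 5, 6]`, which has the same length, 4, as the other valid answers such as `[3, 4, 5, 9]`.
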
1 comment:

  1. I don't know if this thread is still active, but I have a very basic question. Given that you are looking for the longest common substring with exactly k mismatches, would that still be considered a longest common substring problem or a longest common subsequence problem?