Thursday, March 5, 2009

Web mis-spell, edit distance and graph search

"Britnay spers" who is that girl? ... Hmm this name is a common misspell of the "Britney Spears". Misspelling is an important part of Web search. You may want to suggest the "correct" and "closer" alternative query, when a user submits the "wrong" one. Under certain conditions, you may want to re-write the query in a transparent way and provide directly search results for the "correct" alternative query.

Here it is a set of spelling related problems.
  1. Given two strings s, t such that |s|=n, |t|=m, compute the edit distance(s, t). You may assume different costs for insert, delete, and substitution. You can make this in O(mn) and space O(min { m, n } )
  2. Given a set of words S = { w1, ... wk } an accettable bidirectional mutation (wi <-> wj) for the word wi is a word wj, such that edit_distance(wi, wj) < T. Suppose you can access a large query log Q of queries previously submitted to a search engine. Suppose that a user rewrites the word wi into the word wj within a distance D of queries submitted during her search session. In this case, you call the mutation (wi -> wj) real. Note that this mutation is in one direction only. A real accettable mutation is an accettable mutation which is both real and acceptable. A frequent real mutation is a mutation which appears in at least F different users' sessions. Given a configuration (Q, T, D, F) find all the frequent, real and acceptables word mutations. What is the optimal complexity in time and space?
  3. Find the diameter of the graph G formed considering all the frequent, real, acceptable word mutations (wi -> wj).
  4. A sentence is an order sequence of words. For instance, "britney spears" is different from "spear britnery". How can you deal with mispelling in case of sentences?
Here you have the C/C++ code for edit distance.

No comments:

Post a Comment