Sunday, January 23, 2011

Search with wildcards

You are given a collection of n documents D. The collection is static (i.e. no new documents are added to the list) but is very large (e.g. it cannot fit in memory).

You need to retrieve all the documents matching queries where wild cards are allowed. A wild card * will match any sequence of characters, and a wile card ? will match a single character.

Implement a C++ solution.

Quite Difficult.


  1. Got something against egrep? At least given the constraints you put down, seems like by far the fastest way to get this done. Why reinvent the wheel if you don't have to?

  2. This problem can be quit difficult or extremely difficult depending on the type of wildcard query you want to implement:
    a) trailing wildcards query: e.g. lo* will find "look" or "looking" or "lock"
    b) middle wildcards query: e.g. l*t will find "lost" or "last"
    c) leading wildcards query: e.g. *ok will find "look" or "book"

    So you can have:
    1) a
    2) a+b
    3) a+b+c

    Again it depends if you can keep the "inverted index" completely in memory or not.


  3. Implementing finite state machine can help.