Friday, November 27, 2009

When you want to optimize the monetization...

Not the parameter for a search engine. Great article from Danny.

Thursday, November 26, 2009

Search Layers

Search engines tend to organize their indexes in layers. The intuition is this: if "enough" "good" documents are retrieved from the i-th layer, then there is no need to go to the (i+1)-th layer. Questions:

1. How do you partition the documents?
2. What is "good"?
3. What is "enough"?

Please formalize your answers.
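
One possible way to make the three questions concrete is sketched below. Everything in it is an assumption for illustration, not how any particular engine works: the partition is simply given as a list of layers, "good" means a document score at or above a hypothetical `threshold`, and "enough" means at least `k` good documents. The names `tiered_search`, `score`, `k` and `threshold` are made up for this sketch.

```python
from typing import Callable, List, Sequence, Tuple


def tiered_search(
    layers: Sequence[Sequence[str]],
    score: Callable[[str], float],
    k: int,
    threshold: float,
) -> Tuple[List[str], int]:
    """Return up to k good documents and the number of layers visited.

    Assumptions (not from the post): a document is "good" when
    score(doc) >= threshold, and k good documents are "enough".
    """
    good: List[Tuple[float, str]] = []
    visited = 0
    for layer in layers:
        visited += 1
        for doc in layer:
            s = score(doc)
            if s >= threshold:
                good.append((s, doc))
        if len(good) >= k:
            break  # enough good documents: do not touch deeper layers
    # Best-scoring documents first
    good.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in good[:k]], visited
```

With this framing, the open design questions become how to choose the partition so that most queries stop at the first layer, and how to set `k` and `threshold` so that stopping early rarely hides a better document in a deeper layer.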

Wednesday, November 25, 2009

Distribution of Facebook Users (~300M worldwide)


Facebook claims to have more than 300 million users worldwide. I sampled their ads user database and found the following geographical distribution of users:
  1. 34.32% are in the U.S.
  2. 8.15% are in the U.K.
  3. 5.05% are in France
  4. 4.99% are in Canada
  5. 4.50% are in Italy
  6. 4.42% are in Indonesia
  7. 2.65% are in Spain
Here you can find the distribution of all the other nations.

Tuesday, November 24, 2009

Realtime Web search: an interview

Microsoft Bing’s Antonio Gulli Talks with Talis

Monday, November 23, 2009

Weights and Scale: a variation.

You have 6 weights that appear almost identical, but 3 of them are slightly heavier than the remaining 3. The 6 weights come in 3 colours: red, green and blue. For each colour there is a pair of weights, and one weight in the pair is slightly heavier than the other. How many times do you need to use a two-pan balance scale to identify the lighter weights?

Sunday, November 22, 2009

Ranking teams

Suppose that you have N teams playing in a football league. Each team i has a rank parameter r_i > 0, and assume there are no ties. Questions:
  1. What is the probability that team i beats team j?
  2. What is the probability of the results of a whole season (each team plays against all the others)?
  3. Find an algorithm to rank the teams
Please provide your ideas.
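
One candidate answer, offered as an assumption rather than the intended solution, is the Bradley-Terry model: team i beats team j with probability r_i / (r_i + r_j), the probability of a season is the product of these terms over all games, and the r_i can be estimated by maximizing that product. The sketch below uses the standard iterative (minorization-maximization) update for this model. The function names and the `wins` matrix layout are made up for illustration.

```python
import math
from typing import List


def win_probability(r_i: float, r_j: float) -> float:
    """Assumed Bradley-Terry form: P(i beats j) = r_i / (r_i + r_j)."""
    return r_i / (r_i + r_j)


def season_log_likelihood(wins: List[List[int]], r: List[float]) -> float:
    """Log-probability of a season; wins[i][j] = times team i beat team j."""
    n = len(wins)
    return sum(
        wins[i][j] * math.log(win_probability(r[i], r[j]))
        for i in range(n)
        for j in range(n)
        if i != j and wins[i][j] > 0
    )


def fit_ranks(wins: List[List[int]], iterations: int = 200) -> List[float]:
    """Estimate rank parameters by the iterative update
    r_i <- W_i / sum_j n_ij / (r_i + r_j),
    where W_i is the total wins of team i and n_ij the games between i and j.
    Assumes at least one game was played; a winless team drifts to rank 0.
    """
    n = len(wins)
    r = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = 0.0
            for j in range(n):
                games = wins[i][j] + wins[j][i]
                if j != i and games > 0:
                    denom += games / (r[i] + r[j])
            updated.append(total_wins / denom if denom > 0 else r[i])
        scale = n / sum(updated)  # fix the scale: ranks sum to n
        r = [x * scale for x in updated]
    return r
```

Sorting the teams by the fitted values gives the ranking. The parameters are only identified up to a common factor, which is why the sketch rescales them after every pass.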

Saturday, November 21, 2009

Random Decision Trees

Random Decision Trees are an interesting variant of Decision Trees. Here are the key elements:
  1. Different training sets are generated from the N objects in the original training set by a bootstrap procedure, which samples at random with replacement, so the same example can be drawn multiple times. Each sample generates a different tree, and all the trees together are seen as a forest.
  2. The random trees classifier takes the input feature vector, classifies it with every tree in the forest, and outputs the class label that received the majority of "votes".
  3. Each node of each tree is trained on a random subset of the variables. The size of this subset is a training parameter (typically sqrt(#features)). The best split criterion is chosen considering only the randomly sampled variables.
  4. Because of the bootstrap sampling, each tree leaves some training elements out, and these can be used for evaluation. In particular, for each left-out vector, find the class that got the majority of votes among the trees that did not see it during training, and compare it to the ground-truth response.
  5. The classification error estimate is computed as the ratio of the number of misclassified left-out vectors to the number of all vectors in the original data.
As you know, I love the idea of Boosting (such as AdaBoost) and the ideas behind Random Decision Trees are quite intriguing too.
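
The five steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: splits are chosen by Gini impurity, features are numeric, labels are not tuples (tuples are used for internal nodes), and names such as `build_tree`, `train_forest` and `oob_error` are made up for this sketch.

```python
import math
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, n_features, rng, depth=0, max_depth=5):
    # Leaf: pure node or depth limit reached, so return the majority label
    if len(set(labels)) == 1 or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]
    best = None  # (weighted impurity, feature index, threshold)
    # Step 3: only a random subset of the variables is considered at this node
    for f in rng.sample(range(len(rows[0])), n_features):
        for t in sorted({row[f] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or impurity < best[0]:
                best = (impurity, f, t)
    if best is None:  # sampled features are constant here: no split possible
        return Counter(labels).most_common(1)[0][0]
    _, f, t = best
    lo = [i for i, row in enumerate(rows) if row[f] <= t]
    hi = [i for i, row in enumerate(rows) if row[f] > t]
    return (
        f,
        t,
        build_tree([rows[i] for i in lo], [labels[i] for i in lo], n_features, rng, depth + 1, max_depth),
        build_tree([rows[i] for i in hi], [labels[i] for i in hi], n_features, rng, depth + 1, max_depth),
    )


def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are labels
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    n_features = max(1, int(math.sqrt(len(rows[0]))))
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample, drawn with replacement
        sample = [rng.randrange(n) for _ in range(n)]
        tree = build_tree([rows[i] for i in sample], [labels[i] for i in sample], n_features, rng)
        forest.append((tree, set(sample)))  # remember which rows this tree saw
    return forest


def predict(forest, row):
    # Step 2: majority vote over all trees
    votes = Counter(predict_tree(tree, row) for tree, _ in forest)
    return votes.most_common(1)[0][0]


def oob_error(forest, rows, labels):
    # Steps 4 and 5: vote only with trees that never saw the row
    wrong = 0
    for i, (row, y) in enumerate(zip(rows, labels)):
        votes = Counter(predict_tree(tree, row) for tree, seen in forest if i not in seen)
        if votes and votes.most_common(1)[0][0] != y:
            wrong += 1
    return wrong / len(rows)
```

A nice property of the left-out estimate is that it comes for free: no separate validation set is needed, since each tree misses roughly a third of the training examples in its bootstrap sample.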