Tuesday, February 9, 2010

Towards Recency Ranking in Web Search

Academia and search R&D are publishing more and more papers about recency ranking. I am pretty excited about that, since I spent the last 3 years on this topic at both Ask.com and Bing.com.

Towards Recency Ranking in Web Search is a high-quality paper from Yahoo! about recency ranking. Its main contribution is twofold: it presents a query classifier for recency and a ranking model for recent results.

The query classifier builds two models, representing the Content and the Query data at time t, respectively. The two models are then compared across different instants of time, and a query is considered recent if its probability of being generated increases between two instants. This approach is interesting. Nevertheless, there are queries that require fresh results even though they are constantly observed (such as "Obama", "Britney Spears", "stock quotation", etc.).
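To make the idea concrete, here is a minimal C++ sketch of my reading of it, not the paper's implementation. It assumes unigram language models with add-one smoothing, whitespace tokenization, and a rule that flags a query as recent when its log-probability grows under either the content model or the query model. The names `UnigramModel`, `buzz_score` and `is_recency_query`, and the `threshold` parameter, are all hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>

// Unigram language model with add-one smoothing (an assumption of this sketch).
class UnigramModel {
public:
    // Add whitespace-separated terms observed in one time slot.
    void add_text(const std::string& text) {
        std::istringstream in(text);
        std::string term;
        while (in >> term) {
            ++counts_[term];
            ++total_;
        }
    }

    // Log-probability that this model generates the query, term by term.
    double log_prob(const std::string& query) const {
        // The extra +1 reserves probability mass for unseen terms.
        const double denom = static_cast<double>(total_ + counts_.size() + 1);
        std::istringstream in(query);
        std::string term;
        double lp = 0.0;
        while (in >> term) {
            auto it = counts_.find(term);
            const double c =
                (it == counts_.end()) ? 0.0 : static_cast<double>(it->second);
            lp += std::log((c + 1.0) / denom);
        }
        return lp;
    }

private:
    std::unordered_map<std::string, std::size_t> counts_;
    std::size_t total_ = 0;
};

// Positive when the query is more likely under the current model
// than under the earlier one.
double buzz_score(const UnigramModel& earlier, const UnigramModel& current,
                  const std::string& query) {
    return current.log_prob(query) - earlier.log_prob(query);
}

// Hypothetical decision rule: recent if either the content side or the
// query-log side shows an increase above the threshold.
bool is_recency_query(const UnigramModel& content_earlier,
                      const UnigramModel& content_current,
                      const UnigramModel& query_earlier,
                      const UnigramModel& query_current,
                      const std::string& query, double threshold = 0.0) {
    return buzz_score(content_earlier, content_current, query) > threshold ||
           buzz_score(query_earlier, query_current, query) > threshold;
}
```

A query such as "Obama" that is constantly frequent would score close to zero here, which is exactly the limitation noted above.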

The ranking model aims at learning a ranking function based on four categories of recency-related features: timestamp features, linktime features, webbuzz features and page classification features. The ranking algorithm is GBrank. To solve the recency data insufficiency problem, the authors explored several modeling approaches that utilize regular ranking data. In the compositional model, the normal ranking output is used as a training feature, while in the over-weighting model the normal ranking output is used together with recency features and an empirically optimal weight is derived. In the adaptation model, training data from normal ranking is used to learn a regression tree model, which is then fine-tuned with recency ranking data.

The final algorithm combines regular ranking data and recency ranking data as training data, and empirically determines the relative weights for these two data sources. The evaluation set is made up of 70,131 query-url pairs collected over a period of four months (Feb.∼May, 2009) and judged by humans, and the evaluation is based on NDCG metrics. One final result is worth mentioning. In the paper, linktime features are the most important of all the recency features. Quoting the authors: "Thus, recency is competing with popularity, which is usually indicated by link-based features and click-based features. This leads to the interesting topic on how to appropriately deal with the relationship between recency and popularity".
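
As a reminder of the evaluation metric mentioned above, one common textbook formulation of NDCG at rank k is given below. The post does not restate which gain and discount the authors use, so this is not necessarily their exact variant. Here rel_i is the human grade of the result at position i, and IDCG@k is the DCG@k of the ideal ordering.

```latex
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
```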

Monday, February 8, 2010

Compute all the items which appear more than p% of the time

Write the C++ code for an optimal algorithm -- both in time and space.
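
One possible answer, sketched under a few assumptions: the stream is held in a `std::vector` so that it can be read twice, and p is a percentage in (0, 100]. The sketch uses the Misra-Gries frequent-items summary with ceil(100/p) - 1 counters, followed by a second pass that verifies the surviving candidates. The function name `frequent_items` is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <vector>

// Returns every item occurring in more than p% of the positions.
// Expected O(n) time with hashing, O(100/p) extra space.
template <typename T>
std::vector<T> frequent_items(const std::vector<T>& stream, double p) {
    if (!(p > 0.0)) throw std::invalid_argument("p must be positive");

    // With m counters, any item with count > n/(m+1) is guaranteed to
    // survive the first pass; m = ceil(100/p) - 1 makes n/(m+1) <= p% of n.
    const std::size_t m = static_cast<std::size_t>(std::ceil(100.0 / p)) - 1;
    std::vector<T> result;
    if (m == 0 || stream.empty()) return result;

    // Pass 1: Misra-Gries summary, keeping at most m candidates.
    std::unordered_map<T, std::size_t> counters;
    for (const T& x : stream) {
        auto it = counters.find(x);
        if (it != counters.end()) {
            ++it->second;
        } else if (counters.size() < m) {
            counters.emplace(x, 1);
        } else {
            // No free counter: decrement all, dropping those that hit zero.
            for (auto c = counters.begin(); c != counters.end();) {
                if (--c->second == 0) c = counters.erase(c);
                else ++c;
            }
        }
    }

    // Pass 2: count the exact occurrences of the surviving candidates.
    for (auto& entry : counters) entry.second = 0;
    for (const T& x : stream) {
        auto it = counters.find(x);
        if (it != counters.end()) ++it->second;
    }
    for (const auto& entry : counters) {
        if (100.0 * static_cast<double>(entry.second) >
            p * static_cast<double>(stream.size())) {
            result.push_back(entry.first);
        }
    }
    return result;
}
```

The first pass alone can leave false positives among the candidates, which is why the verification pass is needed. The order of the returned items is unspecified because it follows the hash map's iteration order.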

Sunday, February 7, 2010

Compute all the items which appear more than 50% of the time

Given a stream of symbols over a finite alphabet, compute all the items which appear more than 50% of the time.
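
A hint: at most one item can appear more than 50% of the time, so a single candidate and a single counter are enough. Below is a minimal C++ sketch of the Boyer-Moore majority vote, assuming the stream is stored in a `std::vector` so that a second pass can confirm the candidate. The name `majority_items` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Returns the item occurring in more than half of the positions, if any.
// The result holds zero or one element. O(n) time, O(1) extra space.
template <typename T>
std::vector<T> majority_items(const std::vector<T>& stream) {
    std::vector<T> result;
    if (stream.empty()) return result;

    // Pass 1: pair off differing items; a true majority item always
    // ends up as the remaining candidate.
    std::size_t candidate = 0;  // index of the current candidate
    std::size_t counter = 0;
    for (std::size_t i = 0; i < stream.size(); ++i) {
        if (counter == 0) {
            candidate = i;
            counter = 1;
        } else if (stream[i] == stream[candidate]) {
            ++counter;
        } else {
            --counter;
        }
    }

    // Pass 2: the candidate may be a false positive, so count it exactly.
    std::size_t occurrences = 0;
    for (std::size_t i = 0; i < stream.size(); ++i) {
        if (stream[i] == stream[candidate]) ++occurrences;
    }
    if (2 * occurrences > stream.size()) result.push_back(stream[candidate]);
    return result;
}
```

If the stream cannot be replayed, the first pass still yields the only possible answer, but there is then no way to confirm that it really exceeds 50%.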

Saturday, February 6, 2010

Slides for LinkedIn People Search



Thanks to Greg for pointing them out. Interesting work at LinkedIn, based on customizations of Lucene.

Friday, February 5, 2010

Beautiful video on Twitter Creation


Twitter Code Swarm from Ben Sandofsky on Vimeo.

Bing to power Facebook Search

Second, we are extending our cooperation outside the US, bringing the Bing-Facebook search integration to the more than 400 million people using Facebook around the world.

Thursday, February 4, 2010

Anatomy of a Large-Scale Social Search Engine

The Aardvark Q&A engine is going to WWW10:
  • Users can ask questions in natural language, not keywords
  • Content is generated “on-demand”, tapping the huge amount of information in peoples’ heads
  • The system is fueled by the goodwill of its users
  • 87.7% of questions sent to Aardvark got answered (very high answer rate!)
  • 75.0% of users who asked Aardvark a question also answered a question for someone else (very high participation rate!)
  • 70.4% of answer feedback had a rating of ‘good’ as opposed to ‘ok’ or ‘bad’ (high quality!)