Sunday, February 28, 2010

Ramdrive and Memory FS

Modern file systems automatically cache files in memory. Still, it is sometimes useful to work with a memory file system into which you pre-load all the data.

On Unix this is typically done by mounting a memory file system:
 mount_mfs -s 20m swap /work
On Windows (XP/Vista/7), I found the RAMDisk software from Dataram very useful: it creates what looks like a normal disk, but backed by RAM.

Saturday, February 27, 2010

Friday, February 26, 2010

Random Permutations

Let A[0 .. n-1] be an array where each A[i] contains a number from [0, n), each value with probability 1/n. Will the array A contain a uniformly random permutation of the numbers 0, ..., n-1? Remember that there are n! permutations of the numbers in [0, n).
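One way to reason about the question is by counting. Assuming the cells are filled independently, there are n^n equally likely arrays but only n! of them are permutations, so the array is a permutation with probability n!/n^n, which shrinks quickly as n grows. The sketch below computes that probability exactly and, for comparison, shows a Fisher-Yates shuffle as one standard way to draw a uniform permutation. The function names are illustrative, not from the post.

```python
import random
from fractions import Fraction
from math import factorial


def prob_independent_fill_is_permutation(n):
    """Probability that n independent uniform draws from [0, n) are all distinct."""
    return Fraction(factorial(n), n ** n)


def fisher_yates(n, rng=random):
    """Return a uniformly random permutation of 0 .. n-1 (Fisher-Yates shuffle)."""
    a = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        a[i], a[j] = a[j], a[i]
    return a
```

For n = 3 the probability is 6/27 = 2/9, so even tiny arrays filled this way usually contain repeats.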

Thursday, February 25, 2010

Get k tags out of n tags

You have a dataset of N documents, with N tags associated with each document. How many documents should you process before seeing, with high probability, k unique tags, for a given k << N?

Wednesday, February 24, 2010

Largest span of increasing pair in an array of integers

Given an array of integers A[N], find the maximum value of (j-k) such that A[k] <= A[j] and j > k.
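A possible linear-time approach, sketched here under the assumption that returning 0 is an acceptable way to signal "no valid pair": precompute prefix minima and suffix maxima, then sweep two pointers over them. The name `largest_span` is made up for this example.

```python
def largest_span(a):
    """Max (j - k) with j > k and a[k] <= a[j]; 0 if no such pair exists."""
    n = len(a)
    # left_min[i] = min(a[0..i]); right_max[j] = max(a[j..n-1])
    left_min, right_max = list(a), list(a)
    for i in range(1, n):
        left_min[i] = min(left_min[i - 1], a[i])
    for j in range(n - 2, -1, -1):
        right_max[j] = max(right_max[j + 1], a[j])

    best = 0
    i = j = 0
    while i < n and j < n:
        if left_min[i] <= right_max[j]:
            best = max(best, j - i)  # some k <= i and some j' >= j form a pair
            j += 1
        else:
            i += 1
    return best
```

Both helper arrays are monotone, which is why each pointer only ever moves forward and the sweep takes O(N) time with O(N) extra space.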

Tuesday, February 23, 2010


Let A[1 .. n] be an array of n distinct numbers. If i < j and A[i] > A[j], then the pair (i, j) is called an inversion of A. Suppose that each element of A is chosen randomly, independently, and uniformly from the range 1 through n. Compute the expected number of inversions.
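A small sanity check, taking an inversion to be a pair (i, j) with i < j and A[i] > A[j], and taking the independent uniform model literally (so repeated values are possible). By linearity of expectation, each of the n(n-1)/2 pairs is inverted with probability (1 - 1/n)/2, which gives (n-1)^2/4. The sketch compares that closed form against brute-force enumeration for small n; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def count_inversions(a):
    """Count pairs (i, j) with i < j and a[i] > a[j] by brute force."""
    n = len(a)
    return sum(1 for i in range(n) for j in range(i + 1, n) if a[i] > a[j])


def expected_inversions_exhaustive(n):
    """Average inversion count over all n**n arrays with entries in 1 .. n."""
    total = sum(count_inversions(a) for a in product(range(1, n + 1), repeat=n))
    return Fraction(total, n ** n)


def expected_inversions_formula(n):
    """C(n, 2) pairs, each inverted with probability (n - 1) / (2n)."""
    return Fraction(n * (n - 1), 2) * Fraction(n - 1, 2 * n)
```

If the elements are instead required to form a uniform random permutation (all distinct), each pair is inverted with probability 1/2 and the expectation becomes n(n-1)/4.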

Sunday, February 21, 2010

Malloc and Free

A certain program makes use of malloc and free several times. However, you forgot to free one block of memory allocated by one of the malloc calls. What strategy would you use to find where the unpaired malloc is located?
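One common strategy is to wrap malloc and free so that every call logs the block address and the call site, then pair the events up afterwards. The sketch below shows only the pairing step, on a hypothetical trace format of (operation, address, call site) tuples; the format and the function name are assumptions made for this example, not the output of any real tool.

```python
def find_unpaired_allocations(trace):
    """Return {address: call_site} for blocks allocated but never freed.

    `trace` is a hypothetical list of (op, address, call_site) tuples, as a
    logging wrapper around malloc/free might emit them.
    """
    live = {}
    for op, address, call_site in trace:
        if op == "malloc":
            live[address] = call_site
        elif op == "free":
            # Drop the block; the allocator may hand the address out again later.
            live.pop(address, None)
    return live
```

Whatever is left in the dictionary at the end of the run points at the malloc that was never matched by a free.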

Saturday, February 20, 2010

Find what is missing

You are given an unsorted list of n-1 distinct integers from the range 1 to n. Write a linear-time algorithm to find the missing integer. Pay attention to potential overflow.
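A possible overflow-free approach is to XOR all the integers from 1 to n together with all the values in the list: every number that is present cancels itself out, leaving only the missing one. Python integers do not overflow, so this sketch only illustrates the idea; the point is that, in a fixed-width language, XOR never exceeds the width of its operands, unlike the sum n(n+1)/2.

```python
def find_missing(values, n):
    """Return the integer in 1 .. n absent from `values` (n - 1 distinct ints)."""
    acc = 0
    for i in range(1, n + 1):
        acc ^= i
    for v in values:
        acc ^= v  # present values cancel out, the missing one survives
    return acc
```

This runs in O(n) time with O(1) extra space.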

Friday, February 19, 2010

Find all the patterns in a string

Find all the patterns which are present in the character array C. A pattern is a sub-array of 2 or more chars that occurs more than once.
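A brute-force baseline, assuming the array is given as a Python string and that overlapping occurrences count separately: enumerate every substring of length 2 or more and keep those seen more than once. The name `repeated_patterns` is made up for this sketch.

```python
from collections import Counter


def repeated_patterns(chars, min_len=2):
    """Map each substring of length >= min_len occurring more than once to its count."""
    counts = Counter(
        chars[i:j]
        for i in range(len(chars))
        for j in range(i + min_len, len(chars) + 1)
    )
    return {pattern: c for pattern, c in counts.items() if c > 1}
```

There are O(n^2) substrings, so this is only practical for short inputs; a suffix array or suffix tree would be the usual route to something faster.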

Wednesday, February 17, 2010

Google Now Includes MySpace Status Updates in Real-Time Search Results

MySpace and Google just announced that starting today, status updates from MySpace users will appear in Google's real-time search.

Tuesday, February 16, 2010

Google Buzz Kills Auto-Follow on Privacy Concerns

"Then there's advertising, which relies on data collection. Google wouldn't probably be as succe$sful if they didn't know that web users aren't very web savvy. Plus, when web users get a hint of data collection, they tend to swing to the other side of the pendulum and overstate the issue, ignoring things like anonymized data.

Perhaps Google got a little too comfortable with Internet ignorance. They certainly struck a nerve with the autofollowing and autosharing of Buzz. Still, it's doubtful this will do any serious damage to Google. The psychological relief of getting what you want from a company (in this case, greater perceived privacy) is easier than changing your email and search habits."

source: searchenginewatch

Monday, February 15, 2010

One fly and two colliding trains.

Two trains are running toward each other at 50 km/h each. They are 100 km apart and will soon collide. A fly is flying between the two trains: it moves in one direction and, as soon as it touches a train, it flies back in the other direction. The fly's speed is 30 km/h. How much distance will the fly cover?
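The shortcut is to ignore the zig-zag and ask how long the fly is in the air. Assuming it flies at constant speed until the trains meet, the distance is just speed times time. A tiny sketch of that arithmetic, with an illustrative function name:

```python
def fly_distance(gap_km, train_speed_kmh, fly_speed_kmh):
    """Distance covered by the fly before two equally fast trains collide."""
    hours_until_collision = gap_km / (2 * train_speed_kmh)  # closing speed is 2v
    return fly_speed_kmh * hours_until_collision
```

With the numbers above, the trains close the 100 km gap at 100 km/h, so they collide after one hour and the fly covers 30 km.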

Sunday, February 14, 2010

Carl Icahn selling off Yahoo shares

"Carl Icahn has substantially cut his stake in Yahoo, according to regulatory filings made public Friday.

The billionaire investor had just under 12 million shares of Yahoo at the end of 2009, according to the new filing with the Securities and Exchange Commission. That compares with more than 60 million shares he held last summer"

Saturday, February 13, 2010

Dark times at Yahoo? Looking for the next Steve Jobs

"I said: “many in this audience want Yahoo to be competitive and succeed but remain skeptical that you can.” Larry Cornett responded that he was at Apple during the “dark time” before the return of Steve Jobs, when critics were calling for the company to shut down. He likened the media perceptions of Yahoo Search now to that period in Apple’s history and promised that Yahoo Search would “be back.” He added that many of Yahoo’s innovations were being freely copied by its competitors but in more superficial ways."

Friday, February 12, 2010

Google to buy Aardvark

Aardvark, a company that lets you use IM, Twitter and e-mail to ask full-text questions and then get answers from people in or close to your social network, confirmed it signed a deal with Google.

Thursday, February 11, 2010

IAC writes down value of search unit by nearly $1 billion

Will the next step be a merger? I bet so.

"New York-based IAC (NASDAQ: IACI) wrote down $991.9 million from the goodwill on IAC Search & Media, the part of its business that contains [...] and also the much smaller [...]. Goodwill is a company’s guess about the future earning power of an asset or company it has bought. It’s the difference between the price paid for the asset and its book value on the balance sheet. [...] started life in Berkeley in 1996 as Ask Jeeves -- it was an early dot-com darling. Later, the business moved to a tower in downtown Oakland and was bought by IAC in 2005 in a deal that valued it at $1.85 billion."

Wednesday, February 10, 2010

New comscore out

Microsoft sites grew January core search volume by 49.6% Y/Y, Google grew search volume by 16.7% Y/Y, Ask grew January core search volume by 15.5% Y/Y, and Yahoo! January core search volume decreased by 8.9% Y/Y.
source: business insider

Tuesday, February 9, 2010

Towards Recency Ranking in Web Search

Academia and search R&D labs are publishing more and more papers about recency ranking. I am pretty excited about that since I spent the last 3 years on this topic both in and in

Towards Recency Ranking in Web Search is a high-quality paper from Yahoo! about recency ranking. The main contribution of the paper is twofold: it presents a query classifier for recency and a ranking model for recent results.

The query classifier builds two models, representing the Content and the Query data at time t, respectively. The two models are then compared at different instants of time, and a query is considered recent if its probability of being generated increases between two different instants. This approach is interesting. Nevertheless, there are queries that need fresh results even if they are constantly observed (such as "Obama", "Britney Spears", "stock quotation", etc.).

The ranking model aims at learning a ranking function based on four categories of recency-related features: timestamp features, linktime features, webbuzz features and page classification features. The learning algorithm is GBrank. To solve the recency data insufficiency problem, the authors explored several modeling approaches that make use of regular ranking data. In the compositional model, the normal ranking output is used as a training feature, while in the over-weighting model the normal ranking output is used together with recency features and an empirically optimal weight is derived. In the adaptation model, training data from normal ranking is used for learning a regression tree model, which is then fine-tuned with recency ranking data.

The evaluation set is made up of 70,131 query-url pairs, collected over a period of four months (Feb.∼May, 2009) and judged by humans; the evaluation is based on NDCG metrics. One final result is worth mentioning: in the paper, linktime features are the most important of all the recency features. Quoting the authors: "Thus, recency is competing with popularity, which is usually indicated by link-based features and click-based features. This leads to the interesting topic on how to appropriately deal with the relationship between recency and popularity."

Monday, February 8, 2010

Compute all the items which appear more than p% of the time

Write the C++ code for an optimal algorithm -- both in time and space.
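One well-known candidate is the Misra-Gries frequent-items summary, sketched here in Python rather than the C++ the exercise asks for, to keep it close to pseudocode. It assumes 0 < p and that the input can be scanned twice (the second pass removes false positives); `frequent_items` and its parameters are names chosen for this example.

```python
from collections import Counter


def frequent_items(items, p):
    """Return {item: count} for items occurring in more than p percent of `items`."""
    m = int(100 / p)  # fewer than 100/p items can exceed p%, so m counters suffice
    counters = {}
    for x in items:
        if x in counters:
            counters[x] += 1
        elif len(counters) < m:
            counters[x] = 1
        else:
            # No free counter: decrement everything and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]

    # Second pass: the summary may hold false positives, so count exactly.
    threshold = len(items) * p / 100
    exact = Counter(x for x in items if x in counters)
    return {x: c for x, c in exact.items() if c > threshold}
```

The summary keeps at most floor(100/p) counters regardless of the input length, and every decrement is paid for by an earlier increment, so the first pass is linear overall.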

Sunday, February 7, 2010

Compute all the items which appear more than 50% of the time

Given a stream of symbols over a finite alphabet, compute all the items which appear more than 50% of the time.
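At most one item can appear more than 50% of the time, and the Boyer-Moore majority vote finds the only possible candidate in one pass with a single counter. The sketch below assumes the input is a sequence that can be scanned a second time to confirm the candidate; on a true one-pass stream only the candidate is available. The function name is illustrative.

```python
def majority_item(items):
    """Return the item occurring in more than half of `items`, or None."""
    candidate, count = None, 0
    for x in items:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1  # pair off one occurrence of the candidate with x

    # Second pass: the survivor is only a candidate until it is verified.
    occurrences = sum(1 for x in items if x == candidate)
    return candidate if occurrences * 2 > len(items) else None
```

For example, `majority_item(list("aabab"))` returns `"a"`, while `majority_item(list("abab"))` returns `None` because no symbol exceeds half of the input.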

Saturday, February 6, 2010

Slides for LinkedIn People Search

Thanks to Greg for pointing them out. Interesting work at LinkedIn based on customizations of Lucene.

Friday, February 5, 2010

Beautiful video on Twitter Creation

Twitter Code Swarm from Ben Sandofsky on Vimeo.

Bing to power Facebook Search

Second, we are extending our cooperation outside the US, bringing the Bing-Facebook search integration to the more than 400 million people using Facebook around the world.

Thursday, February 4, 2010

Anatomy of a Large-Scale Social Search Engine

Aardvark Q&A engine going to WWW10
  • Users can ask questions in natural language, not keywords
  • Content is generated “on-demand”, tapping the huge amount of information in peoples’ heads
  • The system is fueled by the goodwill of its users
  • 87.7% of questions sent to Aardvark got answered (very high answer rate!)
  • 75.0% of users who asked Aardvark a question also answered a question for someone else (very high participation rate!)
  • 70.4% of answer feedback had a rating of ‘good’ as opposed to ‘ok’ or ‘bad’ (high quality!)

Wednesday, February 3, 2010

AOL will use Google, once again

So Bing will probably power Yahoo, and Google will power AOL.
Who is left out of this list?

Monday, February 1, 2010