Antonio Gulli's coding playground: May 2009

Saturday, May 30, 2009

Fresh answers and old ones

Sometime the answer to a question is changing according to time. In this case, Cerberus bougth Crysler for 7.4 Billions in 2007. Anyway, the company is currently evaluating a Chapter 11 and it sold a large part of his assets to Fiat, an Italian company.

Business is a circle. Fiat was in deep deep troubles 5 years ago. Then, everything changed when Sergio Marchione, the new manager accepted to run the company as CEO.

Ask is nailing the fresh answer, Google and Live give the old one, Yahoo is not nailing the story.

Estimating the ImpressionRank of Web Pages

ImpressionRank is the number of times users viewed the page, while browsing search results returned by the search engine. In turn, ImpressionRank has an intuitive correlation with power law search query distribution.

Estimating the ImpressionRank of Web Pages proposes a number of methodologies for estimating ImpressionRank. It is based on sampling of search suggestion services provided by search engines and on extraction of popular keywords extracted by web pages. The paper achieves quite interesting results in terms of recall and convergence speed.

I am pretty sure that the work is quite interesting for all the SEO companies.

Friday, May 29, 2009

Twitter Search is soaring

Search is alive and kicking

In the last couple of months we have seen many completely new initiatives.

Are we now out of the crisis? I say, YES we are!

Microsoft Bing, a serious new competitor in the arena. In my opinion Bing is moving in a direction of more focused search. There are some ideas inspired by Vivisimo, Seachme , and my old Snaket for the categories. Some other ideas taken from the Ask3D interface. A much better ranking, and more important a new way of accessing and organizing search results. It seems that Microsoft wants to invest more money in search and is a very good news for the whole sector. I will follow their performance in the next months.
Wolfram Alpha, is something very different and also very difficult to evaluate. They aims at having all systematic knowledge immediately computable by anyone. A very ambitious goal. For this reason they have my full respect. They provide amazing results for some queries and very poor results for other queries. Precision is very high, recall is low. I will observe their performance more and more in the next months.
Many proposals for RealTime Search with players such as Twitter Search, OneRiot, FriendFeed, and the many others. This is where freshness is important.

Having different search options is an absolute need for a truly open access to the information. Information belongs to people, and search engines are the gateway to it. Therefore is very good to say that Search is alive and kicking.

Thursday, May 28, 2009

Fresh Answers to Fresh Questions

Yahoo CEO One-on-One

What is bing?

Wednesday, May 27, 2009

Real Time Semantic Correlations

Algorithm is the key. Always. Here I am talking about fresh correlations among automatically extracted fresh semantic entities. This is a funny story, but I just wanted to explain a technology we had since many years.

Noemi Letizia is the new supposed italian Monica Lewinsky. Someone is saying that she has an affair with Silvio Berlusconi, this is not proved. But there are rumors. Veronica Lario is Berlusconi's wife. She asked for a divorce due to this story. Gino Flaminio is the ex-boyfriend of Noemi Letizia. He had an important interview with the main italian newspaper. I guess you know why Clinton is there.

Algorithm nailed the true essence of the story

Tuesday, May 26, 2009

Real time Semantic Search

Well everyone is talking about Semantic Search ( have you seen Wolfram Alpha ?). You know, I live in this place so far away from all the rest ;-) But here in this little and wasted land we keep asking a question ;-) !!

Who is noemi letizia?

Now do you know the answer? Did you ever hear about it?

I think that Semantic Search needs Freshness and realtime data as well

Monday, May 25, 2009

Modern Fresh Art

Is this a form of art? I believe so.

http://www.neoformix.com/Projects/TwitterStreamGraphs/view.php

http://www.neoformix.com/Projects/TwitArcs/TwitArcs.html

http://www.neoformix.com/Projects/TwitterSpectrum/TwitterSpectrum.html

Sunday, May 24, 2009

Google vs. the Real-Time Web

Google vs. the Real-Time Web is a nice article by the way of Gigaom.

"By contrast, real-time discovery engines like Twitter and Facebok use a more dynamic kind of democracy, linking to content that users finds worthwhile. As a result, content on the web is splitting into two basic models, and understanding this distinction makes clear why Google’s centralized role is being threatened"

I like the comments:

"There is no time available to develop meta data that separates the wheat from the chaff, or in recent terms, the stupid bacon jokes from real news about Swine Flu"
"It has to have a past to give people’s reactions time to develop. But yes, right now if you say realtime, you certainly can get funded. In the Valley, at least."

This is what I think:

I partially agree with Adam. We already saw some working Real time web search. Namely News blending into Web search. Now, Real time is not just news blending. It’s much more, but the news blending experience may be leveraged there.

Using Graphics Processors for High Performance IR Query Processing

Using Graphics Processors for High Performance IR Query Processing is a paper exploring the use of CUDA GPU as co-processor for serving search queries. The idea is quite intriguiging since GPUs may potentially offer amazing performances at very low cost.

The authors propose a parallel sum prefix based Rice encoding and a PForDelta encoding.
Rice coding encodes an integer (the gap between two consecutive docIDs) by choosing a number of bits b such that 2^b is close to the average of all the gaps, and then representing each integer
as q · 2^b + r for some r < 2b. Then the integer is encoded in a unary part, consisting of q 1s followed by a 0, and a binary part of b bits representing r. PForDelta first selects a value b such that most gaps are less than 2^b, and then uses an array of b-bit values to store all gaps less then 2^b while all other gaps are stored in a special format as exceptions.

The paper describes gap encoding, decoding, and merge operations. In addition, it discusses how to process ranked query, skip lists, dijunctive and conjiuntive queries.

Performances are good, a single server with a GPU and a CPU we can sustain a query arrival rate beyond 300 q/s, versus less than 100 for CPU only. The index size was ~25M documents.

Quite an interesting paper, indeed.

Saturday, May 23, 2009

Learning to Tag

Learning to Tag is a Microsoft paper about suggesting tags associated to Flicker images. The authors use traditional textual co-occurences and visual features (such as colour histogram, and moment). These features are combined by using a RankBoost learning process.

The results are compared with a naive linear combination of ranking signals, and with simple tag co-occurences. It seems that Microsoft likes to combine different ranking signals with RankBoost, since I saw many papers describing variations of this idea and the performance they achieve seems quite good.

Friday, May 22, 2009

American Idol: Freshness, Variety and UI

Who is the winner of american idol? Let's see what is the performance of various search engines. I cannot vote since I am involved in the contest. Anyway, these are the criteria:

1) Freshness, how fast do they broke the news
2) Accurateness, how much accurate are the results
3) Variety, do they provide a sense of variety in them (not just 10 blue links?)
4) UI, is the user experience good?

What is your opinion?

Ask.com has the winner on the top with News, Video and Audio

Google has the official site, wikipedia, and youtube videos

Yahoo has the news with images

Live has the news with images

Searchme has the best UI all around (IMO)

Twitter has all the rumors, and they broke the story

Hmm ... actually OneRiot broke the story !!

Mapping the World's Photos

Mapping the World's Photos is a fascinating paper from Jon Kleimberg et al. They used textual features and SIFT image signatures to geo-localize a large sample of Flicker images. It seems that the two different class of features show a mutual benefit.

It's fascinating to discover what are the most photographed world landmarks.. Applestore in 5th avenue, NYC is the 5th most photograped place in the world, while Rome's colosseum is just 41th...

Thursday, May 21, 2009

Mining Interesting Locations and Travel Sequences From GPS Trajectories

Mining Interesting Locations and Travel Sequences From GPS Trajectories . Hmm ... when I read this paper from Microsoft, I though "...we will follow you ". The key idea is to rank users and places using a HITS-based approach since there is an evident mutual reinforcement propriety among them. 107 people were tracked for one year.

Wednesday, May 20, 2009

Network Analysis of Collaboration Structure in Wikipedia

Network Analysis of Collaboration Structure in Wikipedia is a study about topic polarization and edit activities on Wikipedia. It seems that there are topics where people starts neverending revert wars. I wonder why the authors are not talking about spam and bots here....

Tuesday, May 19, 2009

Understanding User's Query Intent with Wikipedia

"Understanding User's Query Intent with Wikipedia" is a nice paper from Microsoft. It explains how to classify Web queries in order to trigger results from different verticals (such as news, images, video, travel, shopping).

Each category is bootstrapped with few keywords chosen by editors. These descriptions are then automatically expanded using a random walk on wikipedia's categories and concepts. The semantic concepts are then extracted by using Gabrilovich's esa. Results are provided on Live search query log for July 2007 (which has just ~2.6M frequent queries). Precision, Recall and F1 measures are quite impressive and this generic solution can compete with the best ad-hoc KDD2005 classification competition result.

Query classification is a very important topic for Search Engines, and leveraging Wikipedia is definitevely a good idea. (see also my previous posting for Yahoo's query classification)

Multidimension Scaling

Multidimension Scaling is a technique for projecting multi-dimensional points in a 2-d plan, so that the distances as preserved as much as possible.

Here you have a C++ code skeleton with STL and boost.

Monday, May 18, 2009

Range minimum queries and LCA

Range Minimum Query (RMQ) is used on arrays to find the position of an element with the minimum value between two specified indices. There is a trivial quadratic solution based on dynamic programming. I found this tutorial complete and useful. Many other solutions exist. In particular, I appreciated the one based on Segment Trees. A bookmark, and I plan to revisit it soon.

Sunday, May 17, 2009

Unsupervised Query Categorization using Automatically-Built Concept Graphs

Unsupervised Query Categorization using Automatically-Built Concept Graphs is a paper from Yahoo! for automatic query classification without any training phase. The key idea is to build a network of cross-referencing terms extracted from a search engine's snippets. Categories are then hooked in this graph, by using few seed terms. The categorization algorithm is a variation of a truncated random walk. The results are compared with a basic SVM classifier and with KDD2005. Yahoo! UK is using this system in production.

Saturday, May 16, 2009

Level statistics of words: Finding keywords in literary texts and symbolic sequences

Francesco pointed out this paper "Level statistics of words: Finding keywords in literary texts and symbolic sequences" where they extract relevant word from a single text document by leveraging word positions and frequency, without the need of a training corpus. This are the kind of papers I like since they create a bridge between different research fields.

Friday, May 15, 2009

45 watches and a table

There are 45 watches randomly ordered on a table. Prove that the sum of the distances among the center of the table and the ends of the minutes hands can be greater than the sum of the distances among the center of the table and the centers of the watches.

Thursday, May 14, 2009

The new frontier: Real-Time Search-Off

A wonderfult article from TechCrunch about Real Time Web search. The new frontier.

Wednesday, May 13, 2009

Y/N questions and T/F Men

Another variant, this time a bit harder. Men can say the Truth or the False, to any type of Yes or No question. They decide artbitrarly. Identify the type of man with just one question.

Tuesday, May 12, 2009

A bag of white and black balls

You have a bag with n balls, with colours either white or black. What is the probability to extract a sequence of two white balls.

Monday, May 11, 2009

Yahoo is going the (Search) Monkey and Pad way

Yahoo is trying to enrich the ten blue links:

Y/N questions and T/F Men

In a country there are men who always says the truth (T), men we always lie (F) and men who will alternatively lie or say the truth (ToF). Men can only answer Yes or No, if this kind of answer make sense. With only one question can you determine the type of man you are facing?

Sunday, May 10, 2009

Cut a 3D cube in slices

You have a 3x3x3 3D cube, like the rubik cube. Cut it into 27 pieces with the minum number of cuts.

Saturday, May 9, 2009

5 weights, how to identify them

You have 5 different weights of 1, 2, 3, 4, 5 grams. They are indentical when you observe them. Can you state how many different weighting you need by using a balance which can compare two weights.

Friday, May 8, 2009

Logic puzzle: the yes/no world

There is world where the people answers just "Yes" or "No". People are divided in two groups: the ones who always says the truth and the ones who always says the false. Can you formulate a question which cannot be answered by any of the two groups but is a yes/no question?

Thursday, May 7, 2009

Train to heaven or hell

You can go to heaven or to hell. Trains arrive with the same interval, namely 10 minutes. Heaven is on the south platform, hell is on the north platform. You take the first one which will arrive. How it comes that you have 9/10 probabilities to go to the hell?

Wednesday, May 6, 2009

Alice, Bob and the extraction of 3 numbers

Alice and Bob extract 3 numbers between 1..9 with no replacement. Can you suggest a winning strategy to Alice? The goal is to extract 3 numbers whose sum is 15.

Tuesday, May 5, 2009

Online Expansion of Rare Queries for Sponsored Search

Online Expansion of Rare Queries for Sponsored Search is a Yahoo paper which discusses how to expand rare queries for serving matching ads. The emphasis is on efficience. The key intuition is to find related "sentences" by hitting the search engine for the rare queries. The related sentences are extracted from search results snippets. High efficience is achieved by using an offline pre-processing step which aims at creating an offline index of related sentences.

The improvement in precision-recall is relevant, but i wonder why the authors are using human evaluators instead of a direct evaluation based on ads click rate for the expanded query set.

Another relevant result is distribution of rare queries

I wonder where the fresh queries are located here. I assume that "swine flu" was not so popular on the web until few months ago. Examples like "swine flu" are more common than one can imagine at first glance.

Monday, May 4, 2009

Tag Ranking

Tag Ranking is a paper from Microsoft which discusses how to rank the flat collections of tags associated to Flicker images. The key idea is to weight tags according to (a) their probabilistic relevance (a slightly sophisticated variation of well-known TF*IDF, where the probability density function is estimated with a KDE approach) and to (b) a random walk on a tag similarity graph.

The results obtained on a limited testbed are encorauging, but I want to see how this scales on a large database of tagged images. In addition, I would have leveraged the taggers and not just the tags... Maybe an idea for a future work paper.

Sunday, May 3, 2009

A Search-based Method for Forecasting Ad Impression in Contextual Advertising

A Search-based Method for Forecasting Ad Impression in Contextual Advertising is a paper from Yahoo to forecast ads monetization based on previous exposition of the similar ads. The estimation is precise when there is a large amout of previous data available.

This is a seminal paper, but there is a lot of room for improvement. I assume that there are three types of ads:

Those always shown such as "real estate"
Those with periodic peak and a slow growth like "u2 concert"
Those with instananeous and fast growth like "swine flu vaccine"

It seems that the paper is not addressing type 3

Saturday, May 2, 2009

Quicklink Selection for Navigational Query Results

Quicklink Selection for Navigational Query Results is a Yahoo paper about selecting QuickLinks by using toolbar click data. An interesting application of traffic rank, indeed.

Anyway, I wonder wether one can improve a lot the performance by taking into account some linguistic analysis on the suggested tags and titles.

Friday, May 1, 2009

An axiomatic approach for result diversification

Search engines have different goals, when they return 10 blue links. One goal is relevance. Another one is variety. Yet another one is freshness. The list if very long.

An axiomatic approach for result diversification is a Microsoft paper providing an axiomatic framework for addressing both relevance and variety. Very interesting paper, indeed.