Thursday, July 30, 2009

Constructing Folksonomies from User-Specified Relations on Flickr

Constructing Folksonomies from User-Specified Relations on Flickr is a simple paper about merging different user-generated hierarchies on Flickr. A single user-generated hierarchy is seen as a node. Two nodes can be merged when their support is above what the null hypothesis would predict.

Wednesday, July 29, 2009

Tuesday, July 28, 2009

A Class-Feature-Centroid Classifier for Text Categorization

A Class-Feature-Centroid Classifier for Text Categorization is a nice paper where a simple variant of TF-IDF-like weighting is used to classify text. I like the paper: it shows how simple ideas can be very effective, even compared to state-of-the-art solutions like SVM.
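Here is a minimal sketch of the general idea, a toy centroid classifier with plain TF-IDF-style weights and cosine similarity (my own toy code, not the paper's exact class-feature-centroid formula):

// Minimal sketch of a centroid-based text classifier (plain centroids plus
// cosine similarity, not the paper's exact class-feature-centroid weighting).
// Documents are pre-tokenized bags of words; all names here are illustrative.
#include <cmath>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Doc = std::map<std::string, double>;  // term -> weight

Doc centroid(const std::vector<Doc>& docs) {
    Doc c;
    for (const Doc& d : docs)
        for (const auto& [term, w] : d) c[term] += w;
    for (auto& [term, w] : c) w /= docs.size();
    return c;
}

double cosine(const Doc& a, const Doc& b) {
    double dot = 0, na = 0, nb = 0;
    for (const auto& [t, w] : a) {
        na += w * w;
        auto it = b.find(t);
        if (it != b.end()) dot += w * it->second;
    }
    for (const auto& [t, w] : b) nb += w * w;
    return (na && nb) ? dot / (std::sqrt(na) * std::sqrt(nb)) : 0.0;
}

int main() {
    // Two toy classes, each represented by its centroid.
    std::vector<Doc> sports = {{{"ball", 1}, {"goal", 1}}, {{"ball", 1}, {"team", 1}}};
    std::vector<Doc> finance = {{{"stock", 1}, {"market", 1}}, {{"stock", 1}, {"bond", 1}}};
    Doc cs = centroid(sports), cf = centroid(finance);

    Doc query = {{"ball", 1}, {"match", 1}};
    std::cout << (cosine(query, cs) > cosine(query, cf) ? "sports" : "finance") << "\n";
}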

Sunday, July 26, 2009

A Game Based Approach to Assign Geographical Relevance to Web Images

A Game Based Approach to Assign Geographical Relevance to Web Images is a paper presenting a simple game where players collaborate to assign geographical tags to images.

Fun reading, indeed... but can you make it a product?

Friday, July 24, 2009

StatSnowball: a Statistical Approach to Extracting Entity Relationships

Can you extract relations among entities at Web scale? This is a very hard problem to address. For instance, these days "Neil Armstrong" is related to "Buzz Aldrin" and to "Apollo". That is an easy one, but he is also related to "Barack Obama", and that is a bit more difficult since it depends on this particular moment in time.



A common approach is to provide a set of NLP rules and learn (1) what the entities are and (2) what the relations among the entities are.

The paper "StatSnowball: a Statistical Approach to Extracting Entity Relationships" introduces an interesting methodology where new rules can be derived from a small set of examples. In order to achieve this goal, a bootstrapping techniques is adopted as in the figure.

Thursday, July 23, 2009

Learning to Recognize Reliable Users and Content in Social Media with Coupled Mutual Reinforcement

Learning to Recognize Reliable Users and Content in Social Media with Coupled Mutual Reinforcement is a paper about learning the importance of user-generated answers in the Yahoo! Answers community. They leverage the mutual reinforcement between users, questions, and answers, and use logistic regression to learn the optimal weights. Pretty good paper, indeed!

Wednesday, July 22, 2009

Delta and Gamma encoding: a great paper from Yahoo

Web index compression is one of the most interesting research fields. I have stressed its similarity with traditional bandwidth reduction in the past.

Compressed Web Indexes is a great theoretical result from Yahoo!. They provide a strong bound for delta and gamma encoding: the index size is about 1/3 of the document size. Now we have a valid justification for this empirical observation. Another nice side observation is that the term distribution follows a double Pareto law and not just a Zipf law.
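For reference, here is a toy sketch of Elias gamma and delta coding, the integer codes analyzed in the paper (bit strings rather than a packed bit stream):

// Minimal sketch of Elias gamma and delta coding for positive integers.
#include <cstdint>
#include <iostream>
#include <string>

// Gamma code: floor(log2 n) zeros, then the binary representation of n.
std::string gamma_code(uint64_t n) {
    int len = 0;
    for (uint64_t v = n; v > 1; v >>= 1) ++len;   // len = floor(log2 n)
    std::string bits(len, '0');                    // unary prefix
    for (int i = len; i >= 0; --i)                 // binary of n, MSB first
        bits += ((n >> i) & 1) ? '1' : '0';
    return bits;
}

// Delta code: gamma-code the bit length (1 + floor(log2 n)), then the binary
// representation of n without its leading 1 bit.
std::string delta_code(uint64_t n) {
    int len = 0;
    for (uint64_t v = n; v > 1; v >>= 1) ++len;
    std::string bits = gamma_code(len + 1);
    for (int i = len - 1; i >= 0; --i)
        bits += ((n >> i) & 1) ? '1' : '0';
    return bits;
}

int main() {
    for (uint64_t n : {1, 2, 5, 9, 100})
        std::cout << n << "  gamma=" << gamma_code(n)
                  << "  delta=" << delta_code(n) << "\n";
}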

Great results on delta encoding, Mr. Yahoo!

Tuesday, July 21, 2009

Improved Techniques for Result Caching in Web Search Engines

Caching is an important aspect of modern search engines. There are some assumptions that I see in academic papers which might be different in industrial search. In particular:

1) Inverted lists are nowadays stored in memory; they are no longer stored on disk;
2) There is a periodic effect over the 24 hours: there are morning queries and late-night queries;
3) Geo-localization is very important. What are the benefits and the costs?
4) Verticals may invalidate the cache. For instance, you need to mix fresh news with some cached results. How do you deal with that? What are the benefits and the costs?

The paper Improved Techniques for Result Caching in Web Search Engines models web search caching as a weighted caching problem. Query results have the same size, but they may have different benefits. In this sense, caching is no longer just the problem of maximizing the hit ratio: it becomes the problem of maximizing the benefit.
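A toy sketch of the benefit-aware idea, with a made-up cache that simply evicts the entry of lowest benefit (an illustration, not the paper's actual policies):

// Fixed-capacity result cache that evicts the cached query with the lowest
// benefit (e.g., the estimated cost to recompute the result).
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>

class BenefitCache {
    size_t capacity_;
    std::unordered_map<std::string, std::pair<double, std::string>> entries_;  // query -> (benefit, result)
public:
    explicit BenefitCache(size_t capacity) : capacity_(capacity) {}

    const std::string* lookup(const std::string& query) const {
        auto it = entries_.find(query);
        return it == entries_.end() ? nullptr : &it->second.second;
    }

    void insert(const std::string& query, double benefit, const std::string& result) {
        if (entries_.size() >= capacity_ && !entries_.count(query)) {
            // Evict the entry with the smallest benefit (linear scan for simplicity).
            auto victim = entries_.begin();
            for (auto it = entries_.begin(); it != entries_.end(); ++it)
                if (it->second.first < victim->second.first) victim = it;
            entries_.erase(victim);
        }
        entries_[query] = {benefit, result};
    }
};

int main() {
    BenefitCache cache(2);
    cache.insert("cheap query", 1.0, "result A");
    cache.insert("expensive query", 10.0, "result B");
    cache.insert("medium query", 5.0, "result C");                          // evicts "cheap query"
    std::cout << (cache.lookup("cheap query") ? "hit" : "miss") << "\n";     // miss
    std::cout << (cache.lookup("expensive query") ? "hit" : "miss") << "\n"; // hit
}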

I would like to see more papers where the above 4 considerations are taken into account.

Monday, July 20, 2009

Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs

Can you learn how to re-rank results, suggest similar queries, or recommend related URLs from past query sessions?

The Microsoft paper "Towards Context-Aware Search by Learning A Very Large Variable Length Hidden Markov Model from Search Logs" defines an HMM with a variable number of hidden states. The learning process is based on a variant of the EM algorithm. A clustering pre-processing step is adopted to reduce the search space of the model, and a large-scale computation based on the MapReduce model is proposed.

I think this paper gives a general solution to many different search problems. Still, the precision and recall evaluations leave a lot of room for improving the quality of the results.
I wonder if similar methodologies are adopted by Yahoo's recently launched SearchPad.

Sunday, July 19, 2009

Detecting the Origin of Text Segments Efficiently

How do you detect the original news article (or document) in a stream of articles (documents) with near-duplicates? This is a generalization of the classical shingling problem, where the order of the documents matters. For instance, in a news engine context you may want to identify the source that broke the story.

Detecting the Origin of Text Segments Efficiently is a Google paper which adopts different methodologies for generating Rabin shingles. The shingles are hashed into a fixed-size cache, and different cache eviction strategies are used to deal with online processing. At query time, an estimation step is used to guess the origin of each shingle. The best combination of generation, eviction, and estimation is able to detect the origin document with an accuracy of 80-90% while using just 1.4% of the shingled tokens! Pretty impressive!!
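A minimal sketch of the shingling idea, with a simple per-window polynomial hash instead of proper Rabin fingerprinting and without the paper's caching machinery:

// Word-level shingles of length k, hashed with a toy polynomial hash.
#include <cstdint>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::vector<uint64_t> shingle_hashes(const std::string& text, size_t k) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    for (std::string tok; in >> tok; ) tokens.push_back(tok);

    std::vector<uint64_t> hashes;
    if (tokens.size() < k) return hashes;
    const uint64_t base = 1315423911ULL;
    for (size_t i = 0; i + k <= tokens.size(); ++i) {
        uint64_t h = 0;
        for (size_t j = 0; j < k; ++j)
            h = h * base + std::hash<std::string>{}(tokens[i + j]);
        hashes.push_back(h);
    }
    return hashes;
}

int main() {
    auto a = shingle_hashes("the quick brown fox jumps over the lazy dog", 3);
    auto b = shingle_hashes("a quick brown fox jumps over a lazy dog", 3);
    size_t common = 0;
    for (uint64_t h : a)
        for (uint64_t g : b)
            if (h == g) { ++common; break; }
    std::cout << common << " shared shingles out of " << a.size() << "\n";
}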

I wonder how many times the origin document is just the first one produced in temporal order.

Saturday, July 18, 2009

Caching and Similarity

Can you increase the hit ratio of a cache when you allow a similarity match instead of an exact match?

Yahoo's paper "Nearest-Neighbor Caching for Content-Match Applications" answers this question in the context of search ads matching. Similarity is defined by using LSH, and different cache eviction strategies are compared.
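A toy sketch of similarity lookup with random-hyperplane (cosine) LSH signatures; the paper's exact LSH family and eviction policies are of course different:

// Entries are stored with an LSH signature; a lookup counts as a "similarity
// hit" when some cached signature is within a small Hamming distance of the
// query's signature. All dimensions and vectors below are made up.
#include <bitset>
#include <iostream>
#include <random>
#include <string>
#include <utility>
#include <vector>

constexpr int kDims = 8;   // toy feature dimensionality
constexpr int kBits = 16;  // signature length

using Signature = std::bitset<kBits>;

Signature make_signature(const std::vector<double>& v,
                         const std::vector<std::vector<double>>& planes) {
    Signature s;
    for (int b = 0; b < kBits; ++b) {
        double dot = 0;
        for (int d = 0; d < kDims; ++d) dot += v[d] * planes[b][d];
        s[b] = dot >= 0;  // which side of the random hyperplane
    }
    return s;
}

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double> g(0.0, 1.0);
    std::vector<std::vector<double>> planes(kBits, std::vector<double>(kDims));
    for (auto& p : planes) for (double& x : p) x = g(rng);

    // Tiny "cache": (signature, cached value) pairs.
    std::vector<std::pair<Signature, std::string>> cache;
    cache.emplace_back(make_signature({1, 1, 0, 0, 1, 0, 0, 1}, planes), "ad A");
    cache.emplace_back(make_signature({0, 0, 1, 1, 0, 1, 1, 0}, planes), "ad B");

    Signature q = make_signature({1, 1, 0, 0, 1, 0, 0.1, 1}, planes);  // near ad A
    const int threshold = 3;
    for (const auto& [sig, value] : cache) {
        size_t dist = (sig ^ q).count();
        std::cout << value << ": Hamming distance " << dist
                  << (dist <= threshold ? "  -> similarity hit" : "") << "\n";
    }
}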

One observation I have is that ads continuously change their costs and benefits as time passes. As a consequence, the cache should prevent a similarity match with an ad whose value has changed in the meantime. I think this is an additional constraint which has not been considered in the paper.

Friday, July 17, 2009

Bid optimization for Broad Match Ad Auctions

Can you bid on an ad network for a query q and have an automatic broad match for a set of queries T somehow related to q? Note that both q and all the queries in T have a cost and an associated utility: what you potentially gain on one side can be lost on the other. In addition, you, the advertiser, have no control over which queries are in T. You just select q and the engine creates T for you. This interesting question is answered by the Google paper Bid optimization for Broad Match Ad Auctions.

Two models are presented. In the first model (bidding on queries), the advertiser can bid on all the queries with broad match; in this case, a linear programming solution is proposed. In the second model (bidding on keywords), the advertiser can bid on a subset of keywords, each as an exact or broad match; in this case, the problem is NP-hard for any reasonable approximation factor. A constant-factor approximation is then presented for the case where the profit exceeds the cost.

The paper is a seminal one. I expect to see more good models and algorithms exploring the "broadness" of a bid auction. For instance, I think an interesting variant could be when an advertiser bids on an ad network. In this case the bid can be distributed to different places (engines), with different values of v(q) and (q). Even q can be broad, in the sense that different places can have different matching strategies for q (e.g., maybe dropping some words and similar tricks).

Thursday, July 16, 2009

Yahoo showing muscles

I like Carol Bartz! She stands up and says, "This is what we've done." Everyone else considers her mission impossible. My sincere admiration for the results. She is a warrior.

Yahoo! is second in search, and not just that... first in News, first in Sports, first in Finance, first in Mail!!, not counting Messenger and all the other properties. I like it when people stand up and fight back, without changing their own nature but giving the correct value to their assets. Yahoo needs people like her.

News/Information: Total Unique Visitors
Yahoo! News 51,192,000
New York Times Digital 44,789,000
CNN 35,119,000

News/Research: Total Unique Visitors
Yahoo! Finance 20,083,000
AOL Money & Finance 14,058,000
MSN Money 11,240,000

Sports: Total Unique Visitors
Yahoo! Sports 24,776,000
ESPN 20,984,000
FOXSports.com on MSN 12,576,000

Email: Total Unique Visitors
Yahoo! Mail 103,755,000
Windows Live Hotmail 47,709,000
AOL Email 38,057,000
Google Gmail 36,649,000

Interesting article about semantic search, invariance (and contravariance?)

Ask.com posted an interesting blog article about semantic search. The key idea is to create a set of semantically equivalent phrases and test whether the search engine results are invariant under the equivalence.

For instance, a query Q1 such as "what is the Obama program?" can be rephrased into a query Q2, "can you describe the salient aspects of the Obama program?". Now, Q1 and Q2 are semantically equivalent for humans. Therefore, the expected search results should be invariant (i.e., the same). Unluckily, current search engines do not show this property in many situations.

I like the idea of invariance, but I suspect we need to define an auxiliary concept of contravariance. The key intuition is that we can have queries such as "Who is Michelle Obama?" and "Where is Michelle Obama?" which are semantically different. Nevertheless, the returned search results can be the same, just because the search engines treat the keywords "who" and "where" as stopwords. I believe you may want to consider those negative examples.

Monday, July 13, 2009

Compute efficiently a^b mod n

An interview question: compute a^b mod n efficiently, in both space and time.
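One classic answer, sketched with square-and-multiply exponentiation (assuming n fits in 32 bits so the intermediate products do not overflow):

// Binary (square-and-multiply) exponentiation: O(log b) multiplications,
// O(1) extra space. For larger n you would need 128-bit products.
#include <cstdint>
#include <iostream>

uint64_t pow_mod(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t result = 1 % n;
    a %= n;
    while (b > 0) {
        if (b & 1) result = (result * a) % n;  // multiply in the current bit
        a = (a * a) % n;                       // square the base
        b >>= 1;
    }
    return result;
}

int main() {
    std::cout << pow_mod(2, 10, 1000) << "\n";  // 24  (1024 mod 1000)
    std::cout << pow_mod(7, 0, 13) << "\n";     // 1
}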

Sunday, July 12, 2009

Bing leapfrogs Yahoo!

Source: TechCrunch. It's also interesting to notice the inverse relation with Google.

Saturday, July 11, 2009

Can you find a prime number?

I am asking you to find a prime number. How would you do it?
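One simple answer, sketched with trial division; for cryptographic sizes you would switch to something like Miller-Rabin:

// Trial division up to sqrt(n), plus a scan upward to find the next prime.
#include <cstdint>
#include <iostream>

bool is_prime(uint64_t n) {
    if (n < 2) return false;
    for (uint64_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

uint64_t next_prime(uint64_t n) {
    while (!is_prime(n)) ++n;
    return n;
}

int main() {
    std::cout << next_prime(1000) << "\n";  // 1009
}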

Friday, July 10, 2009

VC: what is the status? and what is your next cool idea, Mr. Tech?

Interesting reading. I strongly suggest having a look at it.

"5-year VC returns through 2008 were 6 percent compared to 48 percent in 2000"

"VC funds shrank 39 percent to $4.3 billion in the first quarter, from $7.1 billion in the same quarter a year ago"

"But the simple fact remains that unless engineers can come up with new technologies that will spur a new wave of business reinvention, there will be little of real value in which VCs can invest their dwindling capital resources. The few IPOs of VC-backed companies in this decade have been related to consumer technologies that only support one or two companies, rather than an entire business ecosystem."

"Without such a wave of new companies, it will be difficult to support the kind of IPO boom that makes VC pay off for its investors. And without an IPO market, at $11.4 trillion in national debt, the U.S. is just the world's biggest debtor -- and that's not a prescription for economic growth."

C++ Web Graph library for g++ 4.3.2 on Ubuntu (Dis)

Ported the Dis library to g++ 4.3.2 on Ubuntu.

I hate all the recent syntactic C++ changes; at the same time, I love what they bring. Sigh! Porting can be rather bothersome at times!!

Wednesday, July 8, 2009

Fresh Correlations, a valid data source

News and blogs are a valid source for correlating entities, an alternative to query logs.



Google announced an OS

Let's see how different engines are tracking the just-posted story.







Tuesday, July 7, 2009

Dynamic_cast on references

I wonder why people are used to working with dynamic_cast<T*>(p) on pointers, but almost no one thinks about dynamic_cast<T&>(r), i.e., run-time type identification on references.

Do you know how to catch the bad_cast there? lol ;-)
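For the record, a small example of the difference: a failing pointer cast returns nullptr, while a failing reference cast throws std::bad_cast.

#include <iostream>
#include <typeinfo>

struct Base { virtual ~Base() = default; };
struct Derived : Base {};
struct Other : Base {};

int main() {
    Other o;
    Base& ref = o;
    Base* ptr = &o;

    if (dynamic_cast<Derived*>(ptr) == nullptr)
        std::cout << "pointer cast failed: got nullptr\n";

    try {
        Derived& d = dynamic_cast<Derived&>(ref);  // cannot fail "softly"
        (void)d;
    } catch (const std::bad_cast& e) {
        std::cout << "reference cast failed: " << e.what() << "\n";
    }
}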

Monday, July 6, 2009

Generate all the permutations of a given string s.

An interview question: given a string s, generate all its permutations in optimal space and time.
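One possible answer, sketched with std::next_permutation after sorting (lexicographic order, O(1) extra space per step):

#include <algorithm>
#include <iostream>
#include <string>

int main() {
    std::string s = "abc";
    std::sort(s.begin(), s.end());       // start from the smallest permutation
    do {
        std::cout << s << "\n";
    } while (std::next_permutation(s.begin(), s.end()));
}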

Sunday, July 5, 2009

Generate all the substrings of len k, of a given string s

An interview question: generate all the substrings of length k of a given string s, in optimal time and space.
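One possible answer, sketched with a sliding window; there are |s| - k + 1 such substrings, so this is optimal up to the cost of printing them:

#include <iostream>
#include <string>

int main() {
    std::string s = "abcdef";
    std::size_t k = 3;
    if (k <= s.size())
        for (std::size_t i = 0; i + k <= s.size(); ++i)
            std::cout << s.substr(i, k) << "\n";  // abc, bcd, cde, def
}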

Saturday, July 4, 2009

Generate all the anagrams of a string

Interview question: generate all the anagrams of a string. Assume you have a dictionary of words (like an English dictionary). Do it in efficient space and time.
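One possible answer, sketched: index the dictionary by each word's sorted letters, so all anagrams share the same key (a toy dictionary here):

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// The key of a word is its letters in sorted order.
std::string key(std::string w) { std::sort(w.begin(), w.end()); return w; }

int main() {
    std::vector<std::string> dictionary = {"listen", "silent", "enlist", "google", "banana"};

    // One-time pass: group dictionary words by key.
    std::unordered_map<std::string, std::vector<std::string>> by_key;
    for (const auto& w : dictionary) by_key[key(w)].push_back(w);

    // Lookup is just one sort of the query word plus a hash lookup.
    for (const auto& w : by_key[key("silent")])
        std::cout << w << "\n";   // listen, silent, enlist
}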

Friday, July 3, 2009

Map a phone number

Map a phone number (such as 023-254435-13) into all the strings which can potentially be generated with a phone keypad. Please remember that 1 has no mapping, 2 can be mapped into 'a', 'b', or 'c', 3 can be mapped into 'd', 'e', or 'f', and so on. Note that 7 can be mapped into 'p', 'q', 'r', or 's', while 9 can be mapped into 'w', 'x', 'y', or 'z'.
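One possible answer, sketched as a simple recursion over the digits, skipping characters with no mapping (0, 1, and separators like '-'):

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

const std::vector<std::string> kKeys = {
    "", "", "abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"};

void expand(const std::string& number, std::size_t i, std::string& prefix,
            std::vector<std::string>& out) {
    if (i == number.size()) { out.push_back(prefix); return; }
    char c = number[i];
    std::string letters =
        std::isdigit(static_cast<unsigned char>(c)) ? kKeys[c - '0'] : std::string();
    if (letters.empty()) {                 // no mapping: skip the character
        expand(number, i + 1, prefix, out);
        return;
    }
    for (char l : letters) {
        prefix.push_back(l);
        expand(number, i + 1, prefix, out);
        prefix.pop_back();
    }
}

int main() {
    std::vector<std::string> out;
    std::string prefix;
    expand("23", 0, prefix, out);
    for (const auto& s : out) std::cout << s << "\n";  // ad, ae, af, bd, ..., cf
}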

Thursday, July 2, 2009

N cars waiting for pitstop

N cars are waiting for a pit stop at the box. At any time, either a car enters the box or a car exits from the box. Can you generate all the possible combinations of cars inside the box?
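One way to read the question, sketched: enumerate the valid enter/exit sequences, which are exactly the balanced-parentheses strings (Catalan-number many):

// Every '(' is a car entering the box, every ')' a car exiting; no car can
// exit before it has entered.
#include <iostream>
#include <string>

void generate(int n, int open, int close, std::string& seq) {
    if (open == n && close == n) { std::cout << seq << "\n"; return; }
    if (open < n) {                       // one more car enters the box
        seq.push_back('(');
        generate(n, open + 1, close, seq);
        seq.pop_back();
    }
    if (close < open) {                   // one car currently in the box exits
        seq.push_back(')');
        generate(n, open, close + 1, seq);
        seq.pop_back();
    }
}

int main() {
    std::string seq;
    generate(3, 0, 0, seq);  // prints the 5 valid sequences for N = 3
}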

Wednesday, July 1, 2009

Transform a die

You are given a die with 4 faces, which generates the numbers {1, 2, 3, 4} with uniform probability. Generate a number in {1, 2, 3, 4, 5, 6, 7} with uniform probability.
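One classic answer, sketched with rejection sampling: two rolls give a uniform number in 1..16, and rejecting values above 14 leaves 7 equally likely outcomes (about 2.3 rolls expected per draw):

#include <iostream>
#include <random>

std::mt19937 rng(std::random_device{}());

int roll4() {  // the given die: uniform in {1, 2, 3, 4}
    return std::uniform_int_distribution<int>(1, 4)(rng);
}

int roll7() {
    while (true) {
        int v = (roll4() - 1) * 4 + roll4();  // uniform in 1..16
        if (v <= 14) return 1 + (v - 1) % 7;  // 14 values map evenly onto 1..7
    }
}

int main() {
    for (int i = 0; i < 10; ++i) std::cout << roll7() << " ";
    std::cout << "\n";
}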