Antonio Gulli's coding playground: October 2010

Sunday, October 31, 2010

Looking for a new search major paradigm (part I, where we are)

Search is the business of mapping keywords to urls. This paradigm started a couple of years after the invention of the WWW (say 1996-97) thanks to pioneers such as Lycos, Excite, and Altavista. Then, Google came and they understood how to make money out of search. They leveraged ideas coming from Lycos and Overture, but they were able to apply them in a large profitable scale. Also, Google invented Pagerank, a first large scale link based ranking algorithm. 1999 was a memorable moment with a lot of innovations coming at the same time (PageRank, Salsa, Hypersearch, ExpertRank, and so on and so forth). After that moment, I saw just three additional but minor turning points for algorithmic search:

First, what i called the run for the biggest index with relevant search results, with champions such as Fast, Google, Teoma. Crystal Clear Winner: Google

Second, what Google called Universal search and Ask.com called Ask3d and Bing called Decision engine. Search is no more just about ten web blue links, but is an integrated experience with blended multimedia, fresh news, social stuffs, and so on and so forth. Winner: Bing & Google; Special Mention: Ask.com and Yahoo. Surprisingly, enough Ask.com gave up on this field where they were truly innovators.

Third, the adoption of large scale machine learning methodologies for ranking based on query logs and other signals, with champions like Yahoo and large scale adopters such as Bing. Google claims that they are not adopting machine learning for learning to rank, but I frankly doubt it. Winner: Bing.

So you can ask why are you saying the those are minor changes in search? Well, the business is still the same: mapping keywords to urls. I don't see any major switch from that paradigm. Do you ?

Probably it's all a matter of defining what's a major switch. So, let's avoid theories and go on the practical side. Let's learn by examples and analogies. The adoption of mp3 format was a major switch for the music industry. GUIs such as X-Windows, Windows, and MacOS were a major switch for the interaction with computers. The adoption of the Apps model was a major switch for the phone industry. Investment funds were a major switch for the finance industry. Kinetics is a major switch for the game industry (still to be proven, but in my personal opinion it will be huge. Yes, I work for Microsoft but I would be impressed by this technology even if working for Google). Electrical cars will not be a major switch for the automobile industry (still to be proven, but in my personal opinion they will not be).

So, have you seen any major switch for the search industry in the past fifteen years? Surely, you have not.

You may argue: search is a mature business there is no need to innovate there and switch the paradigm. Hmm, I gently but firmly disagree with that point of view. Actually, my argument is the opposite: since search is a mature business there is now an immense need of innovation in the field. Again let's learn by the examples. Computation was a mature business (do you remember about Digital, IBM?) when the visual UI on PCs appeared. Music industry was one of the most established one, when mp3 came and changed that world. Mobile Phones business was a very consolidated one when the App model broke all the already written rules. Game business was all based on devices you can hold with your hands when kinetics came and said: why do you need to hold something just use your hands you have already them with yourself.

Also, if you study the past examples you will notice that it took at least fifteen years to innovate an already established business with new break through ideas. Beside, if you consider the acceleration we had in technologies you would probably agree with me that we are already accumulating a delay in addressing this immense need of innovation for search business.

My question to you: do you have ideas for innovating in the search business and moving away from the mapping keywords to urls paradigm? Be naive, be simple, be stupid, but be creative.

I have a couple of them and will post in the next days.

DISCLAIMER: Like everywhere else in this blog site, I am expressing my personal opinions, aspirations and ideas and not expressing any consideration related to my work with Microsoft.

Saturday, October 30, 2010

Sum of twos sorted arrays

Given two sorted positive integer arrays A and B, let S = {(a,b) | a \in A and b \in B}. Now we want to get the n pairs from S with largest sum. Can we make it in O(n)?

Friday, October 29, 2010

Go and study: trie, patricia tree, succint tree and come back

Nowadays, those are basic tools for IR people

Thursday, October 28, 2010

Compete's Top 50 Sites: Bing Grows 108%

"Bing actually saw the largest growth of any site on the list over the past year, with Ask.com following with a 75% growth over last year. Google and Yahoo, in comparison, stayed relatively even, with 1% and 0.5% losses respectively."

My forecast? next year Facebook would be at 1st.

Wednesday, October 27, 2010

search for sex; search for money

Tuesday, October 26, 2010

Write a C program to sort a stack in ascending order

complexity and space?

Monday, October 25, 2010

Given a string S find all the distinct substrings

complexity and space

Sunday, October 24, 2010

Find substrings or words

Given a large string of characters, Sbig, and a a small set of characters, Ssmall, Find smallest substring of Sbig which contains all characters in Ssmall. What is the difference if you want define words as \s+ separated tokens and you want to find the smallest substring of Sbig which contains all the words in Ssmall?

Saturday, October 23, 2010

Center of a tree

Given a tree T, the center node is the midpoint of the diameter of the three. The diameter is the largest distance between any two nodes. Find the center node.

Friday, October 22, 2010

Matrix again

Given a matrix sorted row-wise and column-wise and a value X. Can we find A[i1][j1] and A[i2][j2] such that there sum = X. What is the complexity?

Thursday, October 21, 2010

estimate the questions

A website has n questions. If you're shown one random question per click, how many clicks on average takes to see all questions.

Wednesday, October 20, 2010

Given a sorted array find two elements with sum = c

in linear time

Tuesday, October 19, 2010

too many names in common

you are given two files each one containing a huge list of names, one per line. Say you have more than 1 trillion of lines per file. How to find the names that are in common?

Monday, October 18, 2010

Search for space

The Number Resource Organization (NRO) announced today that less than five percent of the world’s IPv4 addresses remain unallocate

Sunday, October 17, 2010

subarray sum k

find the subarray whose sum is k in an array of negative and positive numbers

Saturday, October 16, 2010

External memory : how to get the top elements in an array

Given a set of 1 Trillion integers on external memory, find the smallest 1 million of them. You can fit at most 1 million integers in memory at a time. State the fastest solution you can think of. Can you use this strategy to sort 1 trillion of integers in external memory?

Friday, October 15, 2010

Stock Trends: IAC is close to Apple

Thursday, October 14, 2010

Google beats the expectation

Really great results, mr G.!

Wednesday, October 13, 2010

Concepts in modern C++

Interesting article from DrDobs

enable_if is quite simple in and of itself -- if the first template parameter (which must be a boolean constant expression) evaluates to true then the nested "type" member is a typedef to the second template parameter. If the the first parameter evaluates to False then there is no nested "type" member.

The SFINAE rules mean that if enable_if::type is used in the signature of a function template (as here), then that function overload is discarded if some-expression is False -- that overload is only enabled if some-expression is True.

Tuesday, October 12, 2010

Another maximum length subsequence variant

Find the maximum length subsequence in the given array that contains only 0's and 1's elements and condition is that number of 1's equal to the number of 0's. Can you solve it in linear time?

Monday, October 11, 2010

Stock Markets: Human and Algorithms failures

Recently, I started to be interested in financial markets catastrophic mistakes produced by humans or determined by algorithms.

On 2001, a trader of UBS Warburg meant to sell 16 shares of a company at 600,000 yen. Instead, he submitted an order for selling 610,000 shares at 6 causing a catastrophic failure with an estimated cost of 100 million of dollars. This is just the start, since things become more and more interesting when machine learning algorithms and artificial intelligence are used. ML and AI can turn a catastrophic failure into an epochal failure.

On 1987, October 17 a new algorithmic mechanism for automatic selling shares below a certain ground price has been put in production. October 17 was Monday and the new system introduced an un-expected latency which left some un-processed orders from the past week in the back-log queue. Guess what happened when all of them were processed all together? The price dropped instantaneously and the automatic mechanism for selling had an huge amplification effect. October 17, 1987 become a Black Monday with a 22% of loss in one single day, the worst day since 1929.

Today, we live in the Flash trade epoch. Humans are buying stocks aiming at keeping them for days, weeks, month or years. Flash trading algorithms have a different approach. The idea is to buy a stock and selling it after few micro-seconds. Talking about fast processing and investing on a long term basis. You don't care about what happens in two hours, you just try to predict what happens in a couple of micro-second away. Note that those algorithms can sometime see orders before they are processed, and this can probably make the prediction easier. Flash trading is taking about 25% of London Stock Exchange transactions.

A couple of weeks ago, US authorities published a report saying that on May 6 2010 the market suddenly dropped of 10% generating 1 trillion $ of loss (that is 1000 billions, wow!). Why? a single mistake! One order of selling 4.1 billion dollars linked to a future contract and submitted by a Kansans medium-size firm was accepted with no lower bound selling price limit, by mistake.

No one bought the contract and all the flash trading algorithms started to sell all the stocks all together in an amazing amplifying effect.

Did i say that I love Machine learning and Artificial intelligence?

Sunday, October 10, 2010

Shipping a feature in multiple markets and multiple languages

Saturday, October 9, 2010

Serialize a tree

Given a generic tree, serialize it.

Friday, October 8, 2010

Re-org: Live Labs absorbed in Bing

And Gary Flake is not joining. The best result they produced is probably the amazing Photosynth.

Thursday, October 7, 2010

Array sorted by frequency

You are given a large array of integers, output them sorted by frequency. What if the array is too large to fit in memory?

Wednesday, October 6, 2010

Shipping a feature in multiple markets

Tuesday, October 5, 2010

Some days are containing months and months within. Thanks

Monday, October 4, 2010

Blood donors

If you pay the blood donors, the donations will increase or decrease?

Sunday, October 3, 2010

The Surprising Truth About What Motivates Us

Jim suggested reading "The Surprising Truth About What Motivates Us" . The book claims that Self-organization is a result of motivation. A couple of important criteria are considered:

Autonomy
Back in 1945, 3M pioneered the idea of giving free time to engineers. As a consequence, 3M obtained a bunch of innovative products including the "Post-It" stickers. 3M credo is "hire good people and leave them alone". They introduced the idea of giving free 15% time per week to each employeer but maintain the IP of the resulting inventions. Then, this approach has been championed by Google.

ROWE
Companies such as BestBuy adopted a Result Oriented Working Enviroment (ROWE), where suprisingly enough there is no Office Time at all. Everyone can self-schedule his time and can work from home when she decides that this is appropriate.

My opinion: Self-Organizing or Hierarchical?
In my opinion, there are several key advantages for self-organization:

"Self-correcting and localizing what is wrong": mistakes or bad decisions(™), either technical or management related, are self-corrected by peers via intense discussions. In this sense, anything inherently "wrong" is confined into a local state and the system is agile.
"Selft-Organizing means Dynamic Evolution of the Hierarchy (Deh!), not Chaotic": Let's change a classical wrong assumption. A self-organizing system is not chaotic. A self-organizing system is one where a hierarchy is naturally emergining due to merits and competencies. In other words, the hierarchy is dynamically changing according to different needs that will change over the time. "If I am you boos, this is not meaning that I am more expert than you" (™)

On the other hand, a static hierarchical model with decisions taken top down with no self-corrections are bad since they percolate good and bad decisions by using the same channels and there is no effective way to discriminate between them.

Saturday, October 2, 2010

Large Scale Tree Data Structures: Reading a fascinating paper on succinct data structure

Giuseppe pushed me to investigate one area where I had rather old knowledge. During the past 3 years, we had an amazing progress in representing succinct data structure. Paolo and Roberto, please forgive my ignorance about this specific area. I am just a new apprentice.

We are all familiar with Trees, and we assume that pointers are an required evil needed for encoding the structure of a tree. Anyway, why do you need to spend 64bit pointers, when you can represent the structure with just few bits? Those ideas are not just theory, they are used in real implementation of Large Scale Tree data Structures that needs to fit in memory.

Let's start from this example taken from the paper Representing Trees of Higher Degree (E. Demain is one of the authors and his studies about Khipu are fascinating, but this is another story...)

An ordinal tree T can be represented as a bit array. The degree of each tree (number of sons) is encoded with a sequence of 1 terminated by a 0. All the nodes are represented with a level degree visit of T (commas are used just for helping the reader but we don't need to represent them). This representation is known as Level Ordered Unary Degree Sequence (LOUDS). Using auxiliary data structures for Rank and Select operations, LOUDS supports in constant time, finding the parent, the ith child, and the degree of a node.

The number of 0 in the binary array is the number of node in the three.
Each node is son of another node and therefore has a 1 representing it (we need to find which 1 is the correct one)
As a consequence, we need at least a space of 2n, plus something for navigation

Now, we need to introduce two operations which can be computed in constant time with additional 0(n) space (see the paper).

Rank1(j) returns the # of 1 up and including position j. Rank0(j) is a similar operation for 0
Select1(j) returns the position of of the j-th 1. Select0(j) is a similar operation for 0

Given the two above operations we can navigate the encoded representation of the tree. For instance:

Compute the degree of a node in position p. This is the number of 1 up to the next 0 and can be computed as Select0(rank0(p)+1) - p
Compute the parent of a three as in Select1(rank0(p)+1)
Compute the ith children of a tree;

Et voila. No more costly pointers, just bit-vectors and some additional space. In total no more than 2n + o (n). Looks like magic, but it is not. Think about an heap where you represents a binary three with no pointers. LOUDS are using a similar idea but they are more generic.

Another magic representation is based on balanced parenthesis and it is due to Munro and Raman. Suppose you visit the three depth first and open a parenthesis on the way down, and close the parenthesis on the way up. The ordinal tree is encoded again ad a bitvector but now we have the additional benefit that the nodes of a subtree are very close in the bitvector. This is a consequence of the depth first visit. It is possible to navigate the three with Rank & Select operations.

Benoit, Demain, Munro, Raman proposed a variant of LOUDS based on depth first named Depth First Unary Degree Sequence (DFUDS), which preserves locality and still allows tree navigation in terms of Rank & Select. In addition, they proposed further optimization for finding the ith children of a node, if the tree is a trie with k outdegree.

More information about this fascinating word is available in Succinct Trees in Practice, 2010

Friday, October 1, 2010

More Fired UP then ever

Source: KARA

Yahoos,

Carol from the foxhole here. No, actually I’m in Atlanta to give a speech later tonight, and doing some client meetings.

But I’m sure you’ve seen the rumors about Hilary Schneider, David Ko, and Jimmy Pitaro leaving. Yes, they’re leaving, but each for different reasons that suit their life–and Hilary will send a separate note to her team with more about the Americas org. later today.

After nearly four years with Yahoo!, Hilary has decided to move on to the next phase in her career. Hilary has played a major role at Yahoo! in driving our strategies for content, online advertising and more. I want to thank her for her contributions over the years, and wish her the best as she moves on.

We expect to announce the new head of the Americas region before the end of the year. There is a lot of interest in joining our team for such a key position and, in the meantime, Hilary will stay on to help with the transition.

Now, you all know we’ve been intensely focused on improving our operations and top line growth. And we’ve seen good progress. But there’s a lot more to do, and as we transition to a new leader for the Americas we’re taking this opportunity to double down on these efforts. Hilary’s successor will clearly be maniacally focused on growing our top line in the region.

So everyone stay calm–we have a good plan in place. In fact, I’m more fired up than ever and can roll with the punches. Yahoo! is a great place.

Carol

I forecasted an acquisition, Apparently is partnership for Facebook & Skype

Couple of weeks ago, I was suggesting / wondering "Should Facebook buy Skype?". Apparently, a huge deal is coming.