Friday, December 31, 2010

Divide in two equal parts

You are given an array of positive integers, and you have to find a split point such that the two parts have equal sum

Thursday, December 30, 2010

Longest increasing sub-sequence

This is a well know algorithm, but it is useful to know for interviews

Wednesday, December 29, 2010

Subsequences

Given a sequence of integers, find a continuous subsequence which maximizes the sum of its elements

Tuesday, December 28, 2010

DP bracket problem

Given an increasing sequence of integers s1, ...., sk how many balanced parenthesis strings can you generate?

Monday, December 27, 2010

Count the number of inversions in an array

Given an array A unsorted, count the number of inversions.

Friday, December 24, 2010

Nesting envelopes

Given k different types of envelopes. The dimensions of envelope type i are xi × yi. You can place envelope A inside envelope B if and only if the dimensions A are
strictly smaller than the dimensions of B. how many possible combination you can have?

Thursday, December 23, 2010

Wednesday, December 22, 2010

Maximize the number of items you collect

This is a classical DP problem. A table composed of N*M cells with certain items non zero. Start from (0, 0) and you can go down or right one cell. Design an algorithm to find the maximum number of non zero items you can collect, if you are moving from upper-left corner to bottom-right corner.

What is the complexity and the space needed?

Tuesday, December 21, 2010

Launched News in UK, and FR


Another year, another news product. Optimized local rankers, classifiers and clustering. Oh yes, clustering my dear.

Monday, December 20, 2010

Are you the average of your friend actions?

A lot of people are enthusiastic about the benefits of social signals for selecting and ranking information. My question for you is: do you believe that you are well represented by the average actions of your most active social friends? or do you also have differences and a personal identity?

Sunday, December 19, 2010

Related Searches shipped in 8 countries


What an amazing ride. We shipped 8 countries in less than one year with localized filtering, aggregation, and ranking. Italy is part of the list (yeah!!!)


More pictures here

Saturday, December 18, 2010

Maximize and expression with no parenthesis

Given an expression E of n numbers with *, + operations and no precedence among the operators, identify the set of parenthesis which maximize the results for E.

Friday, December 17, 2010

Autosuggest shipped in 5 countries


What an amazing ride. We shipped 5 countries in less than one year with localized filtering, aggregation, and ranking.

More pictures here

Thursday, December 16, 2010

Encode and Decode a document

A 'Document' is defined as a collection of 'Fields'. Some fields can be optional. A 'Field' can be a first-order type (int, char, float), or an array of first order types.

Write the code to serialize and de-serialize a collection of 'Documents'

Wednesday, December 15, 2010

Hive, SQL like, Facebook and the map join optimization

Hive is the SQL-like processor on the top of Hadoop, the open source Map-Reduce implementation. Facebook recently proposed an interesting optimization for the sort and merge phase adopted by the Reducer: "The motivation of map join is to save the shuffle and reduce stages and do the join work only in the map stage. By doing so, when one of the join tables is small enough to fit into the memory, all the mappers can hold the data in memory and do the join work there. So all the join operations can be finished in the map stage. However there are some scaling problems with this type of map join. When thousands of mappers read the small join table from the Hadoop Distributed File System (HDFS) into memory at the same time, the join table easily becomes the performance bottleneck, causing the mappers to time out during the read operation".

This is the optimization: "Hive-1641 solves this scaling problem. The basic idea of optimization is to create a new MapReduce local task just before the original join MapReduce task. This new task reads the small table data from HDFS to an in-memory hash table. After reading, it serializes the in-memory hash table into a hashtable file. In the next stage, when the MapReduce task is launching, it uploads this hashtable file to the Hadoop distributed cache, which populates these files to each mapper's local disk. So all the mappers can load this persistent hashtable file back into memory and do the join work as before."

So they are doing some pre-computation before the join operation itself for computing an hash table which is then pre-loaded in the memory of the local mapper. The performance gains are amazing since optimized map join is 12 to 26 times faster than the previous one.

I wonder how a fast compression scheme could help here. Paolo?

Tuesday, December 14, 2010

K-th sum for two sorted arrays

You are given 2 arrays A, B sorted in decreasing order of size m and n respectively and a number k, such that 1 <= k <= m*n. You need to fine the kth largest sum(a+b) possible where a \in A, and b \in B.

Hint: i like this problem, a possible solution is by induction and the complexity is O(n+m).

Monday, December 13, 2010

Interesting google talk

Available here, with more information about current G's index compression, spanner project, and many other features.

Sunday, December 12, 2010

Out of band: Red power led on dg834g

My router dg834g decided to die on Sunday. It died in a strage way. When turned on the power led remains red. Then, the wireless router starts the power on-cycle but it goes in an infinite reboot cycle. Anyway, sometime it simply starts correctly.

It turns out that the problem was with power supply unit. I just replaced it with an equivalent one.

Saturday, December 11, 2010

Out of band: vnc connection to server behind a dg834g nat

I wanted to use my dg834g for accessing my servers behind a nat with dynamic ip.

The solution was to use dynamic dns (supported natively by the router), and ultravnc with security encryption. The dynamic ip allows to have a unique name, regardless of the dynamic ip assigned to the router. Then, I needed a way to reach the vnc severs. My solution was to a assign static ip to each server in my lan (instead of relaying on dhcp), and to have different range of vnc ports for each server (from server property page). Finally, i setup a static port forwarding set of rules for those ports (for each service configure a "service" in router's content filtering, and then open the "service" in the router's firewall and send to a specific lan server).

And voila, now my servers are accessible with 2048-bit RSA keys and 256-bit AES keys security encryption.

Friday, December 10, 2010

Out of band: vpn with a dg834g router

I wanted to have a native vpn with my dg834g router. This is the configuration. Then you can adopt a client, such as the greenbow


Facebook data store

I used this in my interviews, now not anymore. Device an infrastructure for storing data in Facebook. There are 500M users and let's assume the average number of friends is 100. Each user can post a message, or can read all the fresh messages produced by her friends.

Thursday, December 9, 2010

Google's Dremel


Greg pointed out this interesting G paper "Dremel: Interactive Analysis of Web-Scale Datasets" (PDF)"

"By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing."

The key idea is very simple, but the implementation is complex: "our goal is to store all values of a given field consecutively to improve retrieval efficiency". This is the so called columnar vs. record-oriented storage.

Then, SQL relational operations are implemented by using the following intuition: "Think of a nested record as a labeled tree, where each label corresponds to a field name. The selection operator prunes away the branches of the tree that do not satisfy the specified conditions." .. "Dremel uses a multi-level serving tree to execute queries (see Figure 7). A root server receives incoming queries, reads metadata from the tables, and routes the queries to
the next level in the serving tree."

Dremel scans quadrillions of records per month.

Wednesday, December 8, 2010

Segments with points

You are given a blue segment S of length n. There are n points p_i (0 <= i < n) distributed uniformly at random. You mark in red all the segments connecting points p_i and p_j what would be the predominant colour in S after that?

Tuesday, December 7, 2010

RTTI

I was using this in my interviews. Now no longer: "What is RTTI?"

Monday, December 6, 2010

Subarray sorted

Given an unsorted array arr[0..n-1] of size n, find the minimum length subarray arr[s..e] such that sorting this subarray makes the whole array sorted.

Sunday, December 5, 2010

Numbers (by the way of Geofffrey Dean)



A nanosecond (ns) is one billionth of a second (10-9 s

Saturday, December 4, 2010

Facebook friend suggestions

I used this in my past interviews: now not any longer. Devise an algorithm for suggesting friends in Facebook.

Friday, December 3, 2010

Queries and time

You are given a log of queries annotated with a time-stamp, where a single entry is . Define a data-structure for efficiently identify the top k most frequent queries in the time interval [t0, t0 + \tau]

Thursday, December 2, 2010

Celebrity twitters

In a n twitter set, a celebrity is defined as someone who is followed by everyone, but who follows no one. You need to identify the celebrities by asking to twitters the question: "do you know person x?" The answer will be yes/no. Find the celebrity by asking only O(n) questions.

Hint: how many information can you infer with one question?

Wednesday, December 1, 2010

Ants in a row

You have a set of ants moving on an horizontal segment of length n. Each ant can randomically move either right or left. When an ant takes a direction, she stays in that direction until she either drops out of the segment or it bumps with another ant. In this case, the ant changes her direction. Can you formalize the system and describe what will happen after t iterations?

(thanks giuseppe, muthu)

Tuesday, November 30, 2010

A classical one: find the element which appears 5/8 of times

Given an Array A, with very large size N find the element which appears 5/8 of times.

(this is a classical must solve// know problem)

Monday, November 29, 2010

Out of Band but interestingç "Microsoft Gives the Cloud to Scientists"

http://bits.blogs.nytimes.com/2010/11/17/microsoft-gives-the-cloud-to-scientists/?src=tptw

"Microsoft has hit on a direct path to university researchers’ hearts and minds: give them free tools and easy access to huge data sets.

The software maker has started grafting popular scientific databases and analysis tools onto its Windows Azure cloud computing service. This basically means that researchers in various fields can get access to a fast supercomputer of their very own and pose queries to enormous data sets that Microsoft keeps up to date. For the time being, Microsoft will allow some research groups to perform their work free, while others will have to rent calculation time on Azure via a credit card."

Disclaimer: I work for Microsoft

Sunday, November 28, 2010

A classical one: a triangle and a point

Given a triangle ABC and a point P, device an algorithm for understanding whether P is inside or outside the triangle.

PS: sometime you forgot old geometry, but this problem is simply a generalization of a point P and a segment S. Is the point above or below S?

Thursday, November 25, 2010

Friendship and Polymorphism

I was using this for my interviews. Now no more: Can you explain the use of friendship and polymorphism?

Monday, November 22, 2010

Exchange coins

A classical DP problem: you have a set of coins in your currency. If one coin has a value of n, i will exchange it with coins in my currency of value n/2, n/3, n/4. What is the best exchange strategy for you?

Sunday, November 21, 2010

Serialization and deserialization

Google has a nice alternative to boost::serialize. Those ProtocolBuffers are a good way of marshalling data structures over the wire.

Saturday, November 20, 2010

Friday, November 19, 2010

Everyone is allied to someone

Yahoo made an alliance with Bing, Ask.com apparently made an alliance with Google, and now Myspace made an alliance with Facebook. Also, Bing is allied with Facebook. Isn't that what they call market consolidation?

Wednesday, November 17, 2010

Arrays comparison. don't always sort them.

Two arrays are given A[n] and B[n+1]. Array A contains n integers and B contains n+1 integers of which n are same as in array B but in different order and one extra element x. Find x.

Tuesday, November 16, 2010

Facebook is opening read-only the Message service

The OpenGraph API for google is becoming richer. They just launched the new Message service and decided to open read only the Message inbox via an API. So what products do you believe we need to develop over this API?



Interestingly enough, Message works on the top of HBase

Monday, November 15, 2010

Looking for a new ads major paradigm shift (part I, where we are)

Search ads is the business of selling paid links, together with a short description of the destination sites. The price of the ads is decided by means of an auction where advertisers bid for position and pay just enough to beat their runner-up. When the user submits a query, a set of related paid ads are served together with the search results, and the bidder pays a cost if the ads are clicked (pay per click). For Google, this business has a value of about 5.8billions per quarter. Bing has a proportional business value. Probably, you are thinking.. I already know all this stuff, what is new? Nothing is new, I just want to analyze the above scenario under a different perspective.

We have hundreds of millions of users producing zillions of web pages, blogs, real time updates, and so on and so forth. Then, search engines index this content and serve search results, when users submit relevant queries. In addition, search engines serve commercial ads together with the indexed content. Now, why the content producers receive no money for their content? After all, if there no content then there will be no index, no search results and no commercial ads.

You may say ... well your analysis is partial since commercial ads are not the only way to monetize the content. Another element to consider is the traffic sent to the content producer by the search. Fair enough. This observation is certainly pertinent. However, receiving traffic is very much important for commercial sites, but is not necessarily the same of getting direct money.

For instance, say that I love the music of the 80s. This is my hobby, and so I write about it. I would be interested in receiving some traffic for content produced for this hobby. Anyway, I would be more happy to get some direct money anytime that my content is accessed by using a search engine. After all, the search engine has probably served some commercial ads together with the index built on the top of my content (and on the top of the content written by other similar content producers).

Nowadays, A similar mechanism is missing .. and I am not sure why? Would you consider it a new ads major paradigm shift? And could it be a pair with a new search major paradigm ?

Sunday, November 14, 2010

Friday, November 12, 2010

Personalized Facebook Autosuggest. I like it.

Facebook is quietly doing an amazing work with Autosuggest. Until a couple of days ago, Autosuggest was a method to find People. Now it’s a true personalized search tool where Facebook is injecting articles posted by your friends (see below on the left), Wikipedia pages (see 2nd result below on the right), question and answers (see 3rd result), Facebook pages (4th result). I believe that the experience with Facebook is potentially more exciting that the one exposed in Google Instant. Apparently, they are trying to move away from the SERP paradigm.



Update: they also have Web pages where users expressed a "Like it". So they are effectively crawling them (or at least leveraging the "Like" anchor). Is Facebook starting to become a true search engine?

Thursday, November 11, 2010

Bing, if you want to play a role in algorithmic search

TO: my ex-colleagues in Ask.com and my friends in Yahoo!. Bing is hiring.

There are just two players with algoritmic search out of there. Search follows fractal geometries. You can take advantage of your experiences and, by learning and generalizing, you can anticipate and recognize problems with similarities :)

Wednesday, November 10, 2010

Stock had a 40% increase last year, IAC's Barry Diller Surrenders to Google, Ends Ask.com's Search Effort

IAC's Barry Diller Surrenders to Google, Ends Ask.com's Search Effort. A first glance, this is strange for a company where "Search represented nearly half of revenues ($205 million). The search business grew 20 percent, goosed primarily by a 55 percent increase in active toolbars to 97 million. IAC’s toolbar business is its secret distribution weapon, but those searches tend to generate lower revenue per query than those on Ask.com, which itself is still growing and is now ranked as the sixth largest website in the U.S”

Tuesday, November 9, 2010

Repeating numbers

All numbers in an array repeat twice except two numbers which repeat only once. All the numbers are placed randomly. Find out efficiently the two numbers that have not repeated with O(1) extra memory.

Monday, November 8, 2010

Find the min difference among two numbers in an Array

Given an array of n integers A, find the minimum difference between two numbers in A. It's easy to make it in O(n log n). Can you make it in O(n)?

Sunday, November 7, 2010

Store the file and save space

A file contains a number of repeated but not consecutive zeros. Can you read and store the file such that you save space, but you still can reconstruct the original file? Please C++ code.

(PS: this was an old interview coding question, which I am not going to use anymore)

Saturday, November 6, 2010

Database of queries and similarity

You are given a database of queries Q, with 100 millions of queries. Given a query q return all the queries that have a similarity of 50%. Make your own assumptions and write some C++ code for that.

Friday, November 5, 2010

sharing lists.

Given 2 linked lists find intersection point(s) in the form of X (so the lists share only one element), or Y (so the lists share more than one element). How many other shapes can you have?

Thursday, November 4, 2010

Find two numbers in an unsorted array

Given an unsorted array A find two numbers a, b in A such that a - b = x. In linear time!

Wednesday, November 3, 2010

Alice and Bob play with an array

Alice and Bob play with an array of integers. In turn, each one can pop a number from one of the two ends of the array. What strategy can be used to maximize the sum of popped integers for Alice?

Tuesday, November 2, 2010

News Personalization

Greg, do you mind to comment on this one?
I am really interested in your opinion
http://www.projectemporia.com/

Sunday, October 31, 2010

Looking for a new search major paradigm (part I, where we are)

Search is the business of mapping keywords to urls. This paradigm started a couple of years after the invention of the WWW (say 1996-97) thanks to pioneers such as Lycos, Excite, and Altavista. Then, Google came and they understood how to make money out of search. They leveraged ideas coming from Lycos and Overture, but they were able to apply them in a large profitable scale. Also, Google invented Pagerank, a first large scale link based ranking algorithm. 1999 was a memorable moment with a lot of innovations coming at the same time (PageRank, Salsa, Hypersearch, ExpertRank, and so on and so forth). After that moment, I saw just three additional but minor turning points for algorithmic search:

First, what i called the run for the biggest index with relevant search results, with champions such as Fast, Google, Teoma. Crystal Clear Winner: Google

Second, what Google called Universal search and Ask.com called Ask3d and Bing called Decision engine. Search is no more just about ten web blue links, but is an integrated experience with blended multimedia, fresh news, social stuffs, and so on and so forth. Winner: Bing & Google; Special Mention: Ask.com and Yahoo. Surprisingly, enough Ask.com gave up on this field where they were truly innovators.

Third, the adoption of large scale machine learning methodologies for ranking based on query logs and other signals, with champions like Yahoo and large scale adopters such as Bing. Google claims that they are not adopting machine learning for learning to rank, but I frankly doubt it. Winner: Bing.

So you can ask why are you saying the those are minor changes in search? Well, the business is still the same: mapping keywords to urls. I don't see any major switch from that paradigm. Do you ?

Probably it's all a matter of defining what's a major switch. So, let's avoid theories and go on the practical side. Let's learn by examples and analogies. The adoption of mp3 format was a major switch for the music industry. GUIs such as X-Windows, Windows, and MacOS were a major switch for the interaction with computers. The adoption of the Apps model was a major switch for the phone industry. Investment funds were a major switch for the finance industry. Kinetics is a major switch for the game industry (still to be proven, but in my personal opinion it will be huge. Yes, I work for Microsoft but I would be impressed by this technology even if working for Google). Electrical cars will not be a major switch for the automobile industry (still to be proven, but in my personal opinion they will not be).

So, have you seen any major switch for the search industry in the past fifteen years? Surely, you have not.

You may argue: search is a mature business there is no need to innovate there and switch the paradigm. Hmm, I gently but firmly disagree with that point of view. Actually, my argument is the opposite: since search is a mature business there is now an immense need of innovation in the field. Again let's learn by the examples. Computation was a mature business (do you remember about Digital, IBM?) when the visual UI on PCs appeared. Music industry was one of the most established one, when mp3 came and changed that world. Mobile Phones business was a very consolidated one when the App model broke all the already written rules. Game business was all based on devices you can hold with your hands when kinetics came and said: why do you need to hold something just use your hands you have already them with yourself.

Also, if you study the past examples you will notice that it took at least fifteen years to innovate an already established business with new break through ideas. Beside, if you consider the acceleration we had in technologies you would probably agree with me that we are already accumulating a delay in addressing this immense need of innovation for search business.

My question to you: do you have ideas for innovating in the search business and moving away from the mapping keywords to urls paradigm? Be naive, be simple, be stupid, but be creative.

I have a couple of them and will post in the next days.

DISCLAIMER: Like everywhere else in this blog site, I am expressing my personal opinions, aspirations and ideas and not expressing any consideration related to my work with Microsoft.

Saturday, October 30, 2010

Sum of twos sorted arrays

Given two sorted positive integer arrays A and B, let S = {(a,b) | a \in A and b \in B}. Now we want to get the n pairs from S with largest sum. Can we make it in O(n)?

Friday, October 29, 2010

Thursday, October 28, 2010

Compete's Top 50 Sites: Bing Grows 108%

"Bing actually saw the largest growth of any site on the list over the past year, with Ask.com following with a 75% growth over last year. Google and Yahoo, in comparison, stayed relatively even, with 1% and 0.5% losses respectively."



My forecast? next year Facebook would be at 1st.

Sunday, October 24, 2010

Find substrings or words

Given a large string of characters, Sbig, and a a small set of characters, Ssmall, Find smallest substring of Sbig which contains all characters in Ssmall. What is the difference if you want define words as \s+ separated tokens and you want to find the smallest substring of Sbig which contains all the words in Ssmall?

Saturday, October 23, 2010

Center of a tree

Given a tree T, the center node is the midpoint of the diameter of the three. The diameter is the largest distance between any two nodes. Find the center node.

Friday, October 22, 2010

Matrix again

Given a matrix sorted row-wise and column-wise and a value X. Can we find A[i1][j1] and A[i2][j2] such that there sum = X. What is the complexity?

Thursday, October 21, 2010

estimate the questions

A website has n questions. If you're shown one random question per click, how many clicks on average takes to see all questions.

Tuesday, October 19, 2010

too many names in common

you are given two files each one containing a huge list of names, one per line. Say you have more than 1 trillion of lines per file. How to find the names that are in common?

Monday, October 18, 2010

Search for space

The Number Resource Organization (NRO) announced today that less than five percent of the world’s IPv4 addresses remain unallocate

Sunday, October 17, 2010

subarray sum k

find the subarray whose sum is k in an array of negative and positive numbers

Saturday, October 16, 2010

External memory : how to get the top elements in an array

Given a set of 1 Trillion integers on external memory, find the smallest 1 million of them. You can fit at most 1 million integers in memory at a time. State the fastest solution you can think of. Can you use this strategy to sort 1 trillion of integers in external memory?

Wednesday, October 13, 2010

Concepts in modern C++

Interesting article from DrDobs

enable_if is quite simple in and of itself -- if the first template parameter (which must be a boolean constant expression) evaluates to true then the nested "type" member is a typedef to the second template parameter. If the the first parameter evaluates to False then there is no nested "type" member.

The SFINAE rules mean that if enable_if::type is used in the signature of a function template (as here), then that function overload is discarded if some-expression is False -- that overload is only enabled if some-expression is True.

Tuesday, October 12, 2010

Another maximum length subsequence variant

Find the maximum length subsequence in the given array that contains only 0's and 1's elements and condition is that number of 1's equal to the number of 0's. Can you solve it in linear time?

Monday, October 11, 2010

Stock Markets: Human and Algorithms failures

Recently, I started to be interested in financial markets catastrophic mistakes produced by humans or determined by algorithms.

On 2001, a trader of UBS Warburg meant to sell 16 shares of a company at 600,000 yen. Instead, he submitted an order for selling 610,000 shares at 6 causing a catastrophic failure with an estimated cost of 100 million of dollars. This is just the start, since things become more and more interesting when machine learning algorithms and artificial intelligence are used. ML and AI can turn a catastrophic failure into an epochal failure.

On 1987, October 17 a new algorithmic mechanism for automatic selling shares below a certain ground price has been put in production. October 17 was Monday and the new system introduced an un-expected latency which left some un-processed orders from the past week in the back-log queue. Guess what happened when all of them were processed all together? The price dropped instantaneously and the automatic mechanism for selling had an huge amplification effect. October 17, 1987 become a Black Monday with a 22% of loss in one single day, the worst day since 1929.

Today, we live in the Flash trade epoch. Humans are buying stocks aiming at keeping them for days, weeks, month or years. Flash trading algorithms have a different approach. The idea is to buy a stock and selling it after few micro-seconds. Talking about fast processing and investing on a long term basis. You don't care about what happens in two hours, you just try to predict what happens in a couple of micro-second away. Note that those algorithms can sometime see orders before they are processed, and this can probably make the prediction easier. Flash trading is taking about 25% of London Stock Exchange transactions.

A couple of weeks ago, US authorities published a report saying that on May 6 2010 the market suddenly dropped of 10% generating 1 trillion $ of loss (that is 1000 billions, wow!). Why? a single mistake! One order of selling 4.1 billion dollars linked to a future contract and submitted by a Kansans medium-size firm was accepted with no lower bound selling price limit, by mistake.

No one bought the contract and all the flash trading algorithms started to sell all the stocks all together in an amazing amplifying effect.

Did i say that I love Machine learning and Artificial intelligence?

Saturday, October 9, 2010

Friday, October 8, 2010

Re-org: Live Labs absorbed in Bing

And Gary Flake is not joining. The best result they produced is probably the amazing Photosynth.

Thursday, October 7, 2010

Array sorted by frequency

You are given a large array of integers, output them sorted by frequency. What if the array is too large to fit in memory?

Monday, October 4, 2010

Blood donors

If you pay the blood donors, the donations will increase or decrease?

Sunday, October 3, 2010

The Surprising Truth About What Motivates Us

Jim suggested reading "The Surprising Truth About What Motivates Us" . The book claims that Self-organization is a result of motivation. A couple of important criteria are considered:

Autonomy
Back in 1945, 3M pioneered the idea of giving free time to engineers. As a consequence, 3M obtained a bunch of innovative products including the "Post-It" stickers. 3M credo is "hire good people and leave them alone". They introduced the idea of giving free 15% time per week to each employeer but maintain the IP of the resulting inventions. Then, this approach has been championed by Google.

ROWE
Companies such as BestBuy adopted a Result Oriented Working Enviroment (ROWE), where suprisingly enough there is no Office Time at all. Everyone can self-schedule his time and can work from home when she decides that this is appropriate.

My opinion: Self-Organizing or Hierarchical?
In my opinion, there are several key advantages for self-organization:
  1. "Self-correcting and localizing what is wrong": mistakes or bad decisions(™), either technical or management related, are self-corrected by peers via intense discussions. In this sense, anything inherently "wrong" is confined into a local state and the system is agile.
  2. "Selft-Organizing means Dynamic Evolution of the Hierarchy (Deh!), not Chaotic": Let's change a classical wrong assumption. A self-organizing system is not chaotic. A self-organizing system is one where a hierarchy is naturally emergining due to merits and competencies. In other words, the hierarchy is dynamically changing according to different needs that will change over the time. "If I am you boos, this is not meaning that I am more expert than you" (™)
On the other hand, a static hierarchical model with decisions taken top down with no self-corrections are bad since they percolate good and bad decisions by using the same channels and there is no effective way to discriminate between them.

Saturday, October 2, 2010

Large Scale Tree Data Structures: Reading a fascinating paper on succinct data structure

Giuseppe pushed me to investigate one area where I had rather old knowledge. During the past 3 years, we had an amazing progress in representing succinct data structure. Paolo and Roberto, please forgive my ignorance about this specific area. I am just a new apprentice.

We are all familiar with Trees, and we assume that pointers are an required evil needed for encoding the structure of a tree. Anyway, why do you need to spend 64bit pointers, when you can represent the structure with just few bits? Those ideas are not just theory, they are used in real implementation of Large Scale Tree data Structures that needs to fit in memory.

Let's start from this example taken from the paper Representing Trees of Higher Degree (E. Demain is one of the authors and his studies about Khipu are fascinating, but this is another story...)




An ordinal tree T can be represented as a bit array. The degree of each tree (number of sons) is encoded with a sequence of 1 terminated by a 0. All the nodes are represented with a level degree visit of T (commas are used just for helping the reader but we don't need to represent them). This representation is known as Level Ordered Unary Degree Sequence (LOUDS). Using auxiliary data structures for Rank and Select operations, LOUDS supports in constant time, finding the parent, the ith child, and the degree of a node.
  • The number of 0 in the binary array is the number of node in the three.
  • Each node is son of another node and therefore has a 1 representing it (we need to find which 1 is the correct one)
  • As a consequence, we need at least a space of 2n, plus something for navigation

Now, we need to introduce two operations which can be computed in constant time with additional 0(n) space (see the paper).
  • Rank1(j) returns the # of 1 up and including position j. Rank0(j) is a similar operation for 0
  • Select1(j) returns the position of of the j-th 1. Select0(j) is a similar operation for 0
Given the two above operations we can navigate the encoded representation of the tree. For instance:
  1. Compute the degree of a node in position p. This is the number of 1 up to the next 0 and can be computed as Select0(rank0(p)+1) - p
  2. Compute the parent of a three as in Select1(rank0(p)+1)
  3. Compute the ith children of a tree;
Et voila. No more costly pointers, just bit-vectors and some additional space. In total no more than 2n + o (n). Looks like magic, but it is not. Think about an heap where you represents a binary three with no pointers. LOUDS are using a similar idea but they are more generic.

Another magic representation is based on balanced parenthesis and it is due to Munro and Raman. Suppose you visit the three depth first and open a parenthesis on the way down, and close the parenthesis on the way up. The ordinal tree is encoded again ad a bitvector but now we have the additional benefit that the nodes of a subtree are very close in the bitvector. This is a consequence of the depth first visit. It is possible to navigate the three with Rank & Select operations.

Benoit, Demain, Munro, Raman proposed a variant of LOUDS based on depth first named Depth First Unary Degree Sequence (DFUDS), which preserves locality and still allows tree navigation in terms of Rank & Select. In addition, they proposed further optimization for finding the ith children of a node, if the tree is a trie with k outdegree.

More information about this fascinating word is available in Succinct Trees in Practice, 2010

Friday, October 1, 2010

More Fired UP then ever

Source: KARA


Yahoos,

Carol from the foxhole here. No, actually I’m in Atlanta to give a speech later tonight, and doing some client meetings.

But I’m sure you’ve seen the rumors about Hilary Schneider, David Ko, and Jimmy Pitaro leaving. Yes, they’re leaving, but each for different reasons that suit their life–and Hilary will send a separate note to her team with more about the Americas org. later today.

After nearly four years with Yahoo!, Hilary has decided to move on to the next phase in her career. Hilary has played a major role at Yahoo! in driving our strategies for content, online advertising and more. I want to thank her for her contributions over the years, and wish her the best as she moves on.

We expect to announce the new head of the Americas region before the end of the year. There is a lot of interest in joining our team for such a key position and, in the meantime, Hilary will stay on to help with the transition.

Now, you all know we’ve been intensely focused on improving our operations and top line growth. And we’ve seen good progress. But there’s a lot more to do, and as we transition to a new leader for the Americas we’re taking this opportunity to double down on these efforts. Hilary’s successor will clearly be maniacally focused on growing our top line in the region.

So everyone stay calm–we have a good plan in place. In fact, I’m more fired up than ever and can roll with the punches. Yahoo! is a great place.

Carol

I forecasted an acquisition, Apparently is partnership for Facebook & Skype

Couple of weeks ago, I was suggesting / wondering "Should Facebook buy Skype?". Apparently, a huge deal is coming.

Thursday, September 30, 2010

Louis Mounier is back with Qwiki

Yeah! Yeah! Yeah! Louis is back. Do you remember him from Altavista? Qwiki has taken the top prize at TechCrunch Disrupt San Francisco. Check it out the demo is quite compelling, I subscribed the alpha service.

Now I want to see Fuzzy Mauldin, back in action

Facebook 2nd Video site in U.S. and they do not host videos?

Comscore for video is out. Facebook is the second one, but they do not host video. Yahoo! is not longer the second one in the category

Tuesday, September 28, 2010

TechCrunch editor is tired and he sold to AOL

I am a big fan of TechCruch whose posting I read voraciously. I am not sure that I was happy when I read this:

"The truth is I was tired. But I wasn’t tired of writing, or speaking at events. I was tired of our endless tech problems, our inability to find enough talented engineers who wanted to work, ultimately, on blog and CrunchBase software. And when we did find those engineers, as we so often did, how to keep them happy. Unlike most startups in Silicon Valley, the center of attention at TechCrunch is squarely on the writers. It’s certainly not an engineering driven company.

AOL of course fixes that problem perfectly. They run the largest blogging network in the world and if we sold to them we’d never have to worry about tech issues again. We could focus our engineering resources on higher end things and I, for one, could spend more of my day writing and a lot less time dealing with other stuff."

So they sold to AOL

Launched Video Answer in Paris

Got a(nother) patent for Video Hover Preview

Hmm.. today is a really strange day since I discovered that my ex-Ask.com Pisa group received another patent assigment.

This is time is about Video Search. Back in 2005, we patented the idea of Video Hover preview. These days, Video Hover is a quite popular features among many video sites.

Abstract: A method and system to preview video content. The system comprises an access component to receive a search request and a loader to simultaneously stream a plurality of videos associated with the search request. The system may further comprise a trigger to detect a pointer positioned over a first video and a mode selector to provide the first video in a preview mode.

Inventors:

Gulli; Antonino (Pisa, IT)
Savona; Antonio (Sora, IT)
Veri; Mario (Rocca San Giovanni, IT)




Yet another achievement for Pisa, the "nowhere place"

Monday, September 27, 2010

Peter Thiel: Facebook Won’t IPO Until 2012 At The Earliest

Quoting Techcrunch: "Facebook’s first investor Peter Thiel told Fox Business today that Facebook will not IPO until 2012. Earlier this summer, Bloomberg reported that Facebook was holding off on its IPO for another two years and Thiel seems to confirm this line of thinking.

Thiel says that Facebook wants to follow the example of Google, and doesn’t want to IPO until late in the process (which he claims is a byproduct of Sarbanes Oxley and other regulation)."

Sunday, September 26, 2010

Fishes in a lake

There is an unknown number of fishes in a lake and you need to estimate them. You start fishing and you capture 50 fishes, then you mark them all with a special pen and throw them back in the water. You decide to start fishing again and you recapture 2 marked fishes. So how many fishes there are in the lake?

Saturday, September 25, 2010

Got a patent for Image Search

So I got a patent. It's a quite long process. Started back in Sept 2005 and endend in Sept 2010.

Abstract: "A system and method for determining if a set of images in a large collection of images are near duplicates allows for improved management and retrieval of images. Images are processed, image signatures are generated for each image in the set of images, and the generated image signatures are compared. Detecting similarity between images can be used to cluster and rank images."

Please note that the patent has been submitted before the classical paper PageRank for Product Image Search (2008) from Google, which presented similar ideas.

The algorithms described in the patent were used in Ask.com Image Search, which had the following judgement by PC-World 2007 "Returned very accurate image results, and it has a well-designed search results page. This is a worthy alternative to Google"

Inventors are:

Gulli'; Antonino (Pisa, IT)
Savona; Antonio (Sora, IT)
Yang; Tao (Santa Barbara, CA)
Liu; Xin (Piscataway, NJ)
Li; Beitao (Piscataway, NJ)
Choksi; Ankur (Somerset, NJ)
Tanganelli; Filippo (Castiglioncello, IT)
Carnevale; Luigi (Pisa, IT)

This idea has been ispired by a quite famous Andy Warhol painting



I am particularly proud of this patent, since it proves that innovation can be carried out even from Pisa, a "nowhere place in Italy" ;-). Bet won!

Friday, September 24, 2010

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Prediction

A very interesting reading and the PDF is free: "Ensemble methods have been called the most influential development in Data Mining and Machine Learning in the past decade. They combine multiple models into one usually more accurate than the best of its components. Ensembles can provide a critical boost to industrial challenges -- from investment timing to drug discovery, and fraud detection to recommendation systems -- where predictive accuracy is more vital than model interpretability. Ensembles are useful with all modeling algorithms, but this book focuses on decision trees to explain them most clearly. After describing trees and their strengths and weaknesses, the authors provide an overview of regularization -- today understood to be a key reason for the superior performance of modern ensembling algorithms. The book continues with a clear description of two recent developments: Importance Sampling (IS) and Rule Ensembles (RE). IS reveals classic ensemble methods -- bagging, random forests, and boosting -- to be special cases of a single algorithm, thereby showing how to improve their accuracy and speed. REs are linear rule models derived from decision tree ensembles. They are the most interpretable version of ensembles, which is essential to applications such as credit scoring and fault diagnosis. Lastly, the authors explain the paradox of how ensembles achieve greater accuracy on new data despite their (apparently much greater) complexity."

Launched autosuggest in Paris

My group made it.

Thursday, September 23, 2010

Facebook down -- Internet is crying


So it happens... When you are a little baby and you are about to become a kid. You need to make your own mistakes and sometime you need to fall down. Nothing wrong with it. Your mistakes will make you an adult.

So it happens.. Facebook went down for 2.5hours. My friend Nick wonders whether this increased the productivity in all the world ...

Apparently, the problem was generated by an automatic system used to maintain consistency in cache. Now Facebook is an adult, like Google, Bing, and Yahoo. All of them went down

Wednesday, September 22, 2010

Find the majority

I used this question in my interviews (medium-hard). You have a very long array of integers can you find the int that appears the majority of times with no additional memory and in less than linear time?

Tuesday, September 21, 2010

Random generation

Given a random generator function rand04() that generates random numbers between 0 and 4 (i.e., 0,1,2,3,4) with equal probability. Write a random genartor rand07() that generates numbers between 0 to 7 (0,1,2,3,4,5,6,7) with equal probability.

Monday, September 20, 2010

Maximum contiguous sum in an array

Given an array of integers find the contiguos sequence with maximum sum

Your Personal Ads is waiting you

A dear friend of mine pointed out the work of ImmersiveLabs on targeted ads. Their experience made me formalize some random thoughts I had in my mind for a long period.

Last century was the age of massive and unpersonalized ads on TV, radio, and newspapers.

Ten years ago we saw the beginning of personalized ads with Google. The model was simple and therefore a huge revolution. You search what you want, I give you the results but we agree that I can use your query for serving targeted ads. Actually, the model was proposed originally by Goto.com and Lycos but they missed to understand the value .. and this is another story.

A couple of years ago we started to see another use of personalization. Again the model is simple. You are in a given place and access a service from there (either with your phone or with your desktop). I provide you the service, but will use your geo-location for serving local ads.

These are the days of multiple sensors. ImmersiveLabs is using a cam to detect your face and your sex. Your Iphone can send information about your walking style or your attitude. Your game console can send information about your training style or the what do you like to play. Your tv will soon send information about your entertainment needs. Facebook knows every web page you Like.

How many other ad sensors are there out in the world?

Gianni!!

Tuesday, September 14, 2010

Bing is the second one in US according to Nielsen

For the month, Bing garnered 13.9% of all U.S. search traffic, up from 13.6% in July, according to Nielsen. Yahoo's share over the same period declined from 14.6% to 13.1%. Google, meanwhile, made a slight gain, as its share increased from 64.2% in July to 65% in August.

Monday, September 13, 2010

Why Facebook should work on a social graph based e-bay engine

I believe that Facebook should work on Ebay 2.0, the social graph evolution of the traditional auction market.

On Ebay, a user can rank another user according to the quality of the transaction they concluded. Usually, each user take a look to the seller's ranking for assessing the potential risk of a bet. Even if each user is identified with a login and a password, the seller and the bidder do not know each others.

Facebook and the social graph will add an additional level of trust since both the seller and the bidder can identify a path of users that connect both of them and can provide references. Linkedin adopted a similar mechanism for introducing each others unknown persons. Facebook can use it for creating a new Ebay. In alternative, they can buy a competitor and integrate it in the social graph.

Sunday, September 12, 2010

count the frogs

You have a pond with frogs. Count the frogs.

Saturday, September 11, 2010

Chocolates and roses

You have 21 chocolates and 91 roses, how many bouquets can you make without discarding neither a rose nor a chocolate. All the bouquets must have the same number of chocolates and roses for you.

Wednesday, September 8, 2010

Instant Search

Google just launched a new form of Instant search. I have a couple of questions for the academic people or maths lovers:
  1. Estimate what is the percentage of your query log that can you cover with say X millions queries. Say that this is p%;
  2. Estimate what is the average lenght of a query;
  3. Evaluate whether would be better to leverage traditional inverted lists or tries;
  4. Estimate the impact of caching and your communication framework;
Use all those elements and create your own Instant search recipe. What is the amount of space you need?

Cosmos Massive computations in Microsoft (comparison with Map Reduce, Hadoop and Skeleton Programming)

Map & Reduce is a distributed computation paradigm made popular by Google. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. Many people believes that Map&Reduce has been inspired by the traditional fold and reduce operation typically adopted in functional programming. I believe that there are additional similarities with the less popular skeleton programming paradigm (the equivalent of design patterns for parallel computation). In fact, Map&Reduce can be considered as a particular type of farm, where the collector/reducers are guaranteed to receive the data in a sorted-by-key order. Both Map&Reduce and Skeleton Programming, delegate to the compiler the task of mapping the expressed computation on the top of the execution nodes. In addition, both of them hide the boring data marshaling/un-marshaling operations. Some people pointed out that one limit of Map&Reduce is the difficulty to express more complex type of computations, such as loops or pipes. In general, those operations are emulated by adopting some an external program that coordinates the execution of several map&reduce sequential stages. Within a given map&reduce stage the system can exploit parallelism, but different stages are executed in sequence.

The adoption of SQL-like interfaces is one interesting extension to Map&Reduce. SQL-like statements allow to express parallel computations with no need to write code in a specific programming language (Java, C++ or whatever). These extensions has been originally inspired by Google's Sawzall and they have been made popular by Yahoo's Hadoop Pig and by Facebook's Hadoop Hive. All those extensions are very useful syntactic sugar, but they do no extend the Map&Reduce model.

Scope (Structured Computations Optimized for Parallel Execution) is a new declarative and extensible scripting language targeted massive data analysis. Quoting the paper "Scope: Easy and Efficient Parallel Processing of Massive Data Sets": However, this model (Map&Reduce) has its own set of limitations. Users are forced to map their applications to the map-reduce model in order to achieve parallelism. For some applications this mapping is very unnatural. Users have to provide implementations for the map and reduce functions, even for simple operations like projection and selection. Such custom code is error-prone and hardly reusable. Moreover, for complex applications that require multiple stages of map-reduce, there are often many valid evaluation strategies and execution orders. Having users implement (potentially multiple) map and reduce functions is equivalent to asking users specify physical execution plans directly in database systems. The user plans may be suboptimal and lead to performance degradation by orders of magnitude"

In other words, you can leverage loops, pipes and many other types of parallel patterns with NO need of emulating those steps as in pure Map&Reduce. The process is transparent to the user since she can just express a collection of SQL-like statements (a "Script") and the compiler will generate the more appropriate parallel execution patterns. For instance, (Example 3, in the article)
R1 = SELECT A+C AS ac, B.Trim() AS B1
FROM R WHERE StringOccurs(C, “xyz”) > 2

#CS public static int StringOccurs(string str, string ptrn)
{ int cnt=0; int pos=-1;
while (pos+1 < str.Length) {
pos = str.IndexOf(ptrn, pos+1);
if (pos < 0) break;
cnt++;
}
return cnt;
}
#ENDCS

where A, B, C are colums in a SQL-like schema, and particular StringOccurs C# string function is used to filter the column C. This example shows how to write a user-defined function in scope.

Scope runs on the top of Cosmos storage system. "The Cosmos Storage System is an append-only file system that reliably stores petabytes of data. The system is optimized for large sequential I/O. All writes are append-only and concurrent writers are serialized by the system. Data is distributed and replicated for fault tolerance and compressed to save storage and increase I/O throughput."

The language Scope supports:
  • Join: SQL-like;
  • Select: SQL-like;
  • Reduce: "it takes as input a rowset that has been grouped on the grouping columns specified in the ON clause, processes each group, and outputs zero, one or multiple rows per group";
  • Process: "it takes a rowset as input, processes each row in turn, and outputs a sequence of rows";
  • Combine: "it takes two input rowsets, combines them in some way, and outputs a sequence of rows";
One important difference with Map&Reduce based SQL-like interfaces is that each SCOPE script is compiled into a flexible execution directed acyclic graph (DAG) instead of a rigid map&reduce structure. The graph represents the execution plan. "A physical execution plan is, in essence, a specification of Cosmos job. The job describes a data flow DAG where each vertex is a program and each edge represents a data channel. A vertex program is a serial program composed from SCOPE runtime physical operators, which may in turn call user-defined functions. All operators within a vertex program are executed in a pipelined fashion, much like the query execution in a traditional database system. The job manager constructs the specified graph and schedules the execution. A vertex becomes runnable when its inputs are ready. The execution environment keeps track of the state of vertices and channels, schedules runnable vertices for execution, decides where to run a vertex, sets up the resources needed to run a vertex, and finally starts the vertex program. The translation into an execution plan is performed by traversing the parse tree bottom-up. For each operator, SCOPE has default implementation rules. For example, implementation of a simple filtering operation is a vertex program using SCOPE‟s built-in physical operator “filter” provided with a function that implements the filtering predicate. Following the translation, the SCOPE compiler combines adjacent vertices with physical operators that can be easily pipelined into (super) vertices."

Note that the users need to know neither the optimal mapping nor the optimal number of processes allocated for a particular SCOPE script execution. Everything is transparently computed by the SCOPE compiler and mapped on the top of a well defined set of "parallel design patterns". Dryad provides the basic primitives for SCOPE execution and Cosmos job allocation. Please refer this page if you are interested in more information about Dryad, or if you want to download the academic release of DryadLINQ.

This power of expression allows to express massive computations with very little SQL-like scripting (again from the above paper):
SELECT query, COUNT() AS count FROM "search.log"
USING LogExtractor GROUP BY query
HAVING count > 1000
ORDER BY count DESC;
OUTPUT TO "qcount.result";
Here petabytes (and more!) of searchlogs are stored on the distributed Cosmos storage and processed in parallel to mine the one with at least 1000 "counts". Another example, this time with joins:

SUPPLIER = EXTRACT s_suppkey,s_name, s_address, s_nationkey, s_phone, s_acctbal, s_commen FROM "supplier.tbl" USING SupplierExtractor;
PARTSUPP = EXTRACT ps_partkey, ps_suppkey, ps_supplycost FROM "partsupp.tbl" USING PartSuppExtractor;
PART = EXTRACT p_partkey, p_mfgr FROM “part.tbl" USING PartExtractor;
// Join region, nation, and, supplier
// (Retain only the key of supplier)
RNS_JOIN = SELECT s_suppkey, n_name FROM region, nation, supplier WHERE r_regionkey == n_regionkey AND n_nationkey == s_nationkey;
// Now join in part and partsupp
RNSPS_JOIN = SELECT p_partkey, ps_supplycost, ps_suppkey, p_mfgr, n_name FROM part, partsupp, rns_join WHERE p_partkey == ps_partkey AND s_suppkey == ps_suppkey;
// Finish subquery so we get the min costs
SUBQ = SELECT p_partkey AS subq_partkey, MIN(ps_supplycost) AS min_cost FROM rnsps_join GROUP BY p_partkey;
// Finish computation of main query
// (Join with subquery and join with supplier
// again to get the required output columns)
RESULT = SELECT s_acctbal, s_name, p_partkey, p_mfgr, s_address, s_phone, s_comment FROM rnsps_join AS lo, subq AS sq, supplier AS s WHERE lo.p_partkey == sq.subq_partkey AND lo.ps_supplycost == min_cost AND lo.ps_suppkey == s.s_suppkey ORDER BY acctbal DESC, n_name, s_name, partkey;
// output result OUTPUT RESULT TO "tpchQ2.tbl";

Cosmos and Scope are an important part of the Bing & Hotmail systems, as described in "Server Engineering Insights for Large-Scale Online Services, an in-depth analysis of three very large-scale production Microsoft services: Hotmail, Cosmos, and Bing that together capture a wide range of characteristics of online services

Wired, Is the Web really dead? Reading numbers the way you like

Wired published an article, "The Web Is Dead. Long Live the Internet", containing this enclosed image.

According to the authors, the whole Web model is "dead" since this type of traffic is progressively reducing. App is the new rising start.

Many people expressed perplexity about the relation between the graph (traffic on internet) and the conclusion of the paper “Web is dead . Long live to Internet the App model”.

In more details, the fundamental misleading assumption is to consider the relative percentage growth instead of the absolute growth. Quoting Boing Boing: “In fact, between 1995 and 2006, the total amount of web traffic went from about 10 terabytes a month to 1,000,000 terabytes (or 1 exabyte). According to Cisco, the same source Wired used for its projections, total internet traffic rose then from about 1 exabyte to 7 exabytes between 2005 and 2010.”

The Web traffic had a huge increase but this increase has been relatively smaller when compared to video traffic. Probably this is also due to the data dimension of each video ;-). Same consideration for email, dns, etc.

Another misleading assumption is the definition of Video. Cisco created a such category which includes things like (Skype) video calls, Netflix, but ALSO youtube & hulu web traffic. It is questionable if this definition is correct. Perhaps, Youtube video traffic should be legitimately considered part of the Web traffic because this is a type of Web browser traffic.

Tuesday, September 7, 2010

Longest non decreasing subsequence

The problem of Longest increasing sub-sequence is defined and solved here. Starting from that, can you solve the following: Find out Non-Decreasing subsequence of length k with maximum sum in an array of n integers ?

Monday, September 6, 2010

sum to 0

Given a list of n integers (negative and positive), not sorted and duplicates allowed, you have to output the triplets which sum to 0

Sunday, September 5, 2010

Number of 1 in a sequence

Compute the number of sequences of n binary digit that do NOT contain consecutive 1 symbols.

Saturday, September 4, 2010

Partition an array

You have and array with numbers only in 0-9. You have to divide it into two array such that sum of elements in each array is same.

Friday, September 3, 2010

How Facebook works?

After the posting about Google (WSDM09) an interesting talk about how things are done @ Facebook . A lot of open source including Mysql, CFengine (for configuration), Thrift (for RPC), Hadoop (for MapReduce and data storage), Scribe (for data migration), Bittorrent (for code deployment, which is crazy but fascinating) , Hive for SQL like map-reduce, Ganglia for Clustering Monitoring, apache, memcached, etc.

Thursday, September 2, 2010

Two beautilful Hadhoop books

Hadoop has become "de-facto" standard for industrial parallel processing. Congrats to Doug who inspired and worked on this project since the very beginning.

Recently, I reviewed two beautiful books that I would suggest for your education
  • Data-Intensive Text Processing with MapReduce which focuses on how to think and design algorithms in "MapReduce", with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning.
  • Hadoop: The Definitive Guide with an emphasis on how to program a large hadoop cluster, and with real examples of industrial use in large organizations such as Last.Fm, Facebook and other.

Google and AOL renewed the contract

Nothing new, this was expected. "Mr Armstrong, who took over as the company’s top executive last year, has vowed to restructure the once formidable internet service provider into one of the internet’s biggest content producers"

How many search providers are still out of there?


Different approaches to Universal Search.

Universal search is about having an etherogenous set of vertical retrieval systems (Web, Image, Video, News, Books, etc) which are collaborating together for retrieving the right set of etherogeneous results. There are three approaches:
  1. Having a set of classifiers, one for each domain that will guess whether that particular domain will have a set of relevant results for the given query. The pros is that you would avoid to send all the queries to all the verticals. The cons is that you will best guess the validity of the results, since you are not sending the real query to the vertical. This is Google pre-2004
  2. You have an infrastructure good enough to support Web-like traffic in all the different verticals. This is google post 2004. Note that this is a pull approach
  3. Another solution, not adopted by Google , is to push set of relevant queries from the vertical to the Query distributor. Consider that each vertical knows, at every given instant of time, the type of queries he can serve with good enough quality.
What solution would you adopt 1, 2, or 3 and when?

Wednesday, September 1, 2010

Hive and the SQL-Like fashion

Hive is a nice SQL like interface on the top of Hadoop, an open source platform realized by Doug Coutting. More and more the database is used for non online transactions.

Fulcrum was the first time I saw a SQL-like interface for retrieval back in 1996, now it's funny to see the Map/Reduce paradigm expressed in SQL-like as in

  FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
AS dt, uid
CLUSTER BY dt) map_output

INSERT OVERWRITE TABLE pv_users_reduced
REDUCE map_output.dt, map_output.uid
USING 'reduce_script'
AS date, count;

Tuesday, August 31, 2010

Sorting exceed the 1G keys/sec on GPUs

This project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the 1G keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second).

Monday, August 30, 2010

Apache Mahout: a formidable collection of Data mining algos on the top of Hadoop (Map Reduce)

I am playing with Mahout these days and I am loving it. Mahout is a formidable collection of data mining and machine learning algorithms implemented on the top of Hadoop, the open source Map Reduce programming paradigm.

Here you find some hints to run Mahout on the top of Amazon EC2. Here a collection of algorithms implemented . They include:

Classification

Logistic Regression (SGD implementation in progress: MAHOUT-228)

Bayesian

Support Vector Machines (SVM) (open: MAHOUT-14, MAHOUT-232 and MAHOUT-334)

Perceptron and Winnow (open: MAHOUT-85)

Neural Network (open, but MAHOUT-228 might help)

Random Forests (integrated - MAHOUT-122, MAHOUT-140, MAHOUT-145)

Restricted Boltzmann Machines (open, MAHOUT-375, GSOC2010)

Clustering

Canopy Clustering (MAHOUT-3 - integrated)

K-Means Clustering (MAHOUT-5 - integrated)

Fuzzy K-Means (MAHOUT-74 - integrated)

Expectation Maximization (EM) (MAHOUT-28)

Mean Shift Clustering (MAHOUT-15 - integrated)

Hierarchical Clustering (MAHOUT-19)

Dirichlet Process Clustering (MAHOUT-30 - integrated)

Latent Dirichlet Allocation (MAHOUT-123 - integrated)

Spectral Clustering (open, MAHOUT-363, GSoC 2010)

Pattern Mining

Parallel FP Growth Algorithm (Also known as Frequent Itemset mining)

Regression

Locally Weighted Linear Regression (open)

Dimension reduction

Singular Value Decomposition and other Dimension Reduction Techniques (available since 0.3)

Principal Components Analysis (PCA) (open)

Independent Component Analysis (open)

Gaussian Discriminative Analysis (GDA) (open)

Evolutionary Algorithms

see also: MAHOUT-56 (integrated)

Sunday, August 29, 2010

Should Facebook buy Skype?

Skype has more than 560 million registered users, and it filed for a 136Million IPO . Facebook has about 500 million registered users and a pre-IPO value of 50 billion $.

Are the two business complementary? Probably: 

Having the ability to call their friends from the favorite social network would be a great feature for final users. Increasing the Social graph and having a client installed in users' desktop would be a great opportunity for Facebook. Having an access to  the Facebook computing power would be a chance for scaling the Skype platform and reducing their costs.

After all, Skype Could Trump Facebook in Social Networking

Saturday, August 28, 2010

Relational Join on Map Reduce

There is a simple trick for emulating relational join on the top of Map Reduce. Since all the items with the same key are sent to the same reducer,  it is therefore easy to emulate a join on the reduce side. What are the other little tricks you need to adopt?

Obviously, you can alway use an higher level abstraction such as Hive on the top of Hadoop

Friday, August 27, 2010

Google Jeff Dean's key note @ WSDM 2009

Slides (PDF) from his talk available, available on videolectures.net. See also the comments given by Greg. I found particurlarly interesting the part on posting compression and the one on universal search

Thursday, August 26, 2010

Parallel In-Place N-bit-Radix Sort

Good article about parallel in-place sort "This high performance linear-time O(n) algorithm outperforms STL sort() by more than 60X. Parallel Counting Sort benefited by 2-3X when running on a quad-core processor, accelerating from parallelization of the counting portion of the algorithm. ..combine In-Place N-bit-Radix Sort and Parallel Counting Sort to develop a Parallel In-Place N-bit-Radix Sort algorithm, using Intel's Threaded Building Blocks library. "