Tuesday, August 31, 2010

Sorting exceed the 1G keys/sec on GPUs

This project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the 1G keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second).

Monday, August 30, 2010

Apache Mahout: a formidable collection of Data mining algos on the top of Hadoop (Map Reduce)

I am playing with Mahout these days and I am loving it. Mahout is a formidable collection of data mining and machine learning algorithms implemented on the top of Hadoop, the open source Map Reduce programming paradigm.

Here you find some hints to run Mahout on the top of Amazon EC2. Here a collection of algorithms implemented . They include:

Classification

Logistic Regression (SGD implementation in progress: MAHOUT-228)

Bayesian

Support Vector Machines (SVM) (open: MAHOUT-14, MAHOUT-232 and MAHOUT-334)

Perceptron and Winnow (open: MAHOUT-85)

Neural Network (open, but MAHOUT-228 might help)

Random Forests (integrated - MAHOUT-122, MAHOUT-140, MAHOUT-145)

Restricted Boltzmann Machines (open, MAHOUT-375, GSOC2010)

Clustering

Canopy Clustering (MAHOUT-3 - integrated)

K-Means Clustering (MAHOUT-5 - integrated)

Fuzzy K-Means (MAHOUT-74 - integrated)

Expectation Maximization (EM) (MAHOUT-28)

Mean Shift Clustering (MAHOUT-15 - integrated)

Hierarchical Clustering (MAHOUT-19)

Dirichlet Process Clustering (MAHOUT-30 - integrated)

Latent Dirichlet Allocation (MAHOUT-123 - integrated)

Spectral Clustering (open, MAHOUT-363, GSoC 2010)

Pattern Mining

Parallel FP Growth Algorithm (Also known as Frequent Itemset mining)

Regression

Locally Weighted Linear Regression (open)

Dimension reduction

Singular Value Decomposition and other Dimension Reduction Techniques (available since 0.3)

Principal Components Analysis (PCA) (open)

Independent Component Analysis (open)

Gaussian Discriminative Analysis (GDA) (open)

Evolutionary Algorithms

see also: MAHOUT-56 (integrated)

Sunday, August 29, 2010

Should Facebook buy Skype?

Skype has more than 560 million registered users, and it filed for a 136Million IPO . Facebook has about 500 million registered users and a pre-IPO value of 50 billion $.

Are the two business complementary? Probably: 

Having the ability to call their friends from the favorite social network would be a great feature for final users. Increasing the Social graph and having a client installed in users' desktop would be a great opportunity for Facebook. Having an access to  the Facebook computing power would be a chance for scaling the Skype platform and reducing their costs.

After all, Skype Could Trump Facebook in Social Networking

Saturday, August 28, 2010

Relational Join on Map Reduce

There is a simple trick for emulating relational join on the top of Map Reduce. Since all the items with the same key are sent to the same reducer,  it is therefore easy to emulate a join on the reduce side. What are the other little tricks you need to adopt?

Obviously, you can alway use an higher level abstraction such as Hive on the top of Hadoop

Friday, August 27, 2010

Google Jeff Dean's key note @ WSDM 2009

Slides (PDF) from his talk available, available on videolectures.net. See also the comments given by Greg. I found particurlarly interesting the part on posting compression and the one on universal search

Thursday, August 26, 2010

Parallel In-Place N-bit-Radix Sort

Good article about parallel in-place sort "This high performance linear-time O(n) algorithm outperforms STL sort() by more than 60X. Parallel Counting Sort benefited by 2-3X when running on a quad-core processor, accelerating from parallelization of the counting portion of the algorithm. ..combine In-Place N-bit-Radix Sort and Parallel Counting Sort to develop a Parallel In-Place N-bit-Radix Sort algorithm, using Intel's Threaded Building Blocks library. "


Wednesday, August 25, 2010

ClueWeb 09

ClueWeb is a wonderful Web dataset available for the research community. I

  • 1,040,809,705 web pages, in 10 languages
  • 5 TB, compressed. (25 TB, uncompressed.)
  • Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed)
  • Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed)

Tuesday, August 24, 2010

A very interesting talk about Social Graphs

Facebook has probably the largest Social Graph out of there. In a social graph you have Friends of Friends relationships (FoF) and Object of Friends (OoF) relations. All those relations are used to perform a form of social ranking at search time which is personalized according to the interest of your own social community.

This turns out to be a nice graph problem, where you need to efficiently compute a prefix search on the huge social graph, with 4x10^8 nodes and 7.7x10^10 edges in FoF and 10^9 nodes and 10^10 edges in OoF.

Four options have been explored
:
  1. a classical trie which is very expensive for pointer memory consumption;
  2. a sorted vector searched with range queries and binary search which doesn't scale and it is expensive in terms of dynamic insertion;
  3. brute force search;
  4. a version bloom filtered architecture, which optimizes brute force search;

Monday, August 23, 2010

Number puzzle

I was not able to solve it: if an integer n has just 3 factors, how many factors has n^2. Today I learned something new.

A lillte puzzle with balls

Little interview puzzle: A container is filled up with balls. Each ball has a value. Red = 7, Yellow = 5, Green = 2, Blue 2. Some balls are removed from the container and the product of their value is 147,000. How many red balls have been removed?

Sunday, August 22, 2010

Lycos from 13 billions (2001) to 36 millions (2010)

Lycos has been sold once again, from 13 billions back in 2001 down to 36 millions in 2010. Lycos was probably the first search engine ever. I started playing with search engines back in 1994, with `pursuit` the CMU open source code, later on evolved into the Lycos code.

Technology was really good at that time, but people belived that "You cannot make money with search" or "Search is nothing, Portal is the future". Later on this became "You cannot compete with Google, you need to make something different".

Friday, August 20, 2010

Solution to the Facebook programming puzzle of "mutual likes"

Disclaimer, you need to create your own Facebook working access_token and replace it in the below urls
  • These are your friends [ https://graph.facebook.com/me/friends?access_token= ]
{
"data": [
{
"name": "",
"id": "31913"
},
{
  • This is what you like [ https://graph.facebook.com/me/friends?access_token= ]
{
"data": [
{
"name": "Information theory",
"category": "Field of study",
"id": "107921045903425",
"created_time": "2010-08-20T17:47:47+0000"
},
{
"name": "Data modeling",
"category": "Interest",
"id": "107298675966720",
"created_time": "2010-08-20T17:47:18+0000"
},
{
"name": "Statistical inference",
"category": "Interest",
"id": "108195032537847",
"created_time": "2010-08-20T17:46:49+0000"
},
  • This is what my friend Tony likes [ https://graph.facebook.com/1452108295/likes?access_token=]
{
"data": [
{
"name": "BB King",
"category": "Musicians",
"id": "43780515976",
"created_time": "2010-07-20T19:57:00+0000"
},
{
"name": "Luciano Moggi",
"category": "Public_figures_other",
"id": "56461948713",
"created_time": "2010-07-09T08:12:34+0000"
},
{
"name": "Bing Jobs Europe",
"category": "Technology",
"id": "132723843410879",
"created_time": "2010-06-30T15:13:13+0000"
},
{
"name": "10 Worst Tattoo Spelling Mistakes EVER!!!",
"category": "Public figure",
"id": "133641006660295",
"created_time": "2010-06-27T12:02:37+0000"
},
Therefore the solution is to enumerate all your friends, get their ids, get what they likes and intersect with what you like.

Since the puzzle requires to "The problem is to find the friend who shares the most number of items with you", you can simply store all your "like ids" and then, for each friend, count how many likes ids you have in common. You just need to store the current maximum, and the correspondent friend id.

Facebook opened an office in seattle, with a puzzle

Disclaimer. I don't like many puzzles, unless there is some math or algorithm intuition behind them. So, Facebook is going to Seattle and they have a puzzle. Here it is

Let's see: 

The first occurrence of 'e' comes before 1 occurrence of 's'.
The first occurrence of 'i' comes before 1 occurrence of 'n'.
The first occurrence of 'i' comes before 1 occurrence of 'i'.
The first occurrence of 'n' comes before 2 occurrences of 'e'.
The first occurrence of 'e' comes before 1 occurrence of 'e'.
The first occurrence of 'i' comes before 1 occurrence of 'v'.
The first occurrence of 'n' comes before 1 occurrence of 'i'.
The first occurrence of 'n' comes before 1 occurrence of 'v'.
The first occurrence of 'i' comes before 1 occurrence of 's'.
The first occurrence of 't' comes before 1 occurrence of 's'.
The first occurrence of 'v' comes before 1 occurrence of 's'.
The first occurrence of 'v' comes before 2 occurrences of 'e'.
The first occurrence of 't' comes before 2 occurrences of 'e'.
The first occurrence of 'i' comes before 2 occurrences of 'e'.
The first occurrence of 'v' comes before 1 occurrence of 't'.
The first occurrence of 'n' comes before 1 occurrence of 't'.
The first occurrence of 'v' comes before 1 occurrence of 'i'.
The first occurrence of 'i' comes before 1 occurrence of 't'.
The first occurrence of 'n' comes before 1 occurrence of 's'

  • What is the dictionary of distinct letters? This is given by the symbols used for the above rules. Therefore: i, n, e, v, t, s, 
  • Are there repetitions? yes, "2 occurrences of e" and "2 occurrences of i". So the letters are D={i, n, e, v, t, s, e, i} and the string has length 7 because all the rules are given
  • So the solution is an anagram of D which respects the rules

I am too lazy to test all the rules, let's test the solution directly 

http://www.facebook.com/seattle/invitees/


MemCached @ Facebook

Interesting talk about MemCached usage at Facebook. Optimizations are about using UPD, tuning memory allocation, and synchronizing different datacenters. Interestingly enough the memory allocation problem is similar to the one seen for vectors allocation in STL.

An interesting post about hacking culture

Apparently hacking culture is seen as a value by a lot of people.

Thursday, August 19, 2010

An hidden gem: Peter Norvig (Google) talks at Facebook about machine learning

Don't be surprised. I work @ Microsoft and I am pointing out this hidden gem. A public talk that Peter Norvig gave at Facebook about machine learning. Peter Norvig is the Director of Research at Google and his talk is about different problems including segmentation, spelling, building n-grams on web scale, Google set (kind of related words) and machine translation.

It seems to me that the lesson suggested is to build "simple probabilistic models" applied to a lot of data, rather then to adopt complex models working on small datasets.

Off-Topic: Real Virtuality

Puzzle: Slicing a pizza in 3 parts

This is a nice puzzle when you are with your friends and you eat a pizza. There are two guys and a pizza. How can you cut the pizza so that everyone is happy about the slice received? This is pretty easy for 2 guys, but how about 3 guys and , more generally, n guys?

Wednesday, August 18, 2010

Facebook UI Front-end technologies

I am not a UI expert, but I do enjoyed this FB presentation about how to optimize UIs. At the end is a matter of reducing communications, re-use common patterns, event delegations and be as much lazy as possible. Surprisingly enough you should also care about making things easy but not so easy...

Tuesday, August 17, 2010

Yahoo changes its API

Yahoo! announced its search changes:

"On October 1, 2010, we will close the SearchMonkey developer tool, gallery, and app preferences"

"The
Yahoo! Site Explorer team is planning tighter integration between Site Explorer and Bing Webmaster Center to make the transition as smooth as possible for webmasters"

"Several search-related web services will continue to be supported, but strictly through YQL. These include the
Yahoo! Term Extraction Web Service, Related Suggestion, and Spelling Suggestion. Other non-BOSS search APIs such as Web Search, Image Search, News Search and Site Explorer APIs will shut down with no further support in YQL"

"Search remains critical to Yahoo! and we’re happy to announce that we will continue to offer the
BOSS program (Build your Own Search Service). In the not too distant future, BOSS will provide web and image search results from Microsoft along with other search-related services and content from Yahoo!, such as news"

Sunday, August 15, 2010

Haystack the Facebook Fileserver

Haystack is the Facebook fileserver for photos. Instead of using costly NFS operations (as they did in the past) they adopted a simple log based file where the typical operation is an append to an already open file. Obviously they have a separate index structure, and many cache server in front of the system.

This is a typical solution when you have a lot of new objects inserted in the system (and a limited amount of delete operation). If you have many delete, you probably should consider something like the COSS (Cyclic Object Storage System) developed for Squid.

Saturday, August 14, 2010

Esemble of Classifiers (part I)

Ensemble of Classifiers is a powerful machine learning technique where several (weak) classifiers are combined to obtain a more powerful classifier.

In this code skeleton, a generic Classifier class is defined which takes a generic input T1 type as input and classifies (transforms) into a generic type T2 (the categories). The generic classifier is then specialized into a weak classifier class with T1 = vector [integer] and T2 = bool. The weak classifier has some random noise which is used to generate mistakes.

A collection of weak classifiers will be then combined into an ensembles of classifiers, in the next days.

Friday, August 13, 2010

Thursday, August 12, 2010

China goes search mode

Xinhua, China Mobile to establish internet search engine joint venture.

"Called "Search Engine New Media International Communications Co.," the joint venture will focus on creating a leading search engine while actively developing other businesses, including the internet, print media and advertising, said Zhou Xisheng, vice president of Xinhua, at a signing ceremony.

"Search engines, which have powerful information integration abilities, play an increasingly important role in disseminating information and influencing public opinion," he said."

Wednesday, August 11, 2010

K-Means Soft

K-Means Soft is a simple clustering algorithm where items are softly assigned to multiple clusters. The idea is that each item moves closer to the centroid it is closest too. Here you have the code using boost::ublas::vectors.

Tuesday, August 10, 2010

Fresh Search tools we are working on



I heard that a new young Talent is becoming famous in U.S. but I was not sure about her name.

So I started to write "jacki", and Bing suggested how to complete the query.

When I selected the suggestion, Bing gave me the result. Lovely.

Monday, August 9, 2010

VectorSpace based on Boost::Matrix

Boost::Matrix provides a convenient way to represent a compact VectorSpace, in particular with its sparse matrices support. Here you have the code

Sunday, August 8, 2010

Saturday, August 7, 2010

Implement a tree visit with no recursion and no stack

Interesting approach here. Visit of the tree obviously just requires O(1) memory. So why we keep teaching it with a stack or by using recursion?

Friday, August 6, 2010

Thursday, August 5, 2010

Count the strings with 0 and 1

Count the set of strings S such that each 0 might be followed by a 0 or a 1, but each 1 might be followed by just a 0. The alphabet is {0, 1} and the maximum string lenght is N.

Scalability of findability: effective and efficient IR operations in large information networks

Scalability of findability: effective and efficient IR operations in large information networks is a simulation paper about distributed retrieval with a large number of nodes (100, 1000, 10000) but with an incredibly small number of documents (4M?). Different strategies of search are compared including Random Walk, TFxIDF similarity, Degree, etc. All the experiments are interesting but I don't like the very small dataset which is un-realistic in any industrial context.

Chelsea got married, follow the story here


Gossip engine is tracking the story of Chelsea.

Wednesday, August 4, 2010

Boxes within Boxes

A box is a 3d object described as (x,y,z) = (length, breath, high). Given a collection of boxes B find the maximum subset that can fit each other

Google Wave is gone. Don't complain. Users must decide

One year ago Google introduced Google Wave. Today they discontinued it. Good move Google. Innovation cannot be planned. Innovation is a collection of successes and failures. You never know. Users will decide.

Discovering the right advertising formula, Larry Page confessed to a Stanford class, “probably was an accident more than a plan.”




"But despite these wins, and numerous loyal fans, Wave has not seen the user adoption we would have liked. We don’t plan to continue developing Wave as a standalone product, but we will maintain the site at least through the end of the year and extend the technology for use in other Google projects."

Lindsay Lohan and the gossip engine


18 days, Lindsay Lohan went to the Jail and the system kept algorithm track of the Gossip event

Tuesday, August 3, 2010

Sort in linear time

Numbers in the range 1... N, sort in O(N)

Monday, August 2, 2010

Generic Programming and the STL: Using and Extending the C++ Standard Template Library

An good reference book about STL, but I would have expected more information about how to extend STL. Anyway, I would recommend to buy it,

Sunday, August 1, 2010

How good is a span of terms?: exploiting proximity to improve web retrieval

Proximity is probably the most important ranking signal ever. In my experience, it is sometime more important than static ranking (such as PageRank) or dynamic ranking (based on some machine learning approach). The problem with proximity is how to determine the span window of maximum allowed proximity among matching query words.

How good is a span of terms?: exploiting proximity to improve web retrieval is an interesting paper by the way of Microsoft research, which extends traditional BM25 ranking with the idea of having a dynamically chosen span window. The optimal window is then selected using a machine learning approach based on LambdaRanking

Two interesting results showed in the paper are (a) the fact that head queries and tail queries need to have different span windows, and (b) the fact that sentences extracted by large collections such as Wikipedia produce limited benefit.

One assumption made by the paper is that the goodness of a document can be modeled as the sum of the span vectors describing each feature. Machine learning is then used to learn the weights in the associated linear combination. Not sure if the linear model is the most appropriate here.

A good summary of what is hot on a stream of queries

An interesting research problem about the ice queries for a stream of data.