Random commentary about Machine Learning, BigData, Spark, Deep Learning, C++, STL, Boost, Perl, Python, Algorithms, Problem Solving and Web Search
Tuesday, August 31, 2010
Sorting exceed the 1G keys/sec on GPUs
Monday, August 30, 2010
Apache Mahout: a formidable collection of Data mining algos on the top of Hadoop (Map Reduce)
Here you find some hints to run Mahout on the top of Amazon EC2. Here a collection of algorithms implemented . They include:
Classification
Logistic Regression (SGD implementation in progress: MAHOUT-228)
Support Vector Machines (SVM) (open: MAHOUT-14, MAHOUT-232 and MAHOUT-334)
Perceptron and Winnow (open: MAHOUT-85)
Neural Network (open, but MAHOUT-228 might help)
Random Forests (integrated - MAHOUT-122, MAHOUT-140, MAHOUT-145)
Restricted Boltzmann Machines (open, MAHOUT-375, GSOC2010)
Clustering
Canopy Clustering (MAHOUT-3 - integrated)K-Means Clustering (MAHOUT-5 - integrated)
Fuzzy K-Means (MAHOUT-74 - integrated)
Expectation Maximization (EM) (MAHOUT-28)
Mean Shift Clustering (MAHOUT-15 - integrated)
Hierarchical Clustering (MAHOUT-19)
Dirichlet Process Clustering (MAHOUT-30 - integrated)
Latent Dirichlet Allocation (MAHOUT-123 - integrated)
Spectral Clustering (open, MAHOUT-363, GSoC 2010)
Pattern Mining
Parallel FP Growth Algorithm (Also known as Frequent Itemset mining)
Regression
Locally Weighted Linear Regression (open)
Dimension reduction
Singular Value Decomposition and other Dimension Reduction Techniques (available since 0.3)
Principal Components Analysis (PCA) (open)
Independent Component Analysis (open)
Gaussian Discriminative Analysis (GDA) (open)
Evolutionary Algorithms
see also: MAHOUT-56 (integrated)
Sunday, August 29, 2010
Should Facebook buy Skype?
Skype has more than 560 million registered users, and it filed for a 136Million IPO . Facebook has about 500 million registered users and a pre-IPO value of 50 billion $.
Are the two business complementary? Probably:
Having the ability to call their friends from the favorite social network would be a great feature for final users. Increasing the Social graph and having a client installed in users' desktop would be a great opportunity for Facebook. Having an access to the Facebook computing power would be a chance for scaling the Skype platform and reducing their costs.
Saturday, August 28, 2010
Relational Join on Map Reduce
There is a simple trick for emulating relational join on the top of Map Reduce. Since all the items with the same key are sent to the same reducer, it is therefore easy to emulate a join on the reduce side. What are the other little tricks you need to adopt?
Obviously, you can alway use an higher level abstraction such as Hive on the top of Hadoop
Friday, August 27, 2010
Google Jeff Dean's key note @ WSDM 2009
Thursday, August 26, 2010
Parallel In-Place N-bit-Radix Sort
Wednesday, August 25, 2010
ClueWeb 09
- 1,040,809,705 web pages, in 10 languages
- 5 TB, compressed. (25 TB, uncompressed.)
- Unique URLs: 4,780,950,903 (325 GB uncompressed, 105 GB compressed)
- Total Outlinks: 7,944,351,835 (71 GB uncompressed, 24 GB compressed)
Tuesday, August 24, 2010
A very interesting talk about Social Graphs
This turns out to be a nice graph problem, where you need to efficiently compute a prefix search on the huge social graph, with 4x10^8 nodes and 7.7x10^10 edges in FoF and 10^9 nodes and 10^10 edges in OoF.
Four options have been explored:
- a classical trie which is very expensive for pointer memory consumption;
- a sorted vector searched with range queries and binary search which doesn't scale and it is expensive in terms of dynamic insertion;
- brute force search;
- a version bloom filtered architecture, which optimizes brute force search;
Monday, August 23, 2010
Number puzzle
A lillte puzzle with balls
Sunday, August 22, 2010
Lycos from 13 billions (2001) to 36 millions (2010)
Technology was really good at that time, but people belived that "You cannot make money with search" or "Search is nothing, Portal is the future". Later on this became "You cannot compete with Google, you need to make something different".
Saturday, August 21, 2010
Forecast: Facebook will surpass Yahoo during next month
Friday, August 20, 2010
Solution to the Facebook programming puzzle of "mutual likes"
- These are your friends [ https://graph.facebook.com/me/friends?access_token= ]
{
"data": [
{
"name": "",
"id": "31913"
},
{
- This is what you like [ https://graph.facebook.com/me/friends?access_token= ]
{
"data": [
{
"name": "Information theory",
"category": "Field of study",
"id": "107921045903425",
"created_time": "2010-08-20T17:47:47+0000"
},
{
"name": "Data modeling",
"category": "Interest",
"id": "107298675966720",
"created_time": "2010-08-20T17:47:18+0000"
},
{
"name": "Statistical inference",
"category": "Interest",
"id": "108195032537847",
"created_time": "2010-08-20T17:46:49+0000"
},
- This is what my friend Tony likes [ https://graph.facebook.com/1452108295/likes?access_token=]
{Therefore the solution is to enumerate all your friends, get their ids, get what they likes and intersect with what you like.
"data": [
{
"name": "BB King",
"category": "Musicians",
"id": "43780515976",
"created_time": "2010-07-20T19:57:00+0000"
},
{
"name": "Luciano Moggi",
"category": "Public_figures_other",
"id": "56461948713",
"created_time": "2010-07-09T08:12:34+0000"
},
{
"name": "Bing Jobs Europe",
"category": "Technology",
"id": "132723843410879",
"created_time": "2010-06-30T15:13:13+0000"
},
{
"name": "10 Worst Tattoo Spelling Mistakes EVER!!!",
"category": "Public figure",
"id": "133641006660295",
"created_time": "2010-06-27T12:02:37+0000"
},
Since the puzzle requires to "The problem is to find the friend who shares the most number of items with you", you can simply store all your "like ids" and then, for each friend, count how many likes ids you have in common. You just need to store the current maximum, and the correspondent friend id.
Facebook opened an office in seattle, with a puzzle
Disclaimer. I don't like many puzzles, unless there is some math or algorithm intuition behind them. So, Facebook is going to Seattle and they have a puzzle. Here it is.
Let's see:
The first occurrence of 'e' comes before 1 occurrence of 's'.
The first occurrence of 'i' comes before 1 occurrence of 'n'.
The first occurrence of 'i' comes before 1 occurrence of 'i'.
The first occurrence of 'n' comes before 2 occurrences of 'e'.
The first occurrence of 'e' comes before 1 occurrence of 'e'.
The first occurrence of 'i' comes before 1 occurrence of 'v'.
The first occurrence of 'n' comes before 1 occurrence of 'i'.
The first occurrence of 'n' comes before 1 occurrence of 'v'.
The first occurrence of 'i' comes before 1 occurrence of 's'.
The first occurrence of 't' comes before 1 occurrence of 's'.
The first occurrence of 'v' comes before 1 occurrence of 's'.
The first occurrence of 'v' comes before 2 occurrences of 'e'.
The first occurrence of 't' comes before 2 occurrences of 'e'.
The first occurrence of 'i' comes before 2 occurrences of 'e'.
The first occurrence of 'v' comes before 1 occurrence of 't'.
The first occurrence of 'n' comes before 1 occurrence of 't'.
The first occurrence of 'v' comes before 1 occurrence of 'i'.
The first occurrence of 'i' comes before 1 occurrence of 't'.
The first occurrence of 'n' comes before 1 occurrence of 's'
- What is the dictionary of distinct letters? This is given by the symbols used for the above rules. Therefore: i, n, e, v, t, s,
- Are there repetitions? yes, "2 occurrences of e" and "2 occurrences of i". So the letters are D={i, n, e, v, t, s, e, i} and the string has length 7 because all the rules are given
- So the solution is an anagram of D which respects the rules
I am too lazy to test all the rules, let's test the solution directly
http://www.facebook.com/seattle/invitees/
MemCached @ Facebook
Thursday, August 19, 2010
An hidden gem: Peter Norvig (Google) talks at Facebook about machine learning
It seems to me that the lesson suggested is to build "simple probabilistic models" applied to a lot of data, rather then to adopt complex models working on small datasets.
Puzzle: Slicing a pizza in 3 parts
Wednesday, August 18, 2010
Facebook UI Front-end technologies
Tuesday, August 17, 2010
Yahoo changes its API
"On October 1, 2010, we will close the SearchMonkey developer tool, gallery, and app preferences"
"The Yahoo! Site Explorer team is planning tighter integration between Site Explorer and Bing Webmaster Center to make the transition as smooth as possible for webmasters"
"Several search-related web services will continue to be supported, but strictly through YQL. These include the Yahoo! Term Extraction Web Service, Related Suggestion, and Spelling Suggestion. Other non-BOSS search APIs such as Web Search, Image Search, News Search and Site Explorer APIs will shut down with no further support in YQL"
"Search remains critical to Yahoo! and we’re happy to announce that we will continue to offer the BOSS program (Build your Own Search Service). In the not too distant future, BOSS will provide web and image search results from Microsoft along with other search-related services and content from Yahoo!, such as news"
Monday, August 16, 2010
A DryadLINQ Tool Kit for Information Retrieval
Sunday, August 15, 2010
Haystack the Facebook Fileserver
This is a typical solution when you have a lot of new objects inserted in the system (and a limited amount of delete operation). If you have many delete, you probably should consider something like the COSS (Cyclic Object Storage System) developed for Squid.
Saturday, August 14, 2010
Esemble of Classifiers (part I)
In this code skeleton, a generic Classifier class is defined which takes a generic input T1 type as input and classifies (transforms) into a generic type T2 (the categories). The generic classifier is then specialized into a weak classifier class with T1 = vector [integer] and T2 = bool. The weak classifier has some random noise which is used to generate mistakes.
A collection of weak classifiers will be then combined into an ensembles of classifiers, in the next days.
Friday, August 13, 2010
Thursday, August 12, 2010
China goes search mode
"Called "Search Engine New Media International Communications Co.," the joint venture will focus on creating a leading search engine while actively developing other businesses, including the internet, print media and advertising, said Zhou Xisheng, vice president of Xinhua, at a signing ceremony.
"Search engines, which have powerful information integration abilities, play an increasingly important role in disseminating information and influencing public opinion," he said."
Wednesday, August 11, 2010
K-Means Soft
Tuesday, August 10, 2010
Fresh Search tools we are working on
I heard that a new young Talent is becoming famous in U.S. but I was not sure about her name.
So I started to write "jacki", and Bing suggested how to complete the query.
When I selected the suggestion, Bing gave me the result. Lovely.
Monday, August 9, 2010
VectorSpace based on Boost::Matrix
Sunday, August 8, 2010
Out of band: but stunning (if Correct) P != NP
Saturday, August 7, 2010
Implement a tree visit with no recursion and no stack
Friday, August 6, 2010
Interesting TR about ranking technologies used by MSR
Thursday, August 5, 2010
Count the strings with 0 and 1
Scalability of findability: effective and efficient IR operations in large information networks
Wednesday, August 4, 2010
Boxes within Boxes
Google Wave is gone. Don't complain. Users must decide
Discovering the right advertising formula, Larry Page confessed to a Stanford class, “probably was an accident more than a plan.”
"But despite these wins, and numerous loyal fans, Wave has not seen the user adoption we would have liked. We don’t plan to continue developing Wave as a standalone product, but we will maintain the site at least through the end of the year and extend the technology for use in other Google projects."
Tuesday, August 3, 2010
Sort in linear time
Monday, August 2, 2010
Generic Programming and the STL: Using and Extending the C++ Standard Template Library
Sunday, August 1, 2010
How good is a span of terms?: exploiting proximity to improve web retrieval
How good is a span of terms?: exploiting proximity to improve web retrieval is an interesting paper by the way of Microsoft research, which extends traditional BM25 ranking with the idea of having a dynamically chosen span window. The optimal window is then selected using a machine learning approach based on LambdaRanking
Two interesting results showed in the paper are (a) the fact that head queries and tail queries need to have different span windows, and (b) the fact that sentences extracted by large collections such as Wikipedia produce limited benefit.
One assumption made by the paper is that the goodness of a document can be modeled as the sum of the span vectors describing each feature. Machine learning is then used to learn the weights in the associated linear combination. Not sure if the linear model is the most appropriate here.