Thursday, December 31, 2009

Sort a huge file

You are given a huge file with, say, 100 terabytes of integers. They don't fit in internal memory; what strategy would you use?
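
A standard strategy is external merge sort: sort memory-sized chunks into runs on disk, then k-way merge the runs. A minimal C++ sketch (the int payload, the file names, and the single-pass merge are my illustrative assumptions; at 100 TB you would merge in multiple passes and spread runs across machines):

#include <algorithm>
#include <fstream>
#include <functional>
#include <queue>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Phase 1: read memory-sized chunks, sort each in RAM, spill as sorted runs.
std::vector<std::string> make_runs(const char* input, size_t chunk) {
    std::ifstream in(input, std::ios::binary);
    std::vector<std::string> runs;
    std::vector<int> buf(chunk);
    while (in) {
        in.read(reinterpret_cast<char*>(&buf[0]), chunk * sizeof(int));
        size_t got = in.gcount() / sizeof(int);
        if (got == 0) break;
        std::sort(buf.begin(), buf.begin() + got);
        std::ostringstream name;
        name << "run" << runs.size() << ".bin";
        std::ofstream out(name.str().c_str(), std::ios::binary);
        out.write(reinterpret_cast<const char*>(&buf[0]), got * sizeof(int));
        runs.push_back(name.str());
    }
    return runs;
}

// Phase 2: k-way merge; the min-heap holds one (value, run index) pair per run.
void merge_runs(const std::vector<std::string>& runs, const char* output) {
    typedef std::pair<int, size_t> Item;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item> > heap;
    std::vector<std::ifstream*> files;
    int v;
    for (size_t i = 0; i < runs.size(); ++i) {
        files.push_back(new std::ifstream(runs[i].c_str(), std::ios::binary));
        if (files[i]->read(reinterpret_cast<char*>(&v), sizeof v))
            heap.push(Item(v, i));
    }
    std::ofstream out(output, std::ios::binary);
    while (!heap.empty()) {
        Item top = heap.top(); heap.pop();
        out.write(reinterpret_cast<const char*>(&top.first), sizeof(int));
        if (files[top.second]->read(reinterpret_cast<char*>(&v), sizeof v))
            heap.push(Item(v, top.second));   // refill from the run we popped
    }
    for (size_t i = 0; i < files.size(); ++i) delete files[i];
}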

Wednesday, December 30, 2009

Selecting the median in an array

Ok, this is a typical "easy" interview question: you are given an array of integers, select the median value. What is the optimal solution?
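
The classic answer is quickselect: expected O(n), with a worst-case O(n) variant (median of medians). In C++ the STL already ships it as std::nth_element; a minimal sketch (my example):

#include <algorithm>
#include <vector>

// Returns the (upper) median in expected O(n); std::nth_element is an
// introselect and partially reorders the vector in place.
int median(std::vector<int>& v) {
    std::vector<int>::size_type mid = v.size() / 2;  // assumes v is non-empty
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    return v[mid];
}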

Tuesday, December 29, 2009

A good tool for POS tagging, Named Entity Recognition, and chunking

YamCha is a good SVM-based tool for POS tagging, Named Entity Recognition, and chunking.

Monday, December 28, 2009

Introduction to Computational Advertising: Andrei Broder's course online

"Computational advertising is a new scientific discipline, at the intersection of information retrieval, machine learning, optimization, and microeconomics. Its central challenge is to find the best ad to present to a user engaged in a given context, such as querying a search engine ("sponsored search"), reading a web page ("content match"), watching a movie, and IM-ing. As such, Computational Advertising provides the foundations for building ad matching platforms that provide the technical infrastructure for the $20 billion industry of online advertising.

In this course we aim to provide an overview of the technology used in building online advertising platforms for various internet advertising formats."

Handouts of Andrei's course on Computational Advertising are now online.

Sunday, December 27, 2009

KD-Tree: A C++ and Boost implementation

According to Wikipedia, a kd-tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space. In this implementation, points are represented as a boost ublas matrix (numPoints x dimPoints) and the kd-tree can be seen as a row permutation of the matrix. The tree is built recursively. At depth k, the (k % dimPoints)-th coordinate is analyzed for all the points and its median is selected with a quickselect algorithm. The quickselect induces a natural row-index permutation that partitions the points into two sets, which are recursively processed at the next levels of the tree.

Here you have the code.

Saturday, December 26, 2009

Xmas present: Machine Learning: An Algorithmic Perspective

A very useful book for Machine Learning, with a lot of examples in python. Machine Learning: An Algorithmic Perspective is a must-have in your library.

Friday, December 25, 2009

MultiArray: a useful container.

MultiArray is a useful boost container. Think of a hierarchical container which contains other containers. You can also have recursive definitions, where each level of the container can hold other MultiArrays. Therefore a multi_array<float,2> is a matrix of floats (dim=2), and a multi_array<float,3> is a cuboid of floats (dim=3). Instead, multi_array<multi_array<float,2>,3> is an object that you cannot represent with a 'simple' matrix: we are representing a cuboid where each cell is a matrix of floats.

MultiArrays can be accessed with the traditional iterator pattern, or with the conventional bracket notation.

Maybe the most useful feature of MultiArray is the possibility to create views, which let you treat a subset of the underlying elements of a MultiArray as though it were a separate MultiArray.
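
A minimal usage sketch (the shapes and names are my own illustrative choices):

#include <boost/multi_array.hpp>

int main() {
    // A 4x3x2 cuboid of floats: multi_array<float,3>.
    boost::multi_array<float, 3> cube(boost::extents[4][3][2]);
    cube[0][1][1] = 3.14f;

    // A 2D view over the slab at first index 1: it shares storage with 'cube'.
    typedef boost::multi_array<float, 3>::index_range range;
    boost::multi_array<float, 3>::array_view<2>::type slab =
        cube[boost::indices[1][range(0, 3)][range(0, 2)]];
    slab[2][1] = 1.0f;  // writes through to cube[1][2][1]
    return 0;
}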

Thursday, December 24, 2009

Building an HTTP proxy for filtering content

I am very agnostic as far as programming languages are concerned.

C++ is a must for any production-ready code, such as online search code (and if you use STL and Boost you have a rich set of libraries). C# is useful for offline code, due to its rich set of libraries. Some people love Java for this stuff (e.g. Lucene, Hadoop, and other very good environments). Then, if you are into scripting for quick-and-dirty prototyping, you may want to use python (which has a strong type system) or perl (which is good for its vast set of modules, see CPAN).

So, I need to write an HTTP proxy for filtering some content. What is my choice? In this case, perl with the HTTP::Proxy module, which allows me to filter both headers and content.

Wednesday, December 23, 2009

A commodity vector space class

Sometimes you need to realize a vector space with dense vectors. Here is a commodity class.

PS: getting the sparse version is easy if you use Boost Ublas.

Tuesday, December 22, 2009

Is this... the real 3D Mandelbrot then?

Fascinating article. Are fractals a form of search? In my opinion, yes they are.
"The original Mandelbrot is an amazing object that has captured the public's imagination for 30 years with its cascading patterns and hypnotically colorful detail. It's known as a 'fractal' - a type of shape that yields (sometimes elaborate) detail forever, no matter how far you 'zoom' into it (think of the trunk of a tree sprouting branches, which in turn split off into smaller branches, which themselves yield twigs etc.).

It's found by following a relatively simple math formula. But in the end, it's still only 2D and flat - there's no depth, shadows, perspective, or light sourcing. What we have featured in this article is a potential 3D version of the same fractal
"

Monday, December 21, 2009

Facebook is growing big .. and I mean REALLY BIG!

Most Recent Facebook Common Stock Sale Values Company At $11 Billion, TechCrunch reported. I already reported on the Distribution of Facebook Users (~300M worldwide). Anyway, what is really impressive is that FB is generating a lot of incoming traffic for many other web sites (you know, when you embed some content). Very similar to what search engines are doing...

Sunday, December 20, 2009

Conditional Random Fields

Conditional Random Fields are a framework to build probabilistic models for segmenting and labeling sequence data, and are a valid alternative to HMMs. CRF++ is a valid package for training and applying CRFs.

Saturday, December 19, 2009

Social, Mobile and the Local search

Ok, Google is going to buy Yelp. Anyway, who cares about directories anymore? I am particularly impressed by the combination of Local, Social and Mobile. Oh yes, I mean foursquare. Check it out. I talk about real-time search for a reason.

Friday, December 18, 2009

SAT and hill-climbing

SAT is the classical NP-complete problem; now, what is the best-known algorithm for building an approximate solution? WalkSAT:

Pick a random unsatisfied clause
Consider 3 moves: flipping each of its variables
If (any move improves the evaluation)
{
    accept the best one
}
else
{
    with probability 0.5: take the least bad move
    with probability 0.5: pick a random move
}

Impressive, no? And what does this algorithm remind you of?
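
A self-contained C++ sketch of the loop above (the tiny formula, the clause encoding, and the step limit are my own illustrative choices; real implementations add random restarts):

#include <cstdlib>
#include <iostream>
#include <vector>

typedef std::vector<int> Clause;  // literals: +v means x_v, -v means NOT x_v

// True if assignment 'a' satisfies clause 'c'.
bool satisfied(const Clause& c, const std::vector<bool>& a) {
    for (size_t i = 0; i < c.size(); ++i)
        if ((c[i] > 0) == a[std::abs(c[i])]) return true;
    return false;
}

// Evaluation function: number of unsatisfied clauses (lower is better).
int unsat(const std::vector<Clause>& f, const std::vector<bool>& a) {
    int n = 0;
    for (size_t i = 0; i < f.size(); ++i) n += !satisfied(f[i], a);
    return n;
}

int main() {
    // (x1 OR x2) AND (!x1 OR x3) AND (!x2 OR !x3)
    std::vector<Clause> f(3);
    f[0].push_back(1);  f[0].push_back(2);
    f[1].push_back(-1); f[1].push_back(3);
    f[2].push_back(-2); f[2].push_back(-3);
    const int nvars = 3;

    std::vector<bool> a(nvars + 1);
    for (int v = 1; v <= nvars; ++v) a[v] = (std::rand() % 2) != 0;  // random start

    for (int step = 0; step < 1000 && unsat(f, a) > 0; ++step) {
        int c;  // pick a random unsatisfied clause
        do { c = std::rand() % int(f.size()); } while (satisfied(f[c], a));

        int best = -1, bestCost = unsat(f, a);      // improving move, if any
        int leastBad = -1, leastBadCost = 1 << 30;  // fallback move
        for (size_t i = 0; i < f[c].size(); ++i) {  // try flipping each variable
            int v = std::abs(f[c][i]);
            a[v] = !a[v];
            int cost = unsat(f, a);
            a[v] = !a[v];
            if (cost < bestCost) { bestCost = cost; best = v; }
            if (cost < leastBadCost) { leastBadCost = cost; leastBad = v; }
        }
        int flip = (best != -1) ? best                          // greedy step
                 : (std::rand() % 2) ? leastBad                 // p=0.5 least bad
                 : std::abs(f[c][std::rand() % int(f[c].size())]);  // p=0.5 random
        a[flip] = !a[flip];
    }
    std::cout << "unsatisfied clauses left: " << unsat(f, a) << std::endl;
    return 0;
}

It is essentially hill climbing on the number of satisfied clauses, with random noise to escape local optima.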

Thursday, December 17, 2009

Elgg: an open source social platform

I do encourage you to try the elgg platform. Ok, it is in PHP, which I don't like, but it is cool and rich, which I like. Consider that Facebook was written in PHP, after all.

Wednesday, December 16, 2009

Trackle

Have a look, I like the idea of tracking and sharing: Trackle.com

Monday, December 14, 2009

Neural Networks and backpropagation

I am studying NN and back-propagation once again. Is there any paper out there that compares the standard gradient descent update rule (delta rule) with the one where you add a momentum term?

Any other suggestions for different update rules?
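
For reference, a sketch of the two update rules in standard notation (my addition, not taken from a specific paper):

\Delta w(t) = -\eta \nabla E(w)                          (plain delta rule)
\Delta w(t) = -\eta \nabla E(w) + \alpha \Delta w(t-1)   (with momentum, 0 <= \alpha < 1)

The momentum term \alpha \Delta w(t-1) smooths successive steps, speeding up descent along ravines of the error surface and damping oscillations.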

Sunday, December 13, 2009

CTR and search engines

Chitika Research has released a report showing that users of Bing, the new search engine from Microsoft, click on ads 75% more often than Google users.

Good sign? Ask and AOL have higher CTRs than Bing.

Saturday, December 12, 2009

Google: a useful set of slides on building distributed systems

Designs, Lessons and Advice from Building Large Distributed Systems. A classical discussion about MapReduce and BigTable, plus an interesting overview of the upcoming Spanner.

Friday, December 11, 2009

Power Law and an Enrico Fermi's motivation

Power laws are very frequent in data mining. For instance, you can find a power law in many properties of the web graph. I found a quite interesting explanation of power laws in a model of particles and cosmic radiation (Enrico Fermi, 1949).

In this model, particles are produced at a constant rate, so the particles' age distribution at time t is an exponential of the form \lambda e^{-\lambda t} (in other words, there is a negative exponential contribution). In addition, particles gain energy as time passes, like C e^{\alpha t} (in other words, this gives a positive exponential contribution).

Under these two forces the particles will have a power-law energy density. Can you explain why?
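
For reference, one way to see it (my sketch, using the notation above): a particle of age t has energy E = C e^{\alpha t}, i.e. t = (1/\alpha) \ln(E/C). Changing variables in the exponential age density gives

p(E) = \lambda e^{-\lambda t} |dt/dE| = \frac{\lambda}{\alpha C} (E/C)^{-(1 + \lambda/\alpha)}

a power law with exponent 1 + \lambda/\alpha.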

Thursday, December 10, 2009

Google is running fast on News (Please do not forget History)

I do have a lot of respect for Google's latest initiatives in News. "Living Stories" is a very cool prototype: "Today, on Google Labs, we're unveiling some of the work we've done in partnership with two world-class news organizations: The New York Times and The Washington Post. The result of that experiment is the Living Stories prototype, which features new ways to interact with news and the quality of reporting you've come to expect from the reporters and editors at The Post and The Times."

The key idea is that: "Each story has an evolving summary of current developments as well as an interactive timeline of critical events. Stories can be explored by themes, significant participants or multimedia."

Google, I do like the idea of a collaboration with NYT and WP. Anyway, the idea of following the evolution of a story is not new. It has been published in a patent, Systems and methods for clustering information, and used live by Ask.com as the 'History' feature:

"A clustering algorithm is used to cluster the information according to the selected window of time .omega.. New clusters can be periodically linked to chains, or new topic clusters can be identified, periodically. The new clusters are compared to other clusters to discover similarities in topic. When similarities are found among clusters in different time windows, the clusters are linked together to form a chain or are added to a preexisting chain. This comparison with clusters in previous time windows can stop if no similar information is found for a period of time proportional to the extension of the current cluster or to an extension of the chain. The chain of clusters is organized in a hierarchy according to the temporal information of each cluster: the most recent cluster is typically displayed at the top of the chain and the oldest cluster is typically displayed at the bottom of the chain."

In Feb 2008, this article gave a positive judgment of the 'History' feature: "My vote – thumbs up. Ask.com has done well integrating social media with news in a clean, easy-to-use interface. Features like story history could save me time"

Wednesday, December 9, 2009

What are the zettabytes?

An interesting study about the information consumed in the U.S.

"In 2008, Americans consumed information for about 1.3 trillion hours, an average of almost 12 hours per day. Consumption totaled 3.6 zettabytes and 10,845 trillion words, corresponding to 100,500 words and 34 gigabytes for an average person on an average day. A zettabyte is 10 to the 21st power bytes, a million million gigabytes. These estimates are from an analysis of more than 20 different sources of information, from very old (newspapers and books) to very new (portable computer games, satellite radio, and Internet video).

Video sources (moving pictures) dominate bytes of information, with 1.3 zettabytes from television and approximately 2 zettabytes of computer games. If hours or words are used
as the measurement, information sources are more widely distributed, with substantial amounts from radio, Internet browsing, and others. All of our results are estimates.

Previous studies of information have reported much lower quantities. Two previous How Much
Information? studies, by Peter Lyman and Hal Varian in 2000 and 2003, analyzed the quantity of original content created, rather than what was consumed. A more recent study measured consumption, but estimated that only .3 zettabytes were consumed worldwide in 2007.
Hours of information consumption grew at 2.6 percent per year from 1980 to 2008, due to a combination of population growth and increasing hours per capita, from 7.4 to 11.8. More surprising is that information consumption in bytes increased at only 5.4 percent per year. Yet the capacity to process data has been driven by Moore’s Law, rising at least 30 percent per year.

One reason for the slow growth in bytes is that color TV changed little over that period. High-definition TV is increasing the number of bytes in TV programs, but slowly.
The traditional media of radio and TV still dominate our consumption per day, with a total of 60 percent of the hours. In total, more than three-quarters of U.S. households’ information time is spent with noncomputer sources. Despite this, computers have had major effects on some aspects of information consumption. In the past, information consumption was overwhelmingly passive, with telephone being the only interactive medium.

Thanks to computers, a full third of words and more than half of bytes are now received interactively.

Reading, which was in decline due to the growth of television, tripled from 1980 to 2008, because it is the overwhelmingly preferred way to receive words on the Internet."

Tuesday, December 8, 2009

Google is moving into Real-Time search

I am quite impressed by what Google is doing in the real-time field, where they claim to monitor about 1 billion fast-moving URLs.

Monday, December 7, 2009

Yahoo! and Microsoft cement 10-year search deal

Yahoo! and Microsoft have finalised the terms of their search agreement, five months after announcing the deal.

From The Telegraph
"Microsoft’s Bing and Yahoo! search are pooling their efforts in order to try and take on the dominance of Google in the search market. According to Net Applications’ most recent global figures, Google accounted for 85 per cent of all searches, while Bing took 3.3 per cent share and Yahoo! search accounted for 6.22 per cent of the total market. "

Sunday, December 6, 2009

Inverting a matrix?

Can you leverage the techniques from yesterday's question to invert a matrix?
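
A hint (my addition; this is the standard Newton–Schulz iteration): yesterday's scalar scheme generalizes to matrices as

X_{n+1} = X_n (2I - A X_n)

which converges quadratically to A^{-1} whenever \|I - A X_0\| < 1; a common starting guess is X_0 = A^T / (\|A\|_1 \|A\|_\infty).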

Saturday, December 5, 2009

Dividing a number with no division

This could be an interview question, but do not expect me to ask this question anymore ;-)

"Given a>0 can you compute a^-1 without any division?"

Friday, December 4, 2009

Google Customizes More of Its Search Results

This is an important step for Google and I do appreciate the initiative. Personalized search has been one of the hot topics in academia, with a lot of people trying to obtain a personalized version of PageRank. Well, personalization is more about dynamic ranking and not about static ranking. Other search engines are also adopting personalized forms of ranking, but here Google is raising the bar.

"For many of its users, Google offers Web search results that are customized based on their previous search history and clicks. For example, if someone consistently favors a particular sports site, Google will put that site high in the results when they look up sports topics in its search engine.

But there has always been one catch: people had to be signed in to a Google account to see such customization.

On Friday Google said it was extending these personalized search results to people who are not logged into the service."

Thursday, December 3, 2009

Quoting Wikipedia, which cites me and Alessio

This is a study that Alessio and I carried out back in 2004. Alessio is now Director of Search at OneRiot, the real-time search engine behind Yahoo!. Wikipedia is now citing us.

A more recent study, which used Web searches in 75 different languages to sample the Web, determined that there were over 11.5 billion Web pages in the publicly indexable Web as of the end of January 2005.[61]


Wednesday, December 2, 2009

Search capitalization


This is a one-year comparison of market capitalization as of 1 Dec 2009. I used Apple as a benchmark, since that tech stock had a great performance this year.

Tuesday, December 1, 2009

Web Search signals

Search Engine Watch just published an article about 'new' search signals. Well, I don't believe these are new signals, since they are quite well discussed in already published search patents (SEO by the Sea is a great source for search patents). In addition, query session data analysis is missing from the list, and that is another very important class of signals.

Monday, November 30, 2009

Minimize a quadratic function

How do you minimize the quadratic function f(x) = 1/2 x^T A x + b^T x + c, subject to the constraint Vx = d, where T is the usual transpose operator, A is a matrix, and b, c, d are vectors?

A pretty elegant application of the Karush–Kuhn–Tucker conditions.
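
For reference, a sketch of how the KKT conditions play out here (standard result; I am assuming A is symmetric, otherwise replace it with (A + A^T)/2): with the Lagrangian

L(x, \lambda) = \frac{1}{2} x^T A x + b^T x + c + \lambda^T (V x - d)

stationarity and feasibility reduce the problem to a single linear system:

A x + V^T \lambda = -b    (\nabla_x L = 0)
V x = d                   (constraint)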

Sunday, November 29, 2009

Vivisimo, SnakeT, Ask, Bing and now Google side bar

Vivisimo and SnakeT experimented with it. Bing made it for real. Ask3D proposed it and then left it to become more Google-like. Now Google wants to be similar to Ask3D and Bing. Quoting Pandia: life is a circle. In any case, categorized search with an automatically generated sidebar was an important part of my Ph.D. thesis, back in 2004-2005. This is about being a bit earlier than others. SnakeT is no longer maintained, but it is still working from the university days.

"Unlike Bing and Yahoo, Google does not have a permanent left hand sidebar with additional links for more narrow searches. Instead there is a link at the top of the page called “Show options”.

Click on it and Google will add a sidebar which helps you refine your search query. You may, for instance, limit your search to new pages from the last hour.

Search Engine Land reports that Google will change it’s search result pages next year and give them a more coherent look and feel.

Most importantly: It seems the sidebar will become a permanent feature on all search result pages.

Google sidebar

The sidebar will include links to Images, News, Books, Maps and “More”, as well as related searches and links that let you limit the search to a specific time period.

Google will give you the alternatives (or “modes”) it thinks is most relevant to your search.

Ask.com launched search result pages like this in 2007. Because of this Ask.com became one of our favorite search engines. Ask later abandoned its “3D” search in order to become more like Google!"



Thursday, November 26, 2009

Search Layers

Search engines tend to organize indexes in layers. The intuition is that if "enough good" documents are retrieved from the i-th layer, then you don't need to go to the (i+1)-th layer. Questions:

1. how do you partition the documents?
2. what is "good"?
3. what is "enough"?

Please formalize your answers.

Wednesday, November 25, 2009

Distribution of Facebook Users (~300M world wide)


Facebook claims that they have more than 300 million users worldwide. I sampled their ads user database and found the following geographical distribution of users:
  1. 34.32% are in U.S.
  2. 8.15% are in U.K.
  3. 5.05% are in France
  4. 4.99% are in Canada
  5. 4.50% are in Italy
  6. 4.42% are in Indonesia
  7. 2.65% are in Spain
Here you can find the distribution of all the other nations.

Monday, November 23, 2009

Weights and Scale: a variation.

You have 6 weights which appear almost identical, but 3 of them are slightly heavier than the remaining 3. The 6 weights have 3 colours: red, green and blue. For each colour there is a pair of weights, and one weight in each pair is slightly heavier than the other. How many times do you need to use a scale with two plates to identify the lighter weights?

Sunday, November 22, 2009

Ranking teams

Suppose that you have N teams playing a football league. Each team has a rank parameter r_i > 0, and assume there are no ties. Questions:
  1. What is the probability for team i to win on team j?
  2. What is the probability of the whole season (each team plays against the remaining ones)?
  3. Find an algorithm to rank the teams
Please provide your ideas.
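
A hint for question 1 (my addition; this is the classic Bradley–Terry model):

P(i beats j) = \frac{r_i}{r_i + r_j}

Assuming matches are independent, the probability of a whole season is the product of such terms over all played pairs, and maximizing this likelihood over the r_i given observed outcomes yields a ranking algorithm.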

Saturday, November 21, 2009

Random Decision Trees

Random Decision Trees are an interesting variant of Decision Trees. Here the key elements:
  1. Different training sets are generated from the N objects of the original training set by using a bootstrap procedure which randomly samples the same example multiple times. Each sample generates a different tree, and all the trees are seen as a forest;
  2. The random trees classifier takes the input feature vector, classifies it with every tree in the forest, and outputs the class label that received the majority of "votes".
  3. Each node of each tree is trained on a random subset of the variables. The size of this set is a training parameter (in general sqrt(#features)). The best split criterion is chosen considering just the randomly sampled variables;
  4. Due to the above random selection, some training elements are left out for evaluation. In particular, for each left-out vector, find the class that got the majority of votes in the trees and compare it to the ground-truth response.
  5. The classification error estimate is computed as the ratio of misclassified left-out vectors to all the vectors in the original data.
As you know, I love the idea of Boosting (such as AdaBoost), and the ideas behind Random Decision Trees are quite intriguing too.

Friday, November 20, 2009

Bing UK -- Out Of Beta Tag For Handling Search Overload In The UK

Nice article here

It is true that the Internet has grown drastically in the past few years and has become more complex, but search engines are still on the verge of evolution. In order to make search engines a more reliable information resource for users, Microsoft launched Bing in June 2009.

Bing was launched under a Beta tag in the UK. Microsoft at that time promised to remove the tag only under one condition: if its experience would be different from the competition and if the results would outperform in terms of UK relevancy.

The Bing team reached its objective on November 12, 2009, and the credit goes to the London-based Search Technology Center. Microsoft says that the 60 engineers behind the project in Soho have done an extensive job at localizing the Bing global experience for UK users in just 5 months.

Thursday, November 19, 2009

Is this the future of Search?

The word search is never mentioned in this video, but what this guy shows is all about Visual Search. Is this our future? It is going to be open-source and the current cost is 300 dollars.

WOW.


Scaling Internet Search Engines: Methods and Analysis

A great Ph.D. thesis to read if you are interested in search engine architecture. The main focus is how to build scalable indexes distributed on thousands of search nodes. A lot of low-level details, such as building inverted indexes, suffix arrays, distributed caching systems, query optimization, etc.
This technology was adopted by FAST and later on by Yahoo.

Wednesday, November 18, 2009

Visual Studio and Parallel computation

Visual Studio 2010 Beta 2 added extensive support for parallel (multi-core) computation, with an impressive debugger and profiler. This article gives a good introduction.

Tuesday, November 17, 2009

Bing Gains Another Half Point Of Search Share In October

  • During October, Bing represented 9.9% of the market, up from 9.4% in September, according to comScore.
  • Yahoo got slammed, losing almost a full percentage point of the market, to 18.0%, down from 18.8% in September.
  • Google gained a bit of share, to 65.4% in October, up from 64.9% in September.

Total search volume increased 13.2% in October, below the 17.3% growth in September.

More information here

Monday, November 16, 2009

Bing - Out Of Beta Tag For Handling Search Overload In The UK

"The Bing team reached its objective on November 12, 2009 and the credit goes to London-based Search Technology Center. Microsoft says that 60 engineers behind the project in Soho have done extensive job at localizing the Bing global experience for the UK users in just 5 months. The team worked under the direction of Jordi Ribas' has done a miraculous job by crunching historical algorithmic results for identifying local intents and including them in the Bing's categorized search."

More information here

Saturday, November 14, 2009

Directly Optimizing Evaluation Measures in Learning to Rank

Directly Optimizing Evaluation Measures in Learning to Rank

"Experimental results show thatthe methods based on direct optimization of evaluation measure scan always outperform conventional methods of Ranking SVM andRankBoost. However, no significant di fference exists among the performances of the direct optimization methods themselves."

In this case, my preference goes to AdaRank for its simplicity and the clear understanding of the key intuitions behind it.

Friday, November 13, 2009

A taxonomy of classifiers

I found this taxonomy of classifiers useful:

  1. Statistical

  2. Structural

Classifiers are generally combined in ensembles of classifiers. Many of the above methods are implemented in OpenCV.

Thursday, November 12, 2009

A collection of public works on Learning to Rank from Microsoft

Nowadays, Learning to Rank is quite a popular subject. Search engines are no longer using only conventional random walks on web graphs (e.g. PageRank and similar algos); they also learn from past user queries and clicks. This is a collection of recent public works @ Microsoft:
  • RankNet [2005]
  • LambdaRank [2006-2009], works directly on optimizing DCG
  • BoostTreeRank [2006], ranking as a classification problem and uses boosting
  • LambdaMART [2009], which combines the above with boosting and regression trees, and allows you to have submodels

Wednesday, November 11, 2009

GO a new language from Google

Go is a new language from Google. Rob Pike and Ken Thompson are behind it (I believe they are gods for programmers ;-). At first glance, Go sits in the middle between C++ and Python.

I am trying to understand a bit more about the language. Garbage collection is there, type inference is there, lambdas/closures are there, but where are modern things like generics/collections and exceptions?

Generics are a commonly accepted programming paradigm that any modern programmer uses (C++ has them, Java has them, etc.)

BTW, there was already a programming language called Go, and Google missed it?

Tuesday, November 10, 2009

Reorder an array in the minimum number of steps

You have the array 876501234. The numbers to the right of the 0 are given in ascending order (1234); the numbers to the left of the zero are given in descending order (8765). 0 is an empty position. A number can move either into an adjacent empty position, or it can jump over one single number and land in an empty position. Numbers cannot "move back" (e.g., the ascending numbers can only move from right to left, the descending ones vice versa). You want to get the array 123408765 with the minimum number of moves.

Monday, November 9, 2009

DBSCAN clustering algorithm

DBSCAN is a well-known clustering algorithm, which is easy to implement. Quoting Wikipedia: "Basically, a point q is directly density-reachable from a point p if it is not farther away than a given distance ε (i.e., is part of its ε-neighborhood), and if p is surrounded by sufficiently many points such that one may consider p and q be part of a cluster.... For practical considerations, the time complexity is mostly governed by the number of getNeighbors queries. DBSCAN executes exactly one such query for each point, and if an indexing structure is used that executes such a neighborhood query in O(logn), an overall runtime complexity of O(n \cdot \log n) is obtained."

Some DBSCAN advantages:
  1. DBScan does not require you to know the number of clusters in the data a priori, as opposed to k-means.
  2. DBScan can find arbitrarily shaped clusters.
  3. DBScan has a notion of noise.
  4. DBScan requires just two parameters and is mostly insensitive to the ordering of the points in the database
The most important DBSCAN disadvantages:
  1. DBScan needs to materialize the distance matrix to find the neighbors. This has complexity O((n² − n)/2), since only the upper triangle is needed. Within the distance matrix, the nearest neighbors can be detected by selecting a tuple with minimum functions over the rows and columns. Databases solve the neighborhood problem with indexes specifically designed for this type of application. For large-scale applications, you cannot afford to materialize the distance matrix.
  2. Finding neighbors is an operation based on distance (generally the Euclidean distance), and the algorithm may run into the curse of dimensionality problem.

Here you have DBSCAN code implemented in C++, boost and STL.

Sunday, November 8, 2009

Invalidating iterators

If you use iterators to access a std::vector, can you modify the vector? Surprisingly enough, a lot of C++ programmers do not know the answer to this question. Therefore, they don't know how to avoid the problem of invalidated iterators. What about you?
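
A minimal sketch of the trap and of the idiomatic way out (my example):

#include <vector>

int main() {
    std::vector<int> v(1, 1);
    std::vector<int>::iterator it = v.begin();
    v.push_back(2);  // may reallocate: 'it' is now invalid
    // *it;          // undefined behavior if reallocation happened

    // Idiomatic erase loop: erase() invalidates iterators from the erased
    // point on, but returns the next valid iterator.
    for (std::vector<int>::iterator i = v.begin(); i != v.end(); ) {
        if (*i == 1) i = v.erase(i);
        else ++i;
    }
    // Other options: use indexes, or reserve() capacity before taking iterators.
    return 0;
}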

Saturday, November 7, 2009

A tutorial on static polymorphism

A good tutorial on template-based "compile-time polymorphism". It shows how the technique can compete with traditional run-time polymorphism, and how the two can work together.

Friday, November 6, 2009

Generic B-Tree

A generic implementation of a B-tree. Here is the skeleton of the code.

Thursday, November 5, 2009

C++ Polymorphism, static or dynamic inheritance (part II)

Just a little extension of the previous post about static polymorphism. Here, I define two different types of distances:
  • Euclidean distance
  • Cosine distance
Both of them work on generic vector types. In the example, I leveraged boost dense vectors, but you can play with other types of vectors as well.

Here is the code.

Tuesday, November 3, 2009

C++ Polymorphism, static or dynamic inheritance

Traditional C++ programmers are used to thinking that everything should be modelled as a collection of classes and objects, organized in a hierarchy.

Inheriting is a great thing, but sometimes you don't want to pay the overhead of a virtual method. I mean, anytime you override a method, the compiler will add a new entry in the virtual table, and at run time a pointer must be dereferenced. When performance is crucial, or when you call a method many times in your code, you may want to save this cost.

In these situations, modern C++ programmers prefer to adopt a kind of compile-time variant of the strategy pattern. At compile time, the appropriate class and method are selected during template instantiation. In this sense, policy-based design is very useful when you want to save the cost of the virtual table.

In this example, I wrote a distance class and a distance method which is potentially called very frequently by any clustering algorithm. The method can be either a EuclideanDistance or any different kind of distance. And there is no need for any additional virtual table; every choice is made statically at compile time. A minimal sketch of the pattern follows.
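
The sketch below is my own minimal example of the idea, not the post's linked code:

#include <cmath>
#include <cstddef>
#include <vector>

// Policy: each distance is a type with a static apply(); no virtual table,
// so the call is bound at compile time and can be inlined.
struct EuclideanDistance {
    static double apply(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i)
            s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }
};

// Any algorithm is parameterized by the policy:
template <typename Distance>
double distance(const std::vector<double>& a, const std::vector<double>& b) {
    return Distance::apply(a, b);  // resolved statically at instantiation
}

// usage: double d = distance<EuclideanDistance>(v1, v2);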

Here you find the code.

Sunday, November 1, 2009

Can Google Stay on Top of the Web?

Worth reading.

1. Microsoft's (MSFT) new Bing search engine picked up 1.5 percentage points of market share in August to hit 9.5%, according to market researcher Hitwise, while Google's share fell from 71.4% to 70.2%.

2. But longer term, Twitter, Facebook, and related services may pose a more fundamental threat to Google: a new center of the Internet universe outside of search. Twitter, now with 55 million monthly visitors, and Facebook, with 300 million, hint at an emerging Web in which people don't merely read or watch material but communicate, collaborate with colleagues, and otherwise get things done using online services.

3. Meanwhile, Google's very success and size are starting to work against it. In the past year the company has been the target of three U.S. antitrust inquiries and one in Italy. Most recently the Justice Dept. on Sept. 18 said Google's controversial settlement with authors and publishers, which would have allowed it to scan and sell certain books, must be changed to avoid breaking antitrust laws. Even Google's own paying customers—advertisers and ad agencies—say they're eager for alternatives to blunt Google's power. Says Roger Barnette, president of search marketing firm SearchIgnite: "People want a No. 2 that has heft and scale."

4. Most of the search quality group's contributions are less visible because its work is focused mostly on the underlying algorithms, the mathematical formulas that determine which results appear in response to a particular query. Google conducts some 5,000 experiments annually on those formulas and makes up to 500 changes a year

Friday, October 30, 2009

How do you evaluate a search engine performance?

Cumulated Gain Based Indicators of IR Performance is a very interesting survey on the topic. I would suggest reading it to anyone interested in search engine metrics.

Thursday, October 29, 2009

a Facebook Fan Page for any site on the web?

OpenGraph is coming. "Ethan Beard laid out the Open Graph as essentially a Facebook Fan Page for any site on the web. So you can imagine that you might be able to create a Facebook-style Wall to include on your site, but also be able to update your statuses from your site, leave comments, like items, etc. Again, it's like a Facebook Page, but it would be on your site. And you can only include elements you want, and leave out others."

Wednesday, October 28, 2009

Twitter Growth (by way of Alessio)

Alessio started his own blog and the first data point is really interesting: Twitter receives 26M tweets per day, and 22% of them contain URLs.

"According to my most recent studies Twitter currently receives about 26 Million tweets per day. It is impressive, especially if you consider that in January 2009 they were hovering around “only” 2.4 Million daily tweets!"

Alessio is the Director of Search @ OneRiot. See my previous post.

Tuesday, October 27, 2009

Yahoo goes real time with Oneriot

Hey Alessio, I am very proud of you, my friend (if this rumor is confirmed).

"Techcrunch said Tuesday that Yahoo is planning to partner with OneRiot, which operates a real-time search engine and develops browser add-ons that do pretty much the same thing. The possible deal comes on the heels of separate plans announced by Microsoft and Google last week to integrate Twitter pages into their search results."

Sunday, October 25, 2009

Similarities: what is better?

Similarity functions are at the core of many machine learning and data mining algorithms (hmm, not just clustering and recommendation systems). There are many similarity measures out there.

What is the best one?

It depends. Anyway, cosine similarity showed very good behaviour in a large-scale experiment run by Google in the paper "Evaluating Similarity Measures: A Large-Scale Study in the Orkut Social Network". Other measures were evaluated, such as L1-norm, Pointwise Mutual Information, Pointwise Mutual Information with negative feedback, TF*IDF, and LogOdds. The dataset for the experiment was Orkut, and 4,106,050 community pages with recommendations were considered. The cosine measure was the best one in terms of finding correct correlations between recommendations. I am pretty sure that different measures can have different performances on other datasets. Anyway, this is another example of why I love to KISS.
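
For reference, the cosine similarity used above is the standard

\cos(u, v) = \frac{u \cdot v}{\|u\| \|v\|}

i.e., the dot product of the two vectors normalized by their lengths.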

Keep it simple baby, KISS.

Saturday, October 24, 2009

Starting to be interested in Ad Research

Ad research is one of the topics I know least. So it has been a pleasure to read this post by Muthu @ Google, with a lot of useful tutorials, such as ads 101: ad exchanges, real-time bidding.

Friday, October 23, 2009

Yahoo: Carl Icahn Says His Work Is Done, Resigns From Yahoo’s Board

What would be the forecast for the stock? Hard to say

"It’s hard to determine whether Icahn is throwing in the towel on Bartz or this is actually a vote of confidence. If he really believes in where Bartz can take the company after the search deal is done, then you’d think he’d keep his board seat to have a stronger influence on the company’s direction. But Icahn has always been a transaction-oriented investor. He tries to push companies to do things that will move the stock in a big way, and then he takes his profits and he leaves."


Thursday, October 22, 2009

Twitter: anyone wants to have it.

Bing made a deal; Google made a deal; and then there are tons of real-time search engines out there. It is so cool to be real time these days.

The Web was open in the past: you went out there and you crawled the public content.
It is no longer like it used to be: Facebook and Twitter are the fastest-growing part of the Web, and their content is not public. They don't need to make it public to attract traffic from search engines, the way traditional online newspapers do.

So they keep their content private and they sell it. Is this good for the community? I don't think so. Is this good for Facebook and Twitter investors? Very, very much.

Wednesday, October 21, 2009

Facebook numbers, quite impressive I say

How impressive? Schroepfer threw out some huge numbers. Among them:

  • Users spend 8 billion minutes online every day using Facebook
  • There are some 2 billion pieces of content shared every week on the service
  • Users upload 2 billion photos each month
  • There are over 20 billion photos now on Facebook
  • During peak times, Facebook serves 1.2 million photos a second
  • Yesterday alone, Facebook served 5 billion API calls
  • There are 1.2 million users for every engineer at Facebook
And they are doing this with just PHP, MySQL and memcached?

nahhhh

Tuesday, October 20, 2009

Little game to make your friends think you are a wizard

Ask your math pal to think of a math formula, or better, a polynomial of quadratic order p(x). Ask her not to disclose it, but to compute p(0), p(1), p(2) and give you the results. How can you surprise your pal by guessing the correct p(x)?
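
A hint (my addition; standard finite differences): a quadratic is uniquely determined by three points, and

p(x) = p(0) + (p(1) - p(0)) x + \frac{p(2) - 2 p(1) + p(0)}{2} x (x - 1)

matches the three disclosed values at x = 0, 1, 2.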

Monday, October 19, 2009

Cutting pancakes..

What is the maximum number of pieces into which a pancake can be cut by n straight lines, each of which crosses all the others?

Sunday, October 18, 2009

Do you need a search book but you don't want to bother with so much math?

I definitely recommend the new Algorithms of the Intelligent Web. Very concise explanations of all the modern search, data mining, and machine learning techniques. Some of the contents of the book:
  • evaluation: precision, recall, F1, and ROC curves;
  • ranking: PageRank; DocRank; ClickThrough ranking.
  • classification: Naive Bayes, neural networks; decision trees, Bagging; Boosting, etc
  • collaborative filtering; recommendations;
  • clustering: k-means, ROCK, DBSCAN

Monday, October 12, 2009

Random walks for species multiple coextinctions

If a single species is lost, what is the impact on the other species? Is there a cascade effect? A strange answer comes from applying eigenvector computations in "Googling Food Webs: Can an Eigenvector Measure Species' Importance for Coextinctions?"

Sunday, October 11, 2009

Puzzle game with dice

On average, how many times do you need to roll a die before all six faces appear?
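
A hint (my addition; the classic coupon-collector argument): once k faces have been seen, a new one appears with probability (6-k)/6, i.e. after 6/(6-k) rolls in expectation, so

E[rolls] = \sum_{k=0}^{5} \frac{6}{6-k} = 6 (1 + 1/2 + ... + 1/6) = 14.7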

Saturday, October 10, 2009

C++ check list

So, you are in an interview. And well, I'm really upset when you don't know the pros and cons of defining a method:
  • operator or non operator
  • free function or class member
  • virtual or non virtual
  • pure virtual or virtual
  • static or non static
  • const or non-const
  • public, protected, private
And about arguments:
  • return by value, pointer, const pointer, reference, const reference
  • return const or non const
  • passing by value, pointer, reference
  • passing const or non const
And about constructors, destructors, copy:
  • private, public, implicit, explicit constructor
  • private, public, implicit, explicit copy operator
  • virtual , non virtual destructor
Do you know exactly how and when to use all the above?

Friday, October 9, 2009

50 Years of C++

The ACM Multimedia Center has 50 Years of C++ presented by Bjarne Stroustrup at an ACM DC Chapter meeting. A wonderful overview of the evolution of C++ given by the creator of the language.

Thursday, October 8, 2009

Top world-wide universities

Italy is not in the list. London has many universities in good positions.

Wednesday, October 7, 2009

DailyBeast and the power of good business deals

When I was at Ask.com, my team co-led the launch of the DailyBeast. One lesson I learnt from that experience: never underestimate the importance of a good business deal.

The DailyBeast received another wonderful prize. That is not just because there is some good technology behind it; it is mostly because they know what News is and they are focused on that business. I believe that technology is neutral, and the focus must come from business deals.

Congrats, Tina

very much deserved.

Tuesday, October 6, 2009

Well, you can't do this... Google vs Ask.com

If true it is funny. Auletta relates: "[Diller] said to Larry, 'Is this boring?' 'No. I'm interested. I always do this,' Page said. 'Well, you can't do this.' Diller said. 'Choose.' 'I'll do this,' Page said matter-of-factly, not lifting his eyes from his hand-held device. 'So I talked to [Google co-founder] Sergey [Brin],' Diller said."

Monday, October 5, 2009

Web Random Bits

My proposals for these problems. What are yours?

1. What about using a randomness extractor, considering Twitter search as a (high-entropy) source? There are applications like this based on RF noise, lava lamps, and similar funny sources. I would adopt either the MD5 or the SHA-1 hash function. This will give all the benefits of a crypto hash function, plus bias reduction.

2. Synching is a bit harder, since the query alone is not enough. Time is another important factor: the results you see at time t are potentially different from the ones seen at time t + \eps. The index can change, and the load-balancing mechanism in search can bring you to a different search cluster. I guess you must rely on some proxy which captures a snapshot of Twitter, and then you must synchronize on the query and the time. There are a couple of services like this out there.

Sunday, October 4, 2009

CopyCat:: Web Random Bits

I found these problems by Muthu (@Google) quite interesting and funny, so I am reposting them:

"We often need random bits. People toss a coin or use noise from transistors or whatever to obtain random bits in reality. Here are two tasks.
  • Devise a method to get random bits from the web. The procedure should not use network phenomenon such as round trip delay time to random hosts etc, but rather should be at application level. Likewise, the procedure should not be to go to a website which tosses coins for you. Rather, it should use some existing web phenomenon or content. Btw, quantifying/discussing potential biases will be nice.
  • Devise a method for two people to coordinate and get public random bits from the web. The method can't assume that two people going to a website simultaneously will see identical content or be able to synchronize perfectly.
Of course, it should be fast and fun."

Saturday, October 3, 2009

Tweetmeme provides a stat service on Twitter

Tweetmeme has some interesting stat services on Twitter:
  • Click-through data
  • Retweet data
  • Trees of retweet reach
  • Potential visibility
  • Influential users
  • Tweet Locations
  • Referring domains
  • User stats
This is useful for dynamic ranking of Twitter's users and postings.

Friday, October 2, 2009

Yummy chocolate table

You are given a chocolate table of m×n unit squares. You want to separate all the unit squares. What is the number of breaks you need? (Suppose that when you break the table, you must break either a full column or a full row.)

Wednesday, September 30, 2009

Ebay: online auctions are about search and vice versa

I believe that search and online auctions will be the same thing in the near future. I am just curious about what Hugh is doing there, and I have great expectations ;-)

Tuesday, September 29, 2009

Google as borg?

"Welcome to the cusp of a new decade and what I am officially calling The Age of the Goog. I am starting to see Google as the real life version of the Borg in that it is slowly moving along and without most people realizing it, simply assimilating us into their collective offerings of services."

searchnewz

Great feedback on a contribution I gave to the open-source coding community:

"The best thing about code is, Its simple, easy to implement any where else and all patterns in one place. I have never seen the simplicity of having all patterns in one place. I had been through Modern C++ programming and design patterns from Alexanderscu. You made more easier than that."
Cheers!
DJ

http://codingplayground.blogspot.com/2009/01/design-patterns-c-full-collection-of.html

Monday, September 28, 2009

Findory: an old post, I like to propagate it.

Findory was my model when I started to work on news search back in 2003. I had the pleasure of meeting Greg, and I enjoyed discussing with him. I like this post; it's worth reading.

Tuesday, September 22, 2009

Tricky puzzle

This puzzle is tricky. Mary writes two real numbers, one on each side (A and B) of a piece of paper. The numbers must be different. Then she gives the paper to John, who can select one side of the paper. Say he selects the A-side. He sees the number on the A-side and, without looking at the B-side, he must bet whether the number on the A-side is greater than the number on the B-side. They continue this game forever. Questions:

1. What is the probability of winning for John?
2. Is there any strategy better than break-even?

Monday, September 21, 2009

c++: giving an order to the initialization of static objects

When a static object uses another static object, there is no guarantee about the order of initialization. (A sketch of the classic remedy follows the questions below.)
  • How can you guarantee that an object is allocated before another one?
  • What if you need to give an order in the de-allocation process of static objects?
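
For reference, the standard answer to the first question is the "construct on first use" idiom: hide the static behind a function, so it is initialized on the first call. A minimal sketch (my example):

#include <iostream>

class Logger {
public:
    void log(const char* msg) { std::cout << msg << std::endl; }
};

// Construct-on-first-use: the function-local static is initialized the first
// time control passes through, so every caller finds it ready.
Logger& logger() {
    static Logger instance;
    return instance;
}

// A static object elsewhere can now safely use it during its own construction:
struct Client {
    Client() { logger().log("client up"); }
} client_instance;

int main() { return 0; }

De-allocation is the thorny part: function-local statics are destroyed in reverse order of construction, so a static destructor that calls logger() after it has been destroyed is still a risk (one reason some codebases deliberately leak such singletons).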

Sunday, September 20, 2009

c++ Insulation (IV)

So let's talk about good reasons to adopt insulation. Suppose you define a class; then you need to modify it, for instance, to add a bool member. Any client of that class is forced to recompile. Not a good thing when you manage a large project, distributed across many different locations.

Let's see other situations where you can have dependencies.
  • default arguments. If you use default arguments in your methods, and then you decide to change them, your clients will be forced to recompile.
  • enumerations. If you put your enumerations in your .cpp file, they are hidden from the client. If you put them in your .h file, you introduce a dependency.

Saturday, September 19, 2009

A probabilistic model for retrospective news event detection

A probabilistic model for retrospective news event detection is a 2005 paper that I still consider very valid. The problem faced is to detect when an event happened in the past. This can be very useful for building a chronicle service. The model proposed is a probabilistic model which incorporates both content and time information in a unified framework. Model parameters are estimated using Maximum Likelihood, where each article is characterized by four independent probabilities: namely time, person, location and keywords. ML is estimated by using the classical EM algorithm. Tests are run on the TDT dataset.

Friday, September 18, 2009

c++ Insulation (III)

These are two of the most unexpected parts of C++.
  • Inline methods. Inlining is an easy yet powerful way of improving C++ performance. The problem is that if you inline a method, then your client will replace each call to your method with the body of the method itself. This means the client takes a very strong dependency on you, and its size can increase a lot. There is a cost to pay for it. Just be sure of who is in your bed, and don't accept people you just met.
  • Private members. Ok, this is very nasty at first glance. I'll tell you in one shot: if you declare some private member, and someone is inheriting from you, that guy has a dependency on you. This is very counter-intuitive, I know. You may think: that is private, how come he knows something about my private life? My good friend Giovanni (a real C++ guru) once told me: private is about access, not about knowledge. You see it, but you are not allowed to access it. A stellar story. Augh Giovanni, you made my day. So you have a dependency on it. My question: how do you avoid it? What is the cost of your choice?
Tomorrow morning I will post the answer.

Addendum: can you identify other reasons why insulation is good and dependencies are bad?
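
(A sketch of the usual way out, the pimpl idiom, assuming that is where the promised answer was heading: private members move behind an opaque pointer, so clients see only a name.)

// handle.h -- clients include only this; the private layout lives in the .cpp
class HandleImpl;            // forward declaration: no layout, no dependency

class Handle {
public:
    Handle();
    ~Handle();
    void doWork();
private:
    HandleImpl* pimpl_;      // changing HandleImpl never recompiles clients
};

The cost of the choice: one extra heap allocation and one pointer indirection per access.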

Thursday, September 17, 2009

c++ Insulation (II)

So did you get the answer?

  • HasA/HoldsA: if you embed a pointer or a reference to an object, you don't need to know the physical layout of that object. There is no dependency apart from the name of the object itself, so a forward declaration is enough. No more dependency, no more cats to follow. I like cats; I don't like having them come to me just for milk ;-). In the code fragment below, you don't need to include any .h file for the declaration of class B: no dependency at all.
  
class B;          // forward declaration: the name is all we need
class A {
    B* pBobj;     // HoldsA: a pointer's size is known without B's layout
    B& rBobj;     // HasA by reference: same, only the name of B is required
};

Wednesday, September 16, 2009

c++ Insulation (I)

Back to the old coding series. I want to discuss insulation a bit, which is essential for large C++ projects. Many people enjoy encapsulation as a fundamental good practice of OO programming. Insulation is a bit less known.

Basically, insulation is a way to reduce compile-time dependencies in code (my own practical definition). What's bad about dependencies? Well, when you have a dependency, you introduce additional time to compile the unit you depend on. So you may say: hey, wait a minute, if I depend on that piece of code, it means that I need it. In C++ this is not always the case. There are some (implicit) dependencies that you may want to avoid and that you should be aware of. If you apply these suggestions, your coding style will improve a lot. Let's see some of them:
  1. Include files. Whenever you include a .h, you depend on that code. Ok, ok, you say: if I include it, I need it. Sure; but what about including other .h files in your .c file instead? Why should what you include, in turn, be included by whatever includes your .h file? Stop unnecessary dependencies;
  2. Inheritance. Books say any good OO system should leverage inheritance. I say: don't believe the hype. Use inheritance with care. In a lot of situations generics and templates are more efficient. BTW, when you inherit from a class, you introduce a dependency on it. Sometimes you need it, sometimes you don't. Just be aware: everything has a cost. Don't go in that direction just listening to your heart without following your brain;
  3. Layering. When your class (HasA) another user-defined type, you have a dependency on it. Again: hey, if I embed that object, it means that I need it. Correct. So what can you do to avoid it?
The answer to this question, and to many other situations, will be given in another post.

Monday, September 14, 2009

Microsoft Bing going Visual


Give it a try.

"Until now, I really hadn’t had much reason to switch to its Bing decision engine, which launched back in May, for my Web searching needs. Google was doing just fine. For a while, I was making an effort to use Yahoo but Google somehow always became the default.", ZDNET

Friday, September 11, 2009

Predicting query trends

Google just published a very interesting article, "On the predictability of Search Trends", where past queries are used to predict the trends of future queries. I like this paper.

"Specifically, we have used a simple forecasting model that learns basic seasonality and general trend. For each trends sequence of interest, we take a point in time, t, which is about a year back, compute a one year forecasting for t based on historical data available at time t, and compare it to the actual trends sequence that occurs since time t. The error between the forecasting trends and the actual trends characterizes the predictability level of a sequence, and when the error is smaller than a pre-defined threshold, we denote the trends query as predictable."

This is a general approach that you can use in many contexts. For instance, I have seen it used to understand the coverage and the precision of firing a time-based vertical result (such as news, blogs, twitter) into the SERP.

Another observation about the paper: you can better predict whether a query has a predictable trend by enriching your query log with other temporal data sources such as Twitter, News and blogs.

Tuesday, September 8, 2009

Data is the king, not algorithms

In many years working in search, there are only a few constant things that I have observed when we talk about quality. One of them is: "Data is the king, not algorithms".

If you want to improve search quality, quite often you need more data to analyze, and you do not necessarily need a better algorithm. The best situation is when you are able to "contaminate" or "enrich" your data with other information coming from different domains.

Say you are working on Web search quality: you can get a huge help from other domains such as News, blogs, DNS, Images, videos, etc. You can use these additional data sources to extract signals that improve Web search itself.

In many situations, a more sophisticated algorithm will not provide the same impact as an additional data source.

Tuesday, September 1, 2009

Search stats, how the search market is growing.

An interesting article from MarketWatch about search growth trends.

"Google Inc. saw its share of the worldwide search market grow 58% in July compared to the same month last year, while Yahoo Inc. saw only slight growth and Microsoft Corp. saw its own share increase 41%, according to data published Monday by comScore Inc."

Monday, August 31, 2009

Free the H-1Bs, Free the Economy

"I have a suggestion for our President on how to boost economic growth without spending a penny: Free the H-1B's."

I agree.

Saturday, August 29, 2009

I joined the Bing Microsoft new Search Technology Centre (STC) in Europe

It’s official. I decided to join the new Microsoft Search Technology Centre (STC) in Europe. I work on Bing search technology, and I will be leading all the engineering development for UX and verticals in Europe.

My office is in the London site of STC Europe, which is located close to Carnaby Street, right in the centre of Soho, a location full of artists, music, and creativity. The environment and all the people around are having a galvanizing effect on me. I can feel the energy, and new ideas are flooding in.

STC Europe has three sites: London, Munich and Paris. In addition, it has a strong connection with STC Asia in Beijing, the India Development Center, the new in-development STC center at Silicon Valley, and, obviously, the headquarters in Redmond. All in all, this gives me more opportunity to travel, work with smart people, and improve my skills.

After a week here, I like the environment, open to experimenting with new stuff. You simply say: hey, I have this new idea for search, and you get the resources to experiment with it. If it works, it goes online. Search is all about continuous improvements and evolutions, isn't it?

Friday, August 28, 2009

The Daily Beast -- Five who are changing the face of the Internet.

The Daily Beast received a nomination for "Five who are changing the face of the Internet" by Newsweek.

I am proud of the ex-group that I led when I was at Ask.com. They contributed to delivering the algorithmic News Search experience for the DailyBeast, together with the other R&D center in NJ (if you search on DailyBeast, this is a service powered by Ask.com).

Wednesday, August 26, 2009

Pointers and Smart Pointers

I love when people impose the use of smart pointers. Well, if you come from C, you know that a pointer is nothing but a memory address. If you come from Java, it's another religion. I like you, but I am on the wild and spicy side.

If you are into C++, then you must use smart pointers. Smart pointers are all about resource management. You want to be on the wild but safe side of life. So when you allocate something, you may want to be sure that it will be deallocated at the right time. Talking about sustainability.

It's easy: a smart pointer's destructor takes the responsibility of freeing memory. And since the destructor is automatically called by the language when the object goes out of scope... you are on the wild but safe side of life. It's all about RAII.

There are a bunch of smart pointers, and you should know them all:
  1. std::auto_ptr and boost::scoped_ptr. Here the destructor will actually free the memory for you.
  2. boost::shared_ptr. Here the destructor will decrement a reference count, and when the count reaches zero it will free the memory. Very useful if you have a shared resource, and you are on the wild and open side.
Another important aspect is that you can transfer the ownership of the allocated object, if the semantics of the pointer allow it. For instance, returning a std::auto_ptr from a function will transfer the ownership of the object to the caller. Useful, no? A minimal sketch is below.
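Here is a minimal sketch of the three pointers in action (C++03 era, my own toy code; the Resource struct and make_resource function are just names I made up for illustration):

  #include <iostream>
  #include <memory>                  // std::auto_ptr (C++03)
  #include <boost/scoped_ptr.hpp>
  #include <boost/shared_ptr.hpp>

  struct Resource {
      Resource()  { std::cout << "acquired" << std::endl; }
      ~Resource() { std::cout << "released" << std::endl; }
  };

  // Returning an auto_ptr transfers ownership to the caller.
  std::auto_ptr<Resource> make_resource() {
      return std::auto_ptr<Resource>(new Resource);
  }

  int main() {
      {
          boost::scoped_ptr<Resource> p(new Resource);
          // p cannot be copied: one owner, no ownership transfer.
      } // p goes out of scope here: the destructor frees the memory (RAII)

      boost::shared_ptr<Resource> a(new Resource);
      boost::shared_ptr<Resource> b = a;  // reference count becomes 2
      a.reset();                          // count drops to 1, nothing freed yet
      b.reset();                          // count reaches 0, memory freed

      std::auto_ptr<Resource> owner = make_resource(); // ownership moved to owner
      return 0;  // owner goes out of scope and frees the last Resource
  }

Note that auto_ptr's transfer-on-copy semantics are also the reason it cannot be stored safely in STL containers.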

Monday, August 24, 2009

Book review: Large Scale C++ Software design

Large-Scale C++ Software Design is a must-read book if you are in the software industry. Sometimes you may want to move down from design patterns to the low-level physical organization of C++ projects. I believe this is the very first book dealing with this important aspect of software, which is too frequently ignored in favour of more "abstract" aspects.

On the negative side, the book is too redundant and could have been shortened. The most interesting chapter is number 5. Go directly there and start reading from that point.

Saturday, August 22, 2009

Remember Cuil? Now It’s a Real-Time Search Engine

Cuil does not look good. There was a lot of hype when they launched, and now they have moved to real-time news search; but it does not seem better than OneRiot.

"Okay, so is this a competitor to Twitter Search? Maybe a little, but really it’s more like OneRiot in terms of real-time search. And to be honest, OneRiot blows Cuil out of the water in this vertical."

Thursday, August 20, 2009

I no longer work for Ask.com

It's official. After a long period at Ask.com, it's now time to work on something different.

More than 4 years ago, I started with a small team of people that soon expanded into the first Ask.com European R&D Center. Pisa was selected as the location for its excellent quality of life and its good concentration of software engineers and academic researchers. Our office was magnificent, as you can see from this collection of pictures (1, 2, 3).

Different Pisa teams worked on various projects:
  • Image Search, co-lead -- "Ask new image search is a step ahead in a notoriously tricky area. With the quality of its image search results, combined with the new Zoom query refinement feature, I'll be using it as my default image search service going forward", SearchEngineWatch

  • News and Blog search, co-lead -- "Ask.com has a pretty original approach to the old-time, old-school, traditional maybe, view of news.", Techcrunch

  • Video News Search, lead -- "it's interesting that they've managed to integrate the video playing right into the main page since I doubt all the source videos are the same format (Flash, aspx, etc.)", Niraj user comment
  • DailyBeast, Tina Brown's news site, co-lead -- "How did IAC/Tina Brown's new Daily Beast do in its first month? Pretty well: The company says it attracted 2.3 million unique monthly visitors and served up 11.4 million page views. A great start for any publishing startup", Alleyinsider.
  • Core Web Search Infrastructure; Pisa was involved in the design and implementation of the middleware software providing the base of all the Ask.com search products. A number of people in Pisa worked on this project.
  • RealTime Fresh Web Ranking, lead; injecting and ranking fresh news, videos, and blogs into Web search results in real time.
  • Frontend Platform for UK; Pisa was involved in the Jeeves rebranding in the UK. A number of people in Pisa worked on this project.
  • A bunch of Search Patents, "Ask.com has been working hard since then at making itself a more useful resource for timely news information, and has started incorporating multimedia into that mix.", Seobythesea
Note that the above list includes some of the projects carried out by people that I hired in Pisa.

I posted several blog entries pointing out some differentiating aspects of our technology.

Working at Ask.com was an amazing experience, since the environment is very productive. Moreover, Ask.com recognizes your hard work: in 2007 I received the IAC Emerging Leader Award, and in 2006 the IAC Horizon Award, for being a top performer above and beyond expectations.

I want to thank the Pisa team for the impressive work we carried out together. I also want to thank all the people from other offices worldwide (Edison, NJ; Oakland & Campbell, CA; London, UK; Dublin, Ireland; Hangzhou, China). In no particular order: Kelly, Jim, Apostolos, Tomasz, Yufan, Yihan, Doug, Rona, Navid, Chuck, Tuoc, Nitin, Alex, Eric, Andy, Erik, Miguel, Dominic, Steve, Padriac, Peter, James, Michael, Juanita, John, Mary, Kurt, Brendan, Michelle, Danica, Amy, Cassie, and so many other people that it is difficult to mention everyone in a small blog posting.

I am fortunate to have worked with many bright, talented teams. I learned a lot from them.

Tuesday, August 18, 2009

Off-Topic: Ryanair business model

Ryanair: How a Small Irish Airline Conquered Europe. I definitely recommend reading this book. If you are a manager, you will find a lot of insights into running a company in a market already dominated by entrenched incumbents. Ryanair took the Southwest business model of low-cost, no-frills flights and adapted it to the European market, which was much more protective of larger companies and had different cultures and ways of doing business in each country. There is no free lunch here: all aspects of Ryanair's growth are discussed. They were about to run out of money many times, and they had a very hostile relationship with the trade unions. Still, they were able to make low-cost flights as ordinary as taking a train or a car, and even better than that. If you plan to run a company, or if you want a look inside the cost-saving kingdom, or if you plan to be an ass-hole once in your life, then this is your book.

Sunday, August 16, 2009

What is Apple doing with this big-ass data center?

Going search? Going social? Going apps? Or going for something completely new?

Saturday, August 15, 2009

Google losing market share

Quoting Mashable:

"Numbers released by Nielsen tell a similar story: while Google grew from June to July, it still lost market share to its competitors – from 66.1% in June to 64.8% in July, a 1.3 percentage point drop. However, a closer look at the numbers reveals that Bing wasn’t the primary culprit – it was Yahoo which stole Google’s market share."

Friday, August 14, 2009

Latent Space Domain Transfer between High Dimensional Overlapping Distributions

This paper combines different techniques for learning across knowledge domains: SVM regression is used to fill in missing values, and SVD is used to reduce dimensionality. A theoretical bound for the two combined techniques is also provided.

Tuesday, August 11, 2009

Knol is not looking goog(d)

Hmm, I had a similar idea, and Google released it first. I guess I was wrong.

Quoting marketpilgrim: "It’s been a little over a year since Google launched Google Knol. Now it appears the service may not make it to its 2nd birthday."

Monday, August 10, 2009

How do you define a query session?

How can you identify a query session? Smart Miner: A New Framework for Mining Large Scale Web Usage Data suggests using three major components: 1. temporal visit constraints; 2. the links among pages; and 3. maximal visit paths, computed using an a-priori-like algorithm. I suggest reading the paper if you want to see reasonable ideas for identifying query sessions. A toy sketch of the temporal component is given below.
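Just to make the temporal component concrete, here is a minimal sketch (my own toy code, not the paper's algorithm, with an assumed 30-minute inactivity timeout; the paper combines this with page links and maximal visit paths):

  #include <ctime>
  #include <string>
  #include <vector>

  struct Query {
      std::string text;
      std::time_t timestamp;  // seconds since the epoch
  };

  typedef std::vector<Query> Session;

  // Split a single user's time-ordered query stream into sessions:
  // a gap longer than 'timeout' seconds starts a new session.
  std::vector<Session> split_sessions(const std::vector<Query>& stream,
                                      std::time_t timeout = 30 * 60) {
      std::vector<Session> sessions;
      for (size_t i = 0; i < stream.size(); ++i) {
          if (sessions.empty() ||
              stream[i].timestamp - sessions.back().back().timestamp > timeout) {
              sessions.push_back(Session());  // gap too large: open a new session
          }
          sessions.back().push_back(stream[i]);
      }
      return sessions;
  }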

What I don't like is the experimental part: a site with 1.5K unique users and 5K pages cannot be considered a large Web site...

Sunday, August 9, 2009

Book Review: Building Search Applications


Building Search Applications: Lucene, LingPipe, and Gate is a pretty good introduction to Information Retrieval with a lot of pragmatic examples, based on Lucene, Gate, and LingPipe. I recommend adding it to your library if you like Lucene and Nutch, or if you need to maintain or create a medium-scale search application.

Friday, August 7, 2009

OffTopic: Dear Kara, I love your blog

Dear Kara, I love your blog. Every day, I cannot start working without checking it.

Thanks
Antonio.

Real Time Query Expansion -- Query Logs vs News (update)

After about 12 hours I checked two real-time search engines (namely Twitter and OneRiot). Neither of them offers a real-time query expansion service (yet).
Both of them show "trending topics", which are not related to the particular query submitted by the user.



Thursday, August 6, 2009

Real Time Query Expansion -- Query Logs vs News

Many search engines offer a related query suggestion service. For instance, when you search for "Obama" the search engine can suggest the query "Is Obama Muslim?". This happens because both queries have been submitted very frequently by different users in the same search session. In information retrieval this process is called Query Expansion. A common approach is to extract correlations between query terms by analyzing user logs; a toy sketch of this idea is given below.
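To make the log-based approach concrete, here is a minimal sketch (my own toy code, not any engine's actual implementation): once the log has been split into sessions, count how often two distinct queries co-occur in the same session.

  #include <map>
  #include <string>
  #include <utility>
  #include <vector>

  typedef std::vector<std::string> Session;             // queries in one session
  typedef std::pair<std::string, std::string> QueryPair;

  // Count ordered pairs of distinct queries appearing in the same session.
  // counts[(q, q2)] is high when users who type q also tend to type q2.
  std::map<QueryPair, int>
  cooccurrence_counts(const std::vector<Session>& sessions) {
      std::map<QueryPair, int> counts;
      for (size_t s = 0; s < sessions.size(); ++s) {
          const Session& qs = sessions[s];
          for (size_t i = 0; i < qs.size(); ++i)
              for (size_t j = 0; j < qs.size(); ++j)
                  if (i != j && qs[i] != qs[j])
                      ++counts[QueryPair(qs[i], qs[j])];
      }
      return counts;
  }

To suggest expansions for a query q, take the queries q2 with the highest counts[(q, q2)], possibly normalized (e.g. with pointwise mutual information) so that globally popular queries do not dominate the suggestions.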

The query-log-based approach shows its limits when you deal with real-time events: there might be no time to accumulate past queries, since the events are happening right now. For real-time search query expansion, a new idea is to extract fresh correlations from news events.

For instance, Sonia Sotomayor has just been confirmed to the high court.



Judd Gregg is one of the supporters.



And the algorithm nailed the correlation



Now compare the query suggestions provided by Google, where no correlations are provided since the event is too recent.



And compare with the query suggestions provided by Bing, where query-log-based related searches are shown.



I believe that leveraging both past query logs and real-time news events can provide a more complete and up-to-date query expansion service, since you get the best of both worlds.



(P.S.: In addition, please note that both Bing and Ask are showing a related fresh video, while Google is not.)