Monday, February 28, 2011

Similar images

Given an index I of images and an M image, devise techniques for matching similar images in the index.

Sunday, February 27, 2011

A good book to read if you are in DM: "Ensemble Methods in Data Mining: Improving Accuracy Through Combining Prediction"

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Prediction is a book that I definitively recommends if are interested in latest data mining trends.

A couple of years ago, I got fascinated by the simplicity of Adaboost algorithm and the amazing performance it offers for mining quite different sets of data. Adaboost is based on a very simple intuition: you start with a very simple and naive classifiers and then you improve the performance with boosting techniques. After that, I started to read papers about RandomForest, Bagging, Mart and many other boosting methodologies but I felt the lack of an unifying approach and description of all those techniques.

Ensemble Methods in Data Mining: Improving Accuracy Through Combining Prediction offers this view by giving a description of the ISLE framework in a very synthetic yet detailed exposition. Great work!

Saturday, February 26, 2011

Given a stream of items, select the k more frequent.

This is a nice problem to solve and a quite fascinating solution thanks to Vitter.

Friday, February 25, 2011

Interval with same sum

Given with two arrays A and B, each of size N where the elements of array contains either 1 or 0 we have to find such an interval (p,q)(inclusive) such that the sum of all the elements of A in this interval and sum of all elements of B in this interval is equal.

Wednesday, February 23, 2011

Area for historgrams

You are given an array representing heights of bars for an histogram. Find a rectangle in this histogram that has the maximum area.

Tuesday, February 22, 2011

Three lists

Given three lists: A, B and C of size N. Write a function which returns true if any 3 three numbers (1 from each list), sum up to K. False otherwise

Monday, February 21, 2011

Autosuggest is live in Italy, spain, and France Canada

We finally are live in 8 markets and 5 languages with geo-located aggregation, ranking, and filtering. This is really an amazing result achieved by my team

Sunday, February 20, 2011

Spell check problem

Given a dictionary find out if given word can be made by two words in dictionary. For example "spellchecker" is composed by "spell" and "checker". How can you detect it in efficient space and time?

Saturday, February 19, 2011

Sets intersection, union, membership

Given a set S1 and set S2 how can you estimate the intersection, the union, and the existence of at least one element in common. Assume that S1 is very large AND/OR S2 is also very large that they cannot fit in memory and you accept to have an approximate solution.

Friday, February 18, 2011

boxes: store them in minimal space

You have a set of n boxes, each one has dimensions (x, y, z) and you want to store them in minimal space. How do you proceed?

Thursday, February 17, 2011

Product at least k

How will you find the subarray whose product is at least k in an array of negative and positive numbers

Wednesday, February 16, 2011

Element that appears p% of times

You are given an array A of non unique elements and size M. Find an element that appears p% of times.

Tuesday, February 15, 2011

A feature suggestion to Spotify and Facebook (and youtube)

Spotify is probably one of the most useful social application so far. Here social means Facebook, even if there is a limited number of activities you can make on Twitter too. The number of social features available is amazing: I can share my music with my friends, create collaborative playlists, listen to what they listen, publish my playlist to my social Friends, and so on and so forth.

But.. wait a second. Could this be a new way of expanding my social network? I'd like to be friend with someone who shares my musical interest. Facebook is suggesting new friendships because You and the potential new friend(s) are both both having one or more friends in common.

So what about suggesting new friendships because You and the potential new friend(s) are both having similar musical interests? This would be an interesting new signal for making new friendships.

I wonder if this principle can be applied in other contexts as well...

Hey Youtube, I have one idea for you...

Monday, February 14, 2011

Processes and Processors

A process P_i has a e_i execution time, and must start at time t_i. The process can start just if a processor is available. There are PR processors in your systems and N processes P_i. Give an algorithm to test whether an execution is possible.

Sunday, February 13, 2011

Facebook usage of HBase and Haystack

This interesting video explains how Facebook uses Hbase for storing messages, email, etc. Hbase is a very good general purpose storage supporting fault-tolerance, replica and self balance. HBase is a public domain solution similar to google Hbase. So in a sense, Facebook is leveraging Google old ideas. Also, they are using an internal solution Haystack for storing attachments. Haystack is also used for the image product.

Saturday, February 12, 2011

Secret references, A feature suggestion to my Friends in Linked

LinkedIn is a wonderful social community for building your professional network, finding talents, and being contacted by tons of recruiters. One feature that is generally considered useful is "Recommendations" where your business partners, your colleagues or your manager can send a public recomendation about your experience, skills, and professional results based on their direct past work relations established with you.

This feature is useful but it's main problem is that all the references are public and people tends to coalize and quite soon they start sending multual bragging and very often exagerating their own capabilities. Therefore, it is sometime very hard to make a distinction between a good and a bad (or exagerrated) reference. How to solve this problem?

If you think about how "References" work in real life, you soon understand that we have a different mechanism in place. When I want to hire someone, I ask the candidate to provide the name of three anonymous refereers. Then, I contact the referees directly and their authoritativeness together with the secreteness of the communication will allow me to judge the quality of the candidate with no risk of listening to exaggerated bragging.

So LinkedIn, how about having a similar feature? I would pay for it. Give me a way to ask for secret refeerees and a way to communicate with them in a private manner.


Friday, February 11, 2011

Instant Twitter: can I follow who is around me?

I like the idea of followers in Twitter. I can follow my friends, celebrities and influencial people and other people can also follow me.

Not sure whether there is already a similar social service out of there, but I woul like to have a way to instantaneously communicate with the people around my geo-location that are willing to listen.

I believe that this would be an interesting feature for establishing new friendships around a place (e.g. a pub, a bistro, a disco, and so on and forth), or for getting help when I am in new area. Probably, there should also be a way to tag annoying people and for switching this feature on and off.

Doe anyone know whether there is already such social service?

Thursday, February 10, 2011

A feature suggestion to Facebook: Let my friends tag their messages, we have multiple interests.

I am a voracious user of Facebook News feed, where I receive messages from selected friends and updates from groups and pages that I like. Anyway, I believe that this service is currently working like old black and white colour TV. I can either decide to receive messages from a given friend (white colour), or decide to not receive them (black colour). Surely, Facebook is adopting quite sophisticate ranking algorithms for deciding whether messages coming from a specific friend that I like a lot should have a better position than the messages from a friend that I like a bit less. Nevertheless, there is no explicit way to differentiate among different types of message produced by a specific friend. This world is not having multiple colours, but just two.

In reality, people have multiple interests (or multiple colours). I could be very much interested in following all the messages coming from my friend Alice about latest cool technological gadgets, but not very interested in getting all her messages about recipes. Anyway, Mandy can have a lot of interest in recipes and zero interest in all the Alice's postings about gadgets. Alice has multiple interests and they should be expressed by the way of multiple colours. How to solve this problem?

How about to adopt a social solution? Let anyone posting a message annotate it with one more more tag ("recipes", "gadgets", and so on and so forth). Let anyone receiving a message express a like for that specific tag. This combination can be used by Facebook ranking algorithms as a powerful signal for expressing my willness to receive more and more messages about a specific topic (and less messages about a topic that I never liked).

How to create new topics? Well again a social solution. First, suggest a list of popular topics "as a I type", because I am lazy and I want to reuse meaningful tags already produced by my social graph. Second, simply allow the creation of new tags when no satisfying match is found. Each tag is a new colour and this world can therefore become a full HD colour TV.

How to select the topics that I like? Well again a social solution. I should be simply be allowed to "Like" that tag and this is an explicit signal to Facebook.

So Facebook can I have this feature?


Wednesday, February 9, 2011

Longest subarray with sum less than K

Given an array of integers find the longest subarray with sum less than k

Tuesday, February 8, 2011

find the longest palindromic substring.

Given a string, find the longest palindromic substring.

Monday, February 7, 2011

The node in the middle

There is a binary tree in which you are given three nodes x,y,z .Write a function which finds whether y lies in the path between x and z.

Sunday, February 6, 2011

Skip lists

Given an array of integers, each element represents the max number of jumps can make forward. What is the minimum number of element selections to reach the end of the array.

PS: In IR these are the skip lists, can you devise an quadratic algo? how about a linear one?

Saturday, February 5, 2011

Find a good sample of a query log

You have a stream of queries (infinite, meaning no way to hold all of it in memory). The stream follows a power law. Find a good strategy to sample the stream giving "a good" representativeness to both the tail and the head of the power law.

Friday, February 4, 2011

Find the top K elements in an hash

You have an hash (key, value). Find the top-K keys with largest value.

Thursday, February 3, 2011

Google, Bing, and web browsing data

Not fair to talk , but decided to mention this interesting posting by the way of Greg. It is also quotede below. Thanks Greg was my pleasure to quote you.

I suppose I should comment, as everyone else on the planet has, on Google's claim that Bing is copying their results.

My reaction is mostly one of surprise. I am surprised that Google wants this issue discussed in the press. I am surprised that Google wants this aired in the court of public opinion.

Google is trying to draw a line on what use of behavior data is acceptable. Google clearly thinks they are on the right side of that line, and I do too, but I'm not sure the average searcher would agree. And that is why Google is playing a dangerous game here, one that could backfire on them badly.

Let's take a look at what Google Fellow Amit Singhal said:
This experiment confirms our suspicion that Bing is using some combination of:
  • Internet Explorer 8, which can send data to Microsoft via its Suggested Sites feature
  • the Bing Toolbar, which can send data via Microsoft’s Customer Experience Improvement Program
or possibly some other means to send data to Bing on what people search for on Google and the Google search results they click.
Of course, what Amit does not mention here is that the widely installed Google Toolbar and the fairly popular Google Chrome web browser send very similar data back to Google, data about every page someone visits and every click they make. Moreover, Google tracks almost every web search and every click after a web search made by web users around the world, since almost every web search is done on Google.

By raising this issue, Google very publicly is trying to draw a particular line on how toolbar and web browsing data should be used, and that may be a dangerous thing for Google to do. The average searcher, for example, may want that line drawn somewhere other than where Google might expect it to be drawn -- they may want it drawn at not using any toolbar/Chrome data for any purposes, or even not using any kind of behavior data at all -- and, if that line is drawn somewhere other than where Google wants it, Google could be hurt badly. That is why I am surprised that Google is coming out so strong here.

As for the particular issue of whether this is copying or not, I don't have much to say on that, but I think the most thought-provoking piece I have seen related to that question is John Langford's post, "User preferences for search engines". John argues that searchers own their browsing behavior and can reveal what they do across the web to whoever they want to. Whether you agree or not with that, it is worth reading John's thoughts on it and considering what you think might be the alternative.

Wednesday, February 2, 2011

Given a function producing random numbers in the interval [0, X]

Given a function F1 producing random numbers in the interval [0, X] with uniform probability, build a function F2 producing random numbers in the interval [0, Y] with uniform probability. How many different solutions can you find to the problem?

Tuesday, February 1, 2011

Give a test of symmetry for a generic tree

Write a C code for understanding whether a tree is symmetric. This is a classical interview question, where the attention is to the programming details AND the ability to make induction reasoning.