Google is complaining about the direction the Internet is going. Apparently there is less openness these days, and the bad guy is Facebook, which collects user-generated content and does not share it in an open way. I am puzzled.
If this is a genuine rant, why are they building Google+, which is just a copycat of Facebook? And why have they been collecting search data about everyone's behaviour since 1998 without sharing it with the rest of the world?
A long time ago, I dreamt about a different world, with open standards for exchanging Facebook-like news feeds hosted by different providers (remember RSS and Atom? something similar, but social), so that the data would be public. Aggregation would then be centralized, and you would access the aggregated information much as you access search engines today. In that ideal world, login would no longer be centralized but would be based on OpenID. The social graph would be shared across different servers, and a remote node would just be a special form of href, so that anyone could index the public data.
(This is my personal idea and does not represent any opinion expressed by my company.)
Tuesday, April 17, 2012
Birthday
You are at a party with a friend; 10 people are present, including you and the friend. Your friend offers you a wager: for every person you find who has the same birthday as you, you get $1; for every person he finds who does not have the same birthday as you, he gets $2. Would you accept the wager?
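A quick expected-value check, assuming birthdays are uniform over 365 days (ignoring leap years) and that each of the 8 other guests is examined exactly once:

\mathbb{E}[\text{your winnings}] = 8 \cdot \frac{1}{365} \cdot \$1 \approx \$0.02, \qquad \mathbb{E}[\text{his winnings}] = 8 \cdot \frac{364}{365} \cdot \$2 \approx \$15.96

Under these assumptions, the wager is heavily against you.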
Saturday, April 14, 2012
An interesting project for parallel machine learning
MADlib is an interesting collection of data-parallel machine learning algorithms:
"MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
"MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
The MADlib mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development."
Friday, April 13, 2012
How many zeros at the end of 100!
This is tricky but fun.
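One way to see it: every trailing zero comes from a factor 10 = 2 * 5, and factors of 2 are far more plentiful than factors of 5 in 100!, so it suffices to count the factors of 5 (Legendre's formula, where the multiples of 25 contribute a second factor):

\left\lfloor \frac{100}{5} \right\rfloor + \left\lfloor \frac{100}{25} \right\rfloor = 20 + 4 = 24

So 100! ends in 24 zeros.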
Tuesday, April 10, 2012
Compute 2^128 -- no computer and possibly no paper
Can you compute it in your mind?
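One possible mental route, using 2^10 = 1024, which is roughly 10^3:

2^{128} = 2^{8} \cdot \left(2^{10}\right)^{12} = 256 \cdot (1.024)^{12} \cdot 10^{36} \approx 256 \cdot 1.33 \cdot 10^{36} \approx 3.4 \times 10^{38}

The exact value begins 3.40 x 10^38, so the estimate is within a percent.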
Monday, April 9, 2012
Can you estimate the cost of a Facebook data center?
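A Fermi-style sketch of one way to attack this; every figure below is an illustrative assumption, not a reported number. Suppose the data center hosts on the order of 10^5 servers at roughly $2,000 apiece, and that the building, power, and cooling infrastructure cost about as much again as the hardware:

10^{5} \cdot \$2000 \approx \$200\text{M (servers)}, \qquad \text{total} \approx 2 \cdot \$200\text{M} \approx \$400\text{M}

So a few hundred million dollars is a defensible order of magnitude.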
Friday, April 6, 2012
How many numbers with a 7 in the first 10000 integers?
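Counting the complement is easier. Treat 1..10000 as the four-digit strings 0000..9999 (swapping 10000 for 0 changes nothing, since neither contains a 7); a number avoids the digit 7 exactly when each of its four digits takes one of the other 9 values:

10^{4} - 9^{4} = 10000 - 6561 = 3439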
Thursday, April 5, 2012
DBScan
Hi Antonio,
First of all, apologies if this query is out of scope; I appreciate you may have moved on to other things, or may just be too busy at the moment.
I was looking for a simple implementation of DBSCAN to use as a starting point for some work, and I found your implementation here (http://codingplayground.blogspot.com/2009/11/dbscan-clustering-algorithm.html). I found it worked quite well for small enough data sets, but for large ones it was allocating excessive amounts of memory, and I eventually looked at the code to see where the problem could be. It seems to me there may be an issue with the cluster expansion part of your algorithm. This issue may not affect the results (I think it still produces the correct clusters), but it uses much more memory than it needs, and the problem is more severe for larger input datasets.
The problem is with these lines of code:
Neighbors ne1 = findNeighbors(nPid, _eps);
// enough support
if (ne1.size() >= _minPts)
{
    debug_cerr << "\t Expanding to pid=" << nPid << std::endl;
    // join
    BOOST_FOREACH(Neighbors::value_type n1, ne1)
    {
        // join neighbour
        ne.push_back(n1);
        //debug_cerr << "\tPushback pid=" << n1 << std::endl;
    }
    //debug_cerr << std::endl;
    //debug_cerr << " ne: " << ne << endl;
    debug_cerr << " ne size " << ne.size() << " " << ne1.size() << endl;
}
Specifically, the line "ne.push_back(n1)". At this point in the algorithm you have found a cluster and are looking for density-connected neighbours. You are adding all the new neighbours to the list to be considered, when you only need to add the ones that have not been considered yet. So rather than extending the original list with the full contents of the new list, you should extend it with the difference of the two lists. This small change stabilises the memory usage and makes the code run much faster.
kind regards
aonghus
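For reference, here is a minimal sketch of the fix aonghus describes: extend the expansion list only with neighbours that have never been queued before. It assumes Neighbors is a std::vector of point ids, as in the snippet above; the seen set and the helper name are mine, added for illustration.

#include <set>
#include <vector>
#include <boost/foreach.hpp>

typedef std::vector<int> Neighbors;

// Push only the points of ne1 that were never queued before.
// 'seen' must start out holding the initial contents of ne.
void joinNewNeighbors(Neighbors& ne, const Neighbors& ne1, std::set<int>& seen)
{
    BOOST_FOREACH(Neighbors::value_type n1, ne1)
    {
        // set::insert reports via .second whether the point is new,
        // so each point enters the expansion list at most once
        if (seen.insert(n1).second)
            ne.push_back(n1);
    }
}

With this change the expansion list is bounded by the total number of points, instead of growing with every re-discovered neighbour.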
Wednesday, April 4, 2012
Check if Alice has my phone number, and don't tell Eve
Alice is a friend of mine; Bob is my assistant, but I don't want him to know my super-private phone number. I want to check whether Alice has my super-private phone number, but without revealing the number to Bob. What can I do to keep my number private?
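One classic approach (not necessarily the answer the post has in mind) is to compare commitments instead of the number itself: I give Bob a digest of my number, Alice computes the digest of the number she has, and Bob only compares the two. A minimal sketch follows; the salt, the helper name, and the phone numbers are hypothetical, and std::hash is only a stand-in, since a real scheme would need a keyed cryptographic hash (a low-entropy phone number behind an unsalted hash can be brute-forced).

#include <functional>
#include <iostream>
#include <string>

// Stand-in commitment: hash the number together with a salt shared
// with Alice but hidden from Bob. In practice, use a cryptographic
// hash such as SHA-256 rather than std::hash.
std::size_t commit(const std::string& number, const std::string& salt)
{
    return std::hash<std::string>()(salt + number);
}

int main()
{
    const std::string salt = "agreed-with-alice";  // hypothetical shared salt
    std::size_t mine   = commit("555-0100", salt); // my number (hypothetical)
    std::size_t alices = commit("555-0100", salt); // the digest Alice sends Bob
    // Bob compares digests without ever seeing the number itself
    std::cout << (mine == alices ? "Alice has my number"
                                 : "Alice has a different number")
              << std::endl;
    return 0;
}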
Tuesday, April 3, 2012
An airplane is going from city A to city B
What would be the impact of the wind on a round trip from A to B?
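For a constant wind of speed w blowing along the route, airspeed v > w, and distance d each way, the round-trip time is

\frac{d}{v - w} + \frac{d}{v + w} = \frac{2dv}{v^{2} - w^{2}} \ge \frac{2d}{v}

with equality only when w = 0. The headwind leg costs more time than the tailwind leg saves, so any wind along the route makes the round trip slower.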
Monday, April 2, 2012
Helium balloon in a car
What direction will the balloon go when you accelerate?
Sunday, April 1, 2012
One biased coin
You have a biased coin: each toss comes up H or T, but it is biased towards T. How can you make sure the result is fair, whatever the bias?
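A standard answer (not necessarily the one the post intends) is von Neumann's trick: toss the coin twice; HT and TH have the same probability p(1-p) whatever the bias, so map them to the two fair outcomes and repeat on HH or TT. A sketch, where the bias value is only illustrative:

#include <iostream>
#include <random>

// One toss of a coin that shows heads with probability p.
bool biasedToss(std::mt19937& rng, double p)
{
    return std::bernoulli_distribution(p)(rng);
}

// Von Neumann's trick: toss twice; HT -> heads, TH -> tails,
// HH/TT -> discard and retry. P(HT) = P(TH) = p(1-p) for any p.
bool fairToss(std::mt19937& rng, double p)
{
    for (;;)
    {
        bool first  = biasedToss(rng, p);
        bool second = biasedToss(rng, p);
        if (first != second)
            return first;
    }
}

int main()
{
    std::mt19937 rng(42);
    const double p = 0.3;  // illustrative bias towards tails
    int heads = 0, trials = 100000;
    for (int i = 0; i < trials; ++i)
        if (fairToss(rng, p))
            ++heads;
    std::cout << "fair heads rate: "
              << static_cast<double>(heads) / trials << std::endl;
    return 0;
}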