Google is complaining about the direction the Internet is going. Apparently there is less openness these days, and the bad guy is Facebook, which collects user-generated content and does not share it in an open way. I am puzzled.
If this is a genuine rant, why are they building Google+, which is just a copycat of Facebook? And why have they been collecting search data about everyone's behaviour since 1998 without sharing it with the rest of the world?
A long time ago, I dreamt about a different world, with open standards for exchanging Facebook-like news feeds hosted by different providers (remember RSS and Atom? something similar, but social), so that the data would be public. Aggregation would then be centralized, and you would access the aggregated information much as you access search engines today. In that ideal world, login would no longer be centralized but would be based on OpenID. The social graph would be shared across different servers, and a remote node would just be a special form of href, so that anyone could index the public data.
(This is my personal idea and does not represent any opinion expressed by my company.)
Tuesday, April 17, 2012
Birthday
You are at a party with a friend; 10 people are present, including you and the friend. Your friend offers you a wager: for every person you find who has the same birthday as you, you get $1; for every person he finds who does not have the same birthday as you, he gets $2. Would you accept the wager?
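A quick expected-value check, assuming birthdays are uniform over 365 days (ignoring leap years) and that each of the 8 other guests is examined exactly once:

\mathbb{E}[\text{your winnings}] = 8 \cdot \frac{1}{365} \cdot \$1 \approx \$0.02, \qquad \mathbb{E}[\text{his winnings}] = 8 \cdot \frac{364}{365} \cdot \$2 \approx \$15.96

Under these assumptions, the wager is heavily against you.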
Saturday, April 14, 2012
An interesting project for parallel machine learning
MADlib is an interesting collection of data-parallel machine learning algorithms:
"MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
"MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
The MADlib mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development."
Friday, April 13, 2012
How many zeros at the end of 100!
This is tricky but fun.
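One way to see it: every trailing zero comes from a factor 10 = 2 * 5, and factors of 2 are far more plentiful than factors of 5 in 100!, so it suffices to count the factors of 5 (Legendre's formula, where the multiples of 25 contribute a second factor):

\left\lfloor \frac{100}{5} \right\rfloor + \left\lfloor \frac{100}{25} \right\rfloor = 20 + 4 = 24

So 100! ends in 24 zeros.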
Tuesday, April 10, 2012
Compute 2^128 -- no computer and possibly no paper
Can you compute it in your mind?
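One possible mental route, using 2^10 = 1024, which is roughly 10^3:

2^{128} = 2^{8} \cdot \left(2^{10}\right)^{12} = 256 \cdot (1.024)^{12} \cdot 10^{36} \approx 256 \cdot 1.33 \cdot 10^{36} \approx 3.4 \times 10^{38}

The exact value begins 3.40 x 10^38, so the estimate is within a percent.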
Monday, April 9, 2012
Can you estimate the cost of a Facebook data center?
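A Fermi-style sketch of one way to attack this; every figure below is an illustrative assumption, not a reported number. Suppose the data center hosts on the order of 10^5 servers at roughly $2,000 apiece, and that the building, power, and cooling infrastructure cost about as much again as the hardware:

10^{5} \cdot \$2000 \approx \$200\text{M (servers)}, \qquad \text{total} \approx 2 \cdot \$200\text{M} \approx \$400\text{M}

So a few hundred million dollars is a defensible order of magnitude.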
Friday, April 6, 2012
How many numbers with a 7 in the first 10000 integers?
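Counting the complement is easier. Treat 1..10000 as the four-digit strings 0000..9999 (swapping 10000 for 0 changes nothing, since neither contains a 7); a number avoids the digit 7 exactly when each of its four digits takes one of the other 9 values:

10^{4} - 9^{4} = 10000 - 6561 = 3439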
Thursday, April 5, 2012
DBScan
Hi Antonio,
First of all, apologies if this query is out of scope; I appreciate you may have moved on to other things, or may just be too busy at the moment.
I was looking for a simple implementation of DBSCAN to use as a starting point for some work, and I found your implementation here (http://codingplayground.blogspot.com/2009/11/dbscan-clustering-algorithm.html). I found it worked quite well for small enough data sets, but for large ones it was allocating excessive amounts of memory, and I eventually looked at the code to see where the problem could be. It seems to me there may be an issue with the cluster expansion part of your algorithm. This issue may not affect the results (I think it still produces the correct clusters), but it uses much more memory than it needs, and the problem is more severe for larger input datasets.
The problem is with these lines of code:
Neighbors ne1 = findNeighbors(nPid, _eps);
// enough support
if (ne1.size() >= _minPts)
{
    debug_cerr << "\t Expanding to pid=" << nPid << std::endl;
    // join
    BOOST_FOREACH(Neighbors::value_type n1, ne1)
    {
        // join neighbour
        ne.push_back(n1);
        //debug_cerr << "\tPushback pid=" << n1 << std::endl;
    }
    //debug_cerr << std::endl;
    //debug_cerr << " ne: " << ne << endl;
    debug_cerr << " ne size " << ne.size() << " " << ne1.size() << endl;
}
Specifically, the line "ne.push_back(n1)". At this point in the algorithm you have found a cluster and are looking for density-connected neighbours. You are adding all the new neighbours to the list to be considered, when you only need to add the ones that have not been considered yet. So rather than extending the original list with the full contents of the new list, you should extend it with the difference of the two lists. This small change stabilises the memory usage and makes the code run much faster.
kind regards
aonghus
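For reference, here is a minimal sketch of the fix aonghus describes: extend the expansion list only with neighbours that have never been queued before. It assumes Neighbors is a std::vector of point ids, as in the snippet above; the seen set and the helper name are mine, added for illustration.

#include <set>
#include <vector>
#include <boost/foreach.hpp>

typedef std::vector<int> Neighbors;

// Push only the points of ne1 that were never queued before.
// 'seen' must start out holding the initial contents of ne.
void joinNewNeighbors(Neighbors& ne, const Neighbors& ne1, std::set<int>& seen)
{
    BOOST_FOREACH(Neighbors::value_type n1, ne1)
    {
        // set::insert reports via .second whether the point is new,
        // so each point enters the expansion list at most once
        if (seen.insert(n1).second)
            ne.push_back(n1);
    }
}

With this change the expansion list is bounded by the total number of points, instead of growing with every re-discovered neighbour.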
Wednesday, April 4, 2012
Check if Alice has my phone number, and don't tell Eve
Alice is a friend of mine; Bob is my assistant, but I don't want him to know my super-private phone number. I want to check whether Alice has my super-private phone number, but without revealing the number to Bob. What can I do to keep my number private?
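One classic approach (not necessarily the answer the post has in mind) is to compare commitments instead of the number itself: I give Bob a digest of my number, Alice computes the digest of the number she has, and Bob only compares the two. A minimal sketch follows; the salt, the helper name, and the phone numbers are hypothetical, and std::hash is only a stand-in, since a real scheme would need a keyed cryptographic hash (a low-entropy phone number behind an unsalted hash can be brute-forced).

#include <functional>
#include <iostream>
#include <string>

// Stand-in commitment: hash the number together with a salt shared
// with Alice but hidden from Bob. In practice, use a cryptographic
// hash such as SHA-256 rather than std::hash.
std::size_t commit(const std::string& number, const std::string& salt)
{
    return std::hash<std::string>()(salt + number);
}

int main()
{
    const std::string salt = "agreed-with-alice";  // hypothetical shared salt
    std::size_t mine   = commit("555-0100", salt); // my number (hypothetical)
    std::size_t alices = commit("555-0100", salt); // the digest Alice sends Bob
    // Bob compares digests without ever seeing the number itself
    std::cout << (mine == alices ? "Alice has my number"
                                 : "Alice has a different number")
              << std::endl;
    return 0;
}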
Tuesday, April 3, 2012
An airplane is going from city A to city B
What would be the impact of the wind on a round trip from A to B?
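For a constant wind of speed w blowing along the route, airspeed v > w, and distance d each way, the round-trip time is

\frac{d}{v - w} + \frac{d}{v + w} = \frac{2dv}{v^{2} - w^{2}} \ge \frac{2d}{v}

with equality only when w = 0. The headwind leg costs more time than the tailwind leg saves, so any wind along the route makes the round trip slower.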
Monday, April 2, 2012
Helium balloon in a car
What direction will the balloon go when you accelerate?
Sunday, April 1, 2012
One biased coin
You have a biased coin: each toss comes up H or T, but it is biased towards T. How can you make sure the result is fair, whatever the bias?
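A standard answer (not necessarily the one the post intends) is von Neumann's trick: toss the coin twice; HT and TH have the same probability p(1-p) whatever the bias, so map them to the two fair outcomes and repeat on HH or TT. A sketch, where the bias value is only illustrative:

#include <iostream>
#include <random>

// One toss of a coin that shows heads with probability p.
bool biasedToss(std::mt19937& rng, double p)
{
    return std::bernoulli_distribution(p)(rng);
}

// Von Neumann's trick: toss twice; HT -> heads, TH -> tails,
// HH/TT -> discard and retry. P(HT) = P(TH) = p(1-p) for any p.
bool fairToss(std::mt19937& rng, double p)
{
    for (;;)
    {
        bool first  = biasedToss(rng, p);
        bool second = biasedToss(rng, p);
        if (first != second)
            return first;
    }
}

int main()
{
    std::mt19937 rng(42);
    const double p = 0.3;  // illustrative bias towards tails
    int heads = 0, trials = 100000;
    for (int i = 0; i < trials; ++i)
        if (fairToss(rng, p))
            ++heads;
    std::cout << "fair heads rate: "
              << static_cast<double>(heads) / trials << std::endl;
    return 0;
}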