Thursday, April 5, 2012

DBScan

Hi Antonio,

First of all apologies if this query is out of scope- I appreciate you may have moved on to other things, or may be just too busy at the moment.

I was looking for a simple implementation of DBSCAN to use as a starting point for some work and I found your implemention here (http://codingplayground.blogspot.com/2009/11/dbscan-clustering-algorithm.html). I found it worked quite well for small enough data sets, but for large ones it was allocating excessive amounts of memory, and I eventually looked at the code to see where the problem could be. It seems to me there may be an issue with the cluster expansion part of your algorithm- this issue may not affect the results (I think it may still produce the correct clusters), but it uses much more memory than it needs and this problem is more severe for larger input datasets.

The problem is with these lines of code

  Neighbors ne1 = findNeighbors(nPid, _eps);

              // enough support
              if (ne1.size() >= _minPts)
            {
              debug_cerr << "\t Expanding to pid=" << nPid << std::endl;   
              // join
              BOOST_FOREACH(Neighbors::value_type n1, ne1)
                {
                  // join neighbord
                  ne.push_back(n1);
                  //debug_cerr << "\tPushback pid=" << n1 << std::endl;
                }
              //debug_cerr << std::endl;
              //debug_cerr << " ne: " << ne << endl;
              debug_cerr << " ne size " << ne.size() << " " << ne1.size() << endl;
            }

specifically the line "ne.push_back(n1)". At this point in the algorithm you have found a cluster and are looking for density connected neighbours. You are adding all the new neighbours to the list to be considered, when you only need to add the ones that have not been considered yet. So rather than extending the original list with the full contents of the new list, you need to extend it with the difference of the two lists. This small change stabilises the memory usage and makes it run much faster.

kind regards

aonghus

1 comment:

  1. thanks!

    you might consider putting the code up on github- its a great implementation of just the basic algorithm.

    regards

    aonghus

    ReplyDelete