Random commentary about Machine Learning, BigData, Spark, Deep Learning, C++, STL, Boost, Perl, Python, Algorithms, Problem Solving and Web Search
Monday, June 16, 2008
Shingling and Text Clustering (Broder's shingles)
Shingling is an elegant technique for text clustering: it computes an approximation of the Jaccard similarity between documents in linear time. It is one of my favorite text clustering algorithms. Here you can find a C++, STL, Boost implementation.
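To make the idea concrete, here is a minimal sketch of the core step, not the implementation linked above. It assumes shingles are contiguous runs of k whitespace-separated words and computes the exact Jaccard similarity between two shingle sets; the function names (tokenize, shingles, jaccard) are made up for illustration. The linear-time approximation would come from comparing small fixed-size sketches of these sets (for example min-hash values) instead of the full sets, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split text into whitespace-separated words (hypothetical helper).
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return words;
}

// Build the set of k-word shingles: every run of k consecutive words,
// joined with single spaces. Duplicate shingles collapse in the set.
std::set<std::string> shingles(const std::string& text, std::size_t k) {
    assert(k > 0);
    const std::vector<std::string> words = tokenize(text);
    std::set<std::string> result;
    for (std::size_t i = 0; i + k <= words.size(); ++i) {
        std::string s = words[i];
        for (std::size_t j = 1; j < k; ++j) s += ' ' + words[i + j];
        result.insert(s);
    }
    return result;
}

// Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two shingle sets.
// Two empty sets are treated as identical (an assumption of this sketch).
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    if (a.empty() && b.empty()) return 1.0;
    std::size_t common = 0;
    for (const auto& s : a) common += b.count(s);
    return static_cast<double>(common) / (a.size() + b.size() - common);
}
```

For example, shingles("a rose is a rose is a rose", 4) yields three distinct shingles ("a rose is a", "rose is a rose", "is a rose is"), and jaccard applied to the shingle sets of two documents gives a resemblance score between 0 and 1.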
nice work
The use of connected component analysis has given me some ideas for my own project:
http://github.com/matpalm/resemblance/tree/master
mat