Random commentary about Machine Learning, BigData, Spark, Deep Learning, C++, STL, Boost, Perl, Python, Algorithms, Problem Solving and Web Search
Wednesday, December 31, 2008
Generic Hash_Map
This example is an extension of previous hash_set. Here I implemented an hash_map, which stores a pair into and chained hash. Data structure is based on generic programming.
Tuesday, December 30, 2008
Chained hash and hash function
This example leverages the generic list for building an array of lists. In turn, this is used for building a generic chained hash_set. My hash_set implementation is a subset of STL's hash_set extension. User can provide their own custom hash function by inheriting from STL's unary_function.
I suggest evaluating boost::unoderder, If you want to play with a more complete implementation.
I suggest evaluating boost::unoderder, If you want to play with a more complete implementation.
Generic list and mergesort
Generic programming is very powerful. You can write an ADT such as a list and don't care about the internal item. This example deal with lists and generic mergesort. A classical one.
Monday, December 29, 2008
Threads, Shared Lock, Unordered map
Boost is the C++ swiss knife. Here I play a bit with Threads, Shared Lock (a.k.a read/write locks) and hash map. In this example, multiple threads are writing (reading) data in an hash_map, synchronizing by using write lock (read lock).
Wednesday, August 27, 2008
Python web , rss, image crawler -- v2
Thanks for the comments to code v1. Here you have the Python web , rss, image crawler -- v2. Pylint 8.6 or something around that
Friday, August 22, 2008
A Python crawler for rss, html, and images
So I wanted to learn a bit of python. Well you know, I am used to script in Perl since 1997. Moreover, I am lazy. So why the heck I should learn a new language? Well let's say that the environment around me is full of these young and smart guys who love python. So I tried it. After all, it is nice to add another knife.
So where is the crawler? Here it is. Very compact. It uses Eventlet from SecondLife, a nice framework to support Async/IO and co-routines. The resulting code is very compact and it avoids all the pitfalls of calling a cascade of callbacks(). RSS/Atom feeds are parsed using feedparser.
Images are handled with PIL. HMTL pages are parser with Beautiful soap. Mysql is accessed with MySQLdb. Eventlet needs greenlet to run. The crawler downloads a bunch of rss/atom feeds, all the web pages referred by the postings, all the images contained in the web page. There is one single thread which performs all the network operations with pool of co-routines.
So where is the crawler? Here it is. Very compact. It uses Eventlet from SecondLife, a nice framework to support Async/IO and co-routines. The resulting code is very compact and it avoids all the pitfalls of calling a cascade of callbacks(). RSS/Atom feeds are parsed using feedparser.
Images are handled with PIL. HMTL pages are parser with Beautiful soap. Mysql is accessed with MySQLdb. Eventlet needs greenlet to run. The crawler downloads a bunch of rss/atom feeds, all the web pages referred by the postings, all the images contained in the web page. There is one single thread which performs all the network operations with pool of co-routines.
Boost::asio, Keep-alive, Boost::serialization
Boost::serialization is the ultimate solution if you want to send your objects trough a socket. I remember the old days when we were used to work with corba and similar stuff. Thanks to god this is not longer requested. This code is extending the async io server and serialize a test object. Ping messages are objects too. Generic programming is used to send as many different objects as you want. .. we are moving towards a full map reduce implementation
Monday, July 21, 2008
Boost::asio and Keep-alive
Boost::Asio is a very useful framework for portable networking. Same code can be used for all the unicies variants, macos and windows. Support for asynchronous IO is remarkable. It is possible to write a generic server and, then, specialize the framework for specific use. In this code, I experimented with portable keep-alive support using async no-op read() and asio timeout.
Saturday, June 28, 2008
Dictionaries and Serialization
Dictionaries are a very useful data structure. Here I played a bit with inheritance and boost::serialization. A base class Dictionary has been derived into an hash based Dictionary, implemented using hash_set or vector. The base class Dictionary has been also derived into a compressed Dictionary, using prefix compression and avoid strings. All the classes support serialization. Here you have the code.
PS: note that hash_set needs some special #define on my ubuntu. Otherwise it won't compile. Check the Makefile
PS: boost serialization for a polymorphic class is not a easy one. Check the code for the serialization of a derived class
PS: note that hash_set needs some special #define on my ubuntu. Otherwise it won't compile. Check the Makefile
PS: boost serialization for a polymorphic class is not a easy one. Check the code for the serialization of a derived class
Monday, June 16, 2008
Shingling and Text Clustering (Broder's shingles)
Shingling is an elegant clustering algorithm which can compute an approximation of Jaccard similarity in linear time. It is one of my favorite text clustering algorithm.
Here you can find a C++, STL, Boost implementation.
Here you can find a C++, STL, Boost implementation.
Subscribe to:
Posts (Atom)