Antonio Gulli's coding playground: A Python crawler for RSS, HTML, and images

Friday, August 22, 2008

A Python crawler for RSS, HTML, and images

So I wanted to learn a bit of Python. Well, you know, I have been scripting in Perl since 1997. Moreover, I am lazy. So why the heck should I learn a new language? Well, let's say that the environment around me is full of young and smart guys who love Python. So I tried it. After all, it is nice to add another knife to the drawer.

So where is the crawler? Here it is. It uses Eventlet from Second Life (Linden Lab), a nice framework that supports asynchronous I/O and coroutines. The resulting code is very compact and avoids the pitfalls of chaining a cascade of callbacks. RSS/Atom feeds are parsed using feedparser.
Images are handled with PIL. HTML pages are parsed with Beautiful Soup. MySQL is accessed with MySQLdb. Eventlet needs greenlet to run. The crawler downloads a bunch of RSS/Atom feeds, all the web pages referenced by the postings, and all the images contained in those web pages. A single thread performs all the network operations with a pool of coroutines.
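
To make the "one thread, a pool of coroutines, no callbacks" idea concrete, here is a minimal sketch. It is not the crawler's own code: it swaps in the standard library's asyncio for Eventlet and html.parser for Beautiful Soup so that it runs without extra packages. The names crawl, extract_images, ImageLinkParser, and fake_fetch are hypothetical, and fake_fetch is a stand-in for a real network download.

```python
import asyncio
from html.parser import HTMLParser


class ImageLinkParser(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)


def extract_images(html):
    """Return the image URLs found in an HTML string, in document order."""
    parser = ImageLinkParser()
    parser.feed(html)
    return parser.images


async def crawl(urls, fetch, pool_size=4):
    """Fetch every URL with at most pool_size downloads in flight.

    Everything runs in a single thread; each worker reads top to bottom
    like blocking code, with no callback chain.
    """
    slots = asyncio.Semaphore(pool_size)

    async def worker(url):
        async with slots:              # wait for a free slot in the pool
            html = await fetch(url)    # yields to other coroutines while waiting
        return url, extract_images(html)

    results = await asyncio.gather(*(worker(u) for u in urls))
    return dict(results)


async def fake_fetch(url):
    """Assumption: stands in for a real HTTP download, returns canned HTML."""
    await asyncio.sleep(0)
    return '<html><body><img src="%s/logo.png"><p>post</p></body></html>' % url


if __name__ == "__main__":
    pages = ["http://example.com/a", "http://example.com/b"]
    print(asyncio.run(crawl(pages, fake_fetch, pool_size=2)))
```

The fetch function is passed in as a parameter, so a real downloader can replace fake_fetch without touching the pool logic. A fuller crawler would also feed the extracted image URLs back into the same pool.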

4 comments:

Amr Ellafy, August 23, 2008 at 7:25 AM
I'm also lazy ! will definitely follow your path !
Anonymous, August 23, 2008 at 9:58 AM
Nice example, you might want to pass it through PyLint and take a manual pass to fix style inconsistencies. Changing a number of the comments into docstrings would also be beneficial.
codingplayground, August 27, 2008 at 2:31 AM
see also
http://codingplayground.blogspot.com/2008/08/python-web-rss-image-crawler-v2.html
Unknown, February 4, 2010 at 4:57 AM
any demo for one to have a look