I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it in Python. I might end up doing that - if anyone has any advice about any of those tools, I'm open to hearing about them. I've used Heritrix via its web interface and I found it to be quite cumbersome. I definitely won't be using a browser API for my upcoming project.
Thanks in advance. Also, this is my first SO question!
Check the HarvestMan, a multi-threaded web-crawler written in Python, also give a look to the spider.py module.
And here you can find code samples to build a simple web-crawler.
pyspider.py
Use Scrapy.
It is a twisted-based web crawler framework. Still under heavy development but it works already. Has many goodies:
Example code to extract information about all torrent files added today in the mininova torrent site, by using a XPath selector on the HTML returned:
Trust me nothing is better than curl.. . the following code can crawl 10,000 urls in parallel in less than 300 secs on Amazon EC2
CAUTION: Don't hit the same domain at such a high speed.. .
I hacked the above script to include a login page as I needed it to access a drupal site. Not pretty but may help someone out there.