Hi, I'm writing a web crawler in Python to extract news articles from news websites like nytimes.com. What would be a good database to use as a backend for this project?
Thanks in advance!
This could be a great project to use a document database like CouchDB, MongoDB, or SimpleDB.
MongoDB has a hosted solution: http://mongohq.com. There is also a Python binding (PyMongo).
SimpleDB is a great choice if you are hosting this on Amazon Web Services.
CouchDB is an open source package from the Apache Software Foundation.
You can take a look at Firebird.
The Firebird Python driver is developed by the core team.
Personally, I love PostgreSQL, but other free databases such as MySQL will be fine too. If you have reasonably small amounts of data (a few GB at most), even the SQLite that comes with Python will do.
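To give a rough idea of the SQLite route, here is a minimal sketch using the sqlite3 module from the standard library. The table layout (url, title, body, fetched_at) and the helper names open_store and save_article are my own assumptions for illustration, not anything your crawler has to match. Using the URL as the primary key is one simple way to avoid storing the same article twice.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the article store; ':memory:' keeps it in RAM for testing."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per article, keyed by its URL.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS articles (
            url        TEXT PRIMARY KEY,
            title      TEXT NOT NULL,
            body       TEXT NOT NULL,
            fetched_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def save_article(conn, url, title, body):
    """Insert an article; return True if it was new, False if the URL was already stored."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO articles (url, title, body) VALUES (?, ?, ?)",
            (url, title, body),
        )
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = open_store()
    print(save_article(conn, "http://example.com/story-1", "Story 1", "Text..."))
    print(save_article(conn, "http://example.com/story-1", "Story 1", "Text..."))
```

If you later move to PostgreSQL or MySQL, the same table idea carries over, though the driver and the "insert if not already there" syntax will differ.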
I think the database itself will probably be one of the easier aspects of a web crawler like this.
If you expect a high read or write load on the database (for example, if you intend to run many crawlers at the same time), then you will want to steer in the direction of MySQL; otherwise something like SQLite will probably do you just fine.