Our application will be serving a large number of small, thumbnail-size images (about 6-12KB each) over HTTP. I've been asked to investigate whether a NoSQL data store is a viable solution for storing them. Ideally, we would like our data store to be fault-tolerant and distributed.
Is it a good idea to store blobs in NoSQL stores, and if so, which one is a good fit? Also, is NoSQL a good solution for our problem, or would we be better served storing the images in the file system and serving them directly from the web server? (As an aside, a CDN is currently not an option for us.)
I was looking for a similar solution for a personal project and came across Riak, which, to me, seems like an amazing solution to this problem. Basically, it distributes a specified number of copies of each file to the servers in the network. It is designed so that a server coming or going is no big deal: when a server leaves, the copies it held are redistributed among the others.
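To make the "specified number of copies" idea concrete, here is a minimal sketch of replica placement on a hash ring. This is not Riak's code or its client API, only an illustration of the general technique; the names `Ring`, `owners`, `copies`, and `vnodes`, and the server names, are made up for the example.

```python
import hashlib
from bisect import bisect_right


def _hash(key: str) -> int:
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical hash ring: each key is owned by `copies` distinct servers."""

    def __init__(self, servers, copies=3, vnodes=64):
        self.copies = copies
        self.vnodes = vnodes      # ring positions per server, to spread load
        self._points = []         # sorted list of (position, server)
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.vnodes):
            self._points.append((_hash(f"{server}#{i}"), server))
        self._points.sort()

    def remove(self, server):
        self._points = [p for p in self._points if p[1] != server]

    def owners(self, key):
        """Walk clockwise from the key's position, collecting distinct servers."""
        if not self._points:
            return []
        distinct = {server for _, server in self._points}
        wanted = min(self.copies, len(distinct))
        i = bisect_right(self._points, (_hash(key), ""))
        found = []
        while len(found) < wanted:
            server = self._points[i % len(self._points)][1]
            if server not in found:
                found.append(server)
            i += 1
        return found


ring = Ring(["node-a", "node-b", "node-c", "node-d"], copies=3)
before = ring.owners("thumb/12345.jpg")
ring.remove(before[0])            # simulate one server leaving
after = ring.owners("thumb/12345.jpg")
print(before, after)
```

When a server leaves, the other owners of a key keep their copies and the next server on the ring picks up the missing one, so only the data that lived on the departed server has to move.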
With the right configuration, Riak can deal with an entire datacenter crashing.
Oh, and it has commercial support available.
Whether to store images in a DB or on the filesystem is sometimes one of those "holy war" debates; each side feels its way of doing things is the one right way. In general:
To store in the DB:
To store on the filesystem:
I tend to come down on the side of the filesystem because it scales much better. But depending on the size of your project, either choice will likely work fine. With NoSQL, the differences are even less apparent.
MongoDB should work well for you. I haven't used it for blobs yet, but there is a nice FLOSS Weekly podcast interview with Michael Dirolf from the MongoDB team where he addresses this use case.
If you are in a Python environment, consider the y_serial module: http://yserial.sourceforge.net/
In under 10 minutes, you will be able to store and retrieve your images (in fact, any arbitrary Python object, including web pages) in compressed form, NoSQL-style.
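I don't want to misquote the y_serial API from memory, so the sketch below shows the underlying idea (compressed blobs in a single SQLite file) using only the standard library. It is not y_serial itself; `ThumbStore`, `put`, `get`, and the table layout are names I made up for illustration.

```python
import sqlite3
import zlib


class ThumbStore:
    """Hypothetical key/blob store: zlib-compressed bytes in one SQLite table."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to persist the data.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS thumbs "
            "(name TEXT PRIMARY KEY, data BLOB NOT NULL)"
        )

    def put(self, name, raw):
        # The `with` block commits on success and rolls back on error.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO thumbs (name, data) VALUES (?, ?)",
                (name, zlib.compress(raw)),
            )

    def get(self, name):
        row = self.db.execute(
            "SELECT data FROM thumbs WHERE name = ?", (name,)
        ).fetchone()
        return None if row is None else zlib.decompress(row[0])


store = ThumbStore()
store.put("avatar-42.jpg", b"\xff\xd8" + b"\x00" * 8000)  # stand-in for image bytes
thumb = store.get("avatar-42.jpg")
```

Keep in mind that JPEG and PNG data is already compressed, so the zlib step will save very little on real thumbnails. A single SQLite file is also neither distributed nor fault-tolerant on its own, so this fits a small project better than the setup described in the question.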
Well, a CDN would be the obvious choice. Since that's out, I'd say your best bet for fault tolerance and load balancing would be your own private data center (whatever that means to you) behind two or more load balancers, such as an F5. This will be the easiest setup to manage, and you can get as much fault tolerance as your hardware budget allows. You won't need any new software expertise, just XCOPY.
For true fault tolerance you're going to need geographic dispersion, or you're at the mercy of anyone with a backhoe.
(Gravatars?)