I am developing a web system to handle a very large set of small images, about 100 million images of 50 KB to 200 KB each, stored on ReiserFS.

For now it is very difficult to back up and sync such a large number of small files.

My question is whether it is a good idea to store these small images in a key/value store or another NoSQL database, such as GridFS (MongoDB), Tokyo Tyrant, or Voldemort, to gain more performance and get better backup support?
Another alternative is to store the images in SVN and actually have the image folder on the web server be an svn sandbox of the images. That simplifies backup, but will have zero net effect on performance.
Of course, make sure you configure your web server to not serve the .svn files.
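A rough sketch of that setup with the standard svn/svnadmin tools (the repository URL and all paths below are placeholders, not anything from the original answer):

```bash
# Make the web server's image folder an SVN working copy (URL and paths are placeholders).
svn checkout https://svn.example.com/repos/images /var/www/images

# After new images are dropped into the folder, version and push them.
svn add --force /var/www/images
svn commit -m "Add new image batch" /var/www/images

# Backup then amounts to keeping the central repository safe, e.g.:
svnadmin dump /srv/svn/repos/images > /srv/backups/images-repo.dump
```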
If all your images, or at least the ones most accessed, fit into memory, then MongoDB GridFS might outperform the raw file system. You have to experiment to find out.
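To experiment with GridFS without writing application code, the mongofiles utility that ships with MongoDB can be used from the shell; the database name "images" and the file paths below are placeholders:

```bash
# Store one image in GridFS.
mongofiles --db images put /var/www/images/0001.jpg

# List stored files and fetch one back to check the round trip.
mongofiles --db images list
mongofiles --db images get /var/www/images/0001.jpg
```

Time reads under a realistic access pattern before committing to it.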
Of course, depending on your file system, whether or not you break the images up into folders will affect performance; one common approach is to hash filenames into sub-directories, as sketched below. In the past I noticed that ReiserFS was better at storing large numbers of files in a single directory. However, I don't know if that's still the best file system for the job.
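A minimal sketch of such bucketing, assuming two-character MD5 prefixes (about 256 buckets) and placeholder paths; none of this comes from the original answer:

```bash
# Bucket images into sub-directories named after the first two hex characters
# of each file's MD5 hash, so no single directory grows huge.
# With this many files, use find rather than a shell glob.
find /var/www/images -maxdepth 1 -type f -name '*.jpg' | while read -r f; do
    prefix=$(md5sum "$f" | cut -c1-2)
    mkdir -p "/var/www/images/$prefix"
    mv "$f" "/var/www/images/$prefix/"
done
```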
First off, have a look at this: Storing a million images in the filesystem. While it isn't about backups, it is a worthwhile discussion of the topic at hand.
And yes, large numbers of small files are pesky; they take up inodes, require space for filenames, etc., and it takes time to back up all of that metadata. Basically it sounds like you have the serving of the files figured out: if you run it on nginx, with a Varnish cache in front or such, you can hardly make it any faster. Adding a database under that will only make things more complicated, also when it comes to backing up. So I would suggest working harder on an in-place FS backup strategy.

First off, have you tried rsync with the -az switches (archive and compression, respectively)? They tend to be highly effective, as rsync doesn't transfer the same files again and again.
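For instance, a nightly job along these lines would only transfer new or changed images on each run (the host, user, and paths are placeholders; --delete is an optional extra, not something from the original answer):

```bash
# -a (archive) preserves permissions, timestamps, etc.; -z compresses in transit.
# --delete mirrors removals as well; drop it to keep deleted files on the backup host.
rsync -az --delete /var/www/images/ backup@backup.example.com:/srv/backups/images/
```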
Alternately, my suggestion would be to tar + gz the images into a number of archive files, one per sub-folder (assuming you have them split into sub-folders), roughly as sketched below.
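A minimal bash sketch of that idea, assuming one archive per top-level sub-folder; the source and destination paths are illustrative:

```bash
#!/usr/bin/env bash
# Create one compressed archive per sub-folder of the image tree.
set -euo pipefail

src=/var/www/images
dest=/srv/backups/image-archives
mkdir -p "$dest"

for dir in "$src"/*/; do
    name=$(basename "$dir")
    # -C keeps the paths inside the archive relative to the image root.
    tar -czf "$dest/$name.tar.gz" -C "$src" "$name"
done
```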
This will create a number of .tar.gz files that are easily transferred without too much per-file overhead.