Large scale image storage

Posted 2019-02-01 14:26

Question:

I will likely be involved in a project where an important component is a storage for a large number of files (in this case images, but it should just act as a file storage).

The number of incoming files is expected to be around 500,000 per week (averaging around 100 KB each), peaking at around 100,000 files per day and 5 per second. The total number of files is expected to reach tens of millions before reaching an equilibrium where files are being expired, for various reasons, at the input rate.

So I need a system that can store around 5 files per second at peak hours, while reading around 4 and deleting 4 at any time.
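As a rough sanity check on these figures, the arithmetic can be sketched as below. It assumes decimal units (1 KB = 1000 bytes), and the 50 million total is only one illustrative value for "tens of millions"; neither is stated above.

```python
# Back-of-the-envelope load estimate from the figures quoted in the question.
KB = 1000  # assumption: decimal kilobytes; the post does not say which unit is meant

AVG_FILE_BYTES = 100 * KB
FILES_PER_WEEK = 500_000
PEAK_WRITES_PER_SEC = 5
READS_PER_SEC = 4

# Sustained ingest volume per week, in GB.
weekly_ingest_gb = FILES_PER_WEEK * AVG_FILE_BYTES / 1e9

# Raw payload throughput at peak, in KB/s.
peak_write_kb_per_sec = PEAK_WRITES_PER_SEC * AVG_FILE_BYTES / KB
read_kb_per_sec = READS_PER_SEC * AVG_FILE_BYTES / KB


def steady_state_tb(total_files: int) -> float:
    """Total stored size in TB for a given number of files at the average size."""
    return total_files * AVG_FILE_BYTES / 1e12


if __name__ == "__main__":
    print(f"weekly ingest:      {weekly_ingest_gb:.0f} GB")
    print(f"peak write rate:    {peak_write_kb_per_sec:.0f} KB/s")
    print(f"read rate:          {read_kb_per_sec:.0f} KB/s")
    print(f"50M files at rest:  {steady_state_tb(50_000_000):.1f} TB")
```

The payload throughput is small; the total volume and the file count are the larger concerns at this scale.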

My initial idea is that a plain NTFS file system with a simple service for storing, expiring and reading should actually be sufficient. I could imagine the service creating sub-folders for each year, month, day and hour to keep the number of files per folder at a minimum and to allow manual expiration in case that should be needed.
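A minimal sketch of that year/month/day/hour layout, assuming the service knows each file's arrival time. The function name `dated_path`, the root `D:\images`, and the file name are hypothetical.

```python
from datetime import datetime
from pathlib import PureWindowsPath


def dated_path(root: str, received: datetime, file_name: str) -> PureWindowsPath:
    """Build root\\YYYY\\MM\\DD\\HH\\file_name from the arrival timestamp."""
    return (
        PureWindowsPath(root)
        / f"{received:%Y}"
        / f"{received:%m}"
        / f"{received:%d}"
        / f"{received:%H}"
        / file_name
    )


if __name__ == "__main__":
    # Hypothetical root folder and file name, for illustration only.
    print(dated_path(r"D:\images", datetime(2019, 2, 1, 14, 26), "img_000123.jpg"))
```

At the stated peak of 5 files per second, a single hour folder would hold at most 18,000 files, and expiring a whole day by hand amounts to deleting one folder.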

A large NTFS solution has been discussed here, but I could still use some advice on what problems to expect when building a storage with the mentioned specifications, what maintenance problems to expect and what alternatives exist. Preferably I would like to avoid a distributed storage, if possible and practical.

edit

Thanks for all the comments and suggestions. Some more bonus info about the project:

This is not a web-application where images are supplied by end-users. Without disclosing too much, since this is in the contract phase, it's more in the category of quality control. Think production plant with conveyor belt and sensors. It's not traditional quality control since the value of the product is entirely dependent on the image and metadata database working smoothly.

The images are accessed 99% by an autonomous application in first in - first out order, but random access by a user application will also occur. Images older than a day will mainly serve archive purposes, though that purpose is also very important.

Expiration of the images follows complex rules for various reasons, but at some date all images should be deleted. Deletion rules follow business logic dependent on metadata and user interactions.

There will be downtime each day, where maintenance can be performed.

Preferably the file storage will not have to communicate image location back to the metadata server. Image location should be uniquely deduced from metadata, possibly through a mapping database, if some kind of hashing or distributed system is chosen.

So my questions are:

  • Which technologies will do a robust job?
  • Which technologies will have the lowest implementing costs?
  • Which technologies will be easiest to maintain by the client's IT-department?
  • What risks are there for a given technology at this scale (5-20 TB data, 10-100 million files)?

Answer 1:

Here are some random thoughts on implementation and possible issues based on the following assumptions: an average image size of 100 KB, and a steady state of 50M (5GB) images. This also assumes users will not be accessing the file store directly, and will do it through software or a web site:

  1. Storage medium: The image sizes you give amount to rather paltry read and write speeds; I would think most common hard drives wouldn't have an issue with this throughput. I would put them in a RAID1 configuration for data security, however. Backups wouldn't appear to be too much of an issue, since it's only 5 GB of data.

  2. File storage: To prevent issues with the maximum number of files in a directory, I would take a hash of the file (MD5 at minimum; it is the quickest, but also the most collision-prone. Before people chime in to say MD5 is broken: this is for identification, not security. An attacker could pad images for a second preimage attack and replace all images with goatse, but we'll consider this unlikely) and convert that hash to a hexadecimal string. Then, when it comes time to stash the file in the file system, take the hex string in blocks of 2 characters and create a directory structure for that file based on them. E.g. if the file hashes to abcdef, the root directory would be ab, then under that a directory called cd, under which you would store the image with the name abcdef. The real name will be kept somewhere else (discussed below).

    With this approach, if you start hitting file system limits (or performance issues) from too many files in a directory, you can just have the file storage part create another level of directories. You could also store with the metadata how many levels of directories the file was created with, so if you expand later, older files won't be looked for in the newer, deeper directories.

    Another benefit here: If you hit transfer speed issues, or file system issues in general, you could easily split off a set of files to other drives. Just change the software to keep the top level directories on different drives. So if you want to split the store in half, put 00-7F on one drive and 80-FF on another.

    Hashing also nets you single instance storage, which can be nice. Since hashes of a normal population of files tend to be random, this should also net you an even distribution of files across all directories.

  3. Metadata storage: While 50M rows seems like a lot, most DBMS's are built to scoff at that number of records, with enough RAM, of course. The following is written based on SQL Server, but I'm sure most of these will apply to others. Create a table with the hash of the file as the primary key, along with things like the size, format, and level of nesting. Then create another table with an artificial key (an int Identity column would be fine for this), and also the original name of the file (varchar(255) or whatever), and the hash as a foreign key back to the first table, and the date that it was added, with an index on the file name column. Also add any other columns you need to figure out if a file is expired or not. This will let you store the original name if you have people trying to put the same file in under different names (but are otherwise identical, since they hash the same).

  4. Maintenance: This should be a scheduled task. Let Windows worry about when your task runs; that is less for you to debug and get wrong (what if you do maintenance every night at 2:30AM and you're somewhere that observes summer/daylight saving time? 2:30AM doesn't happen during the spring changeover). This service will then run a query against the database to establish which files are expired (based on the data stored per file name, so it knows when all references that point to a stored file are expired; any hashed file that is not referenced by at least one row in the file name table is no longer needed). The service would then go delete these files.
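The hashed directory layout from point 2 can be sketched roughly as follows. This is only an illustration: the function name `hashed_path` and the `levels` parameter are hypothetical, and MD5 is used because it is the hash named above.

```python
import hashlib
from pathlib import PureWindowsPath


def hashed_path(data: bytes, levels: int = 2) -> PureWindowsPath:
    """Return a relative path such as ab\\cd\\abcdef... for the file content.

    Each directory level consumes two hex characters of the MD5 digest,
    so every level has at most 256 sub-directories.
    """
    digest = hashlib.md5(data).hexdigest()
    dirs = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return PureWindowsPath(*dirs, digest)


if __name__ == "__main__":
    content = b"example image bytes"
    print(hashed_path(content))            # two directory levels
    print(hashed_path(content, levels=3))  # one level deeper, if directories fill up
```

Storing `levels` with each file's metadata, as suggested above, lets older files keep resolving to their original depth after the default is raised.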

I think that's about it for the major parts.

EDIT: My comment was getting too long, moving it into an edit:

Whoops, my mistake, that's what I get for doing math when I'm tired. In this case, if you want to avoid the extra redundancy of adding RAID levels (51 or 61 e.g. mirrored across a striped set), the hashing would net you the benefit of being able to slot 5 1TB drives into the server, and then have the file storage software span the drives by the hash like mentioned at the end of 2. You could even RAID1 the drives for added security for this.
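One way to span drives by hash, sketched under the assumption that the top-level directory (the first two hex characters) decides the drive. The function name and the drive roots are hypothetical.

```python
from typing import Sequence


def drive_for_hash(hex_digest: str, roots: Sequence[str]) -> str:
    """Map the first byte of the hash (00-FF) onto one of the storage roots.

    The 256 possible top-level directories are divided into contiguous,
    roughly equal ranges, one range per root.
    """
    top = int(hex_digest[:2], 16)  # 0..255
    return roots[top * len(roots) // 256]


if __name__ == "__main__":
    # Hypothetical drive letters for five 1TB drives.
    drives = ["D:\\", "E:\\", "F:\\", "G:\\", "H:\\"]
    for digest in ("00aa", "7f00", "80ff", "ffee"):
        print(digest, "->", drive_for_hash(digest, drives))
```

With two roots this reproduces the 00-7F / 80-FF split mentioned in point 2. Changing the number of roots moves some ranges, so existing files would have to be relocated or the old mapping kept alongside the new one.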

Backing up would be more complex, though the file system creation/modification times would still hold for doing this (you could have it touch each file to update its modification time when a new reference to that file is added).

I see a two-fold downside to going by date/time for the directories. First, it is unlikely the distribution would be uniform, this will cause some directories to be fuller than others. Hashing would distribute evenly. As for the spanning, you could monitor the space on the drive as you add files, and start spilling over to the next drive when space runs out. I imagine part of the expiry is date related, so you would have older drives start to empty as newer ones fill up, and you'd have to figure out how to balance that.

The metadata store doesn't have to be on the server itself. You're already storing file related data in the database. As opposed to just referencing the path directly from the row where it is used, reference the file name key (the second table I mentioned) instead.
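The two tables from point 3 and the cleanup query from point 4 might look roughly like this. The sketch uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it runs anywhere; all table and column names, and the `expires_at` column in particular, are assumptions made for illustration.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE stored_file (
    hash        TEXT PRIMARY KEY,      -- hex digest, also the file name on disk
    size_bytes  INTEGER NOT NULL,
    format      TEXT NOT NULL,
    dir_levels  INTEGER NOT NULL       -- directory nesting used when stored
);
CREATE TABLE file_name (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,   -- artificial key
    original_name TEXT NOT NULL,
    hash          TEXT NOT NULL REFERENCES stored_file(hash),
    added_at      TEXT NOT NULL,       -- ISO 8601 date string
    expires_at    TEXT NOT NULL        -- hypothetical expiry column
);
CREATE INDEX ix_file_name_original_name ON file_name(original_name);
"""


def purge_expired(conn: sqlite3.Connection, now: str) -> List[str]:
    """Drop expired name rows, then return hashes that nothing references."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM file_name WHERE expires_at <= ?", (now,))
        orphans = [row[0] for row in conn.execute(
            "SELECT hash FROM stored_file "
            "WHERE hash NOT IN (SELECT hash FROM file_name) ORDER BY hash"
        )]
        conn.executemany(
            "DELETE FROM stored_file WHERE hash = ?", [(h,) for h in orphans]
        )
    return orphans  # the caller deletes these files from disk
```

The application rows would then hold `file_name.id` rather than a path, and the maintenance task would call `purge_expired` and remove the returned files from the store.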

I imagine users use some sort of web or application to interface to the store, so the smarts to figure out where the file would go on the storage server would live there, and just share out the roots of the drives (or do some fancy stuff with NTFS junctioning to put all the drives into one subdirectory). If you're expecting to pull down a file via a web site, create a page on the site that takes the file name ID, then perform the lookup in the DB to get the hash, then it would break the hash up to whatever configured level, and request that over the share to the server, then stream it back to the client. If expecting a UNC to access the file, have the server just build the UNC instead.
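Building the UNC path from a looked-up hash could be as simple as the sketch below. The server name, share name, and function name are hypothetical; the hash and nesting level are assumed to come from the metadata row found via the file name ID.

```python
def unc_for_hash(server: str, share: str, hex_digest: str, levels: int) -> str:
    """Build \\\\server\\share\\ab\\cd\\abcdef... from a stored hash.

    `levels` is the directory nesting recorded in the metadata for this file.
    """
    dirs = [hex_digest[i * 2:i * 2 + 2] for i in range(levels)]
    return "\\\\" + "\\".join([server, share, *dirs, hex_digest])


if __name__ == "__main__":
    # Hypothetical server and share names.
    print(unc_for_hash("imgstore01", "images", "abcdef", levels=2))
```

Because only this function knows the layout, the directory depth or the share can change later without touching the end-user application.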

Both of these methods would make your end-user app less dependent on the structure on the file system itself, and will make it easier for you to tweak and expand your storage later.



Answer 2:

Store the images in a series of SQLite databases. It sounds crazy at first, but it seriously is faster than storing them on the file system directly, and takes up less space.

SQLite is extremely efficient at storing binary data and by storing the files in an aggregated database instead of individual OS files it saves overhead when the images don't fit into exact block sizes (which is significant for this many files). Also the paged data in SQLite can give you faster throughput overall than you would get with plain OS files.

SQLite has concurrency limitations on writes, but they are well within the limits you're talking about and can be mitigated even further by clever use of multiple (even hundreds of) SQLite databases.
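A minimal sketch of the idea, using Python's built-in sqlite3 module. The class name, file naming, shard count, and the choice of CRC32 to pick a database are all assumptions for illustration, not part of the suggestion above.

```python
import sqlite3
import zlib
from pathlib import Path
from typing import Optional


class BlobStore:
    """Spread image blobs over several SQLite database files."""

    def __init__(self, directory: str, shards: int = 16) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.shards = shards

    def _connect(self, key: str) -> sqlite3.Connection:
        # A stable checksum of the key decides which database file is used.
        shard = zlib.crc32(key.encode("utf-8")) % self.shards
        conn = sqlite3.connect(str(self.directory / f"images_{shard:03d}.sqlite"))
        conn.execute(
            "CREATE TABLE IF NOT EXISTS image (key TEXT PRIMARY KEY, data BLOB NOT NULL)"
        )
        return conn

    def put(self, key: str, data: bytes) -> None:
        conn = self._connect(key)
        try:
            with conn:  # commit on success, roll back on error
                conn.execute(
                    "INSERT OR REPLACE INTO image (key, data) VALUES (?, ?)", (key, data)
                )
        finally:
            conn.close()

    def get(self, key: str) -> Optional[bytes]:
        conn = self._connect(key)
        try:
            row = conn.execute("SELECT data FROM image WHERE key = ?", (key,)).fetchone()
            return row[0] if row else None
        finally:
            conn.close()

    def delete(self, key: str) -> None:
        conn = self._connect(key)
        try:
            with conn:
                conn.execute("DELETE FROM image WHERE key = ?", (key,))
        finally:
            conn.close()
```

Writers only contend with each other when their keys land in the same database file, which is why more shards ease the write-concurrency limit. Opening a connection per call keeps the sketch short; a real service would likely keep connections open.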

Try it out, you'll be pleasantly surprised.



Answer 3:

Just a few suggestions, based on general info provided here, w/out knowing specifics on what your application actually does or will be doing.

  • use sha1 of the file as a file name (if needed, store user-supplied file name in DB)

    the thing is that if you care about the data, you would have to store a checksum anyways.
    If you use sha1 (sha256, md5, or another hash) it will then be easy to validate file data: read the file, calculate the hash, and if it matches the name then the data is valid. Assuming that this is a webapp of some kind, the hash-based file name can be used as an etag when serving data (check your .git directory for an example of this). This assumes that you cannot use the user-supplied file name anyway, as a user can send something like "<>?:().txt"

  • use directory structure that makes sense from your app standpoint

    the main test here is that it should be possible to identify a file just by looking at PATH\FILE alone, w/out doing a metadata lookup in the DB. If your store/access patterns are strictly time-based then STORE\DATE\HH\FILE would make sense; if you have files that are owned by users, then perhaps STORE\<1st N digits of UID>\UID\FILE would make sense.

  • use transactions for file/metadata operations

    i.e. start a transaction to write the file metadata, try writing the file to the FS, commit the transaction on success, and roll back on error. The utmost care should be taken to avoid a situation where you have file metadata in the DB and no file in the FS, or vice versa.

  • use several root storage locations

    i.e. STORE01\ STORE02\ STORE\ - this can help in development (and later with scaling out). It is possible that several developers will be using one central DB and file storage that is local to their machine. Using STORE from the start will help to avoid a situation where a metadata/file combination is valid in one instance of an app and not valid in another.

  • never store absolute paths in DB
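Several of the suggestions above (hash as file name, a transaction around the file and metadata write, relative paths only) can be combined in one small sketch. It uses Python's built-in sqlite3 module as a stand-in database; the table `file_meta`, its columns, and the function names are hypothetical.

```python
import hashlib
import os
import sqlite3
from pathlib import Path

# Hypothetical metadata table; only a path relative to the store root is kept.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_meta (
    sha1          TEXT PRIMARY KEY,
    original_name TEXT NOT NULL,
    rel_path      TEXT NOT NULL
)
"""


def store(conn: sqlite3.Connection, root: Path, data: bytes, original_name: str) -> str:
    """Insert the metadata row and write the file; roll back the row if the write fails."""
    digest = hashlib.sha1(data).hexdigest()
    rel_path = Path(digest[:2]) / digest  # e.g. ab/abcdef..., never absolute
    target = root / rel_path
    with conn:  # commits on success, rolls back if anything below raises
        conn.execute(
            "INSERT INTO file_meta (sha1, original_name, rel_path) VALUES (?, ?, ?)",
            (digest, original_name, rel_path.as_posix()),
        )
        target.parent.mkdir(parents=True, exist_ok=True)
        tmp = target.with_name(target.name + ".tmp")
        tmp.write_bytes(data)
        os.replace(tmp, target)  # move into place only once fully written
    return digest


def verify(root: Path, rel_path: str) -> bool:
    """Re-hash the file and compare with its own name."""
    path = root / rel_path
    return hashlib.sha1(path.read_bytes()).hexdigest() == path.name
```

Because the file system itself is not transactional, a crash between the file write and the commit can still leave a file without a metadata row, so a periodic sweep for such orphans would be a reasonable companion. In this sketch, storing the same content twice raises an integrity error on the primary key.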