Question:
I have a process that's going to initially generate 3-4 million PDF files, and continue at the rate of 80K/day. They'll be pretty small (50K) each, but what I'm worried about is how to manage the total mass of files I'm generating for easy lookup. Some details:
- I'll have some other steps to run once a file has been generated, and there will be a few servers participating, so I'll need to watch for files as they're generated.
- Once generated, the files will be available through a lookup process I've written. Essentially, I'll need to pull them based on an order number, which is unique per file.
- At any time, an existing order number may be resubmitted, and the generated file will need to overwrite the original copy.
Originally, I had planned to write these files all to a single directory on a NAS, but I realize this might not be a good idea, since there are millions of them and Windows might not handle a million-file-lookup very gracefully. I'm looking for some advice:
- Is a single folder okay? The files will never be listed - they'll only be retrieved using a System.IO.File with a filename I've already determined.
- If I do a folder, can I watch for new files with a System.IO.DirectoryWatcher, even with that many files, or will it start to become sluggish with that many files?
- Should they be stored as BLOBs in a SQL Server database instead? Since I'll need to retrieve them by a reference value, maybe this makes more sense.
Thank you for your thoughts!
Answer 1:
I'd group the files in specific subfolders, and try to organize them (the subfolders) in some business-logic way. Perhaps all files made during a given day? During a six-hour period of each day? Or every # of files, I'd say a few 1000 max. (There's probably an ideal number out there, hopefully someone will post it.)
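A minimal sketch of the per-day grouping idea; the base directory and the `.pdf` naming here are illustrative assumptions, not from the question:

```python
import os
from datetime import datetime

def dated_path(base_dir, order_number, when):
    """Group PDFs into per-day subfolders: base/YYYY/MM/DD/ORDER.pdf.

    base_dir and the order-number filename are assumptions for
    illustration; any business-logic grouping works the same way.
    """
    sub = when.strftime("%Y/%m/%d")
    return os.path.join(base_dir, *sub.split("/"), order_number + ".pdf")

print(dated_path("/nas/pdfs", "A12345", datetime(2009, 3, 14)))
# -> /nas/pdfs/2009/03/14/A12345.pdf
```

Because the path is derived purely from the order's timestamp, a resubmitted order lands on the same path and overwrites the original, as the question requires.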
Do the files ever age out and get deleted? If so, sort and file by deletable chunk. If not, can I be your hardware vendor?
There are arguments on both sides of storing files in a database.
- On the one hand you get enhanced security, 'cause it's more awkward to pull the files from the DB; on the other hand, you get potentially poorer performance, 'cause it's more awkward to pull the files from the DB.
- In the DB, you don't have to worry about how many files per folder, sector, NAS cluster, whatever--that's the DB's problem, and probably they've got a good implementation for this. On the flip side, it'll be harder to manage/review the data, as it'd be a bazillion blobs in a single table, and, well, yuck. (You could partition the table based on the afore-mentioned business-logic, which would make deletion or archiving infinitely easier to perform. That, or maybe partitioned views, since table partitioning has a limit of 1000 partitions.)
- SQL Server 2008 has the FileStream data type; I don't know much about it, might be worth looking into.
A last point to worry about is keeping the data "aligned". If the DB stores the info on the file along with the path/name to the file, and the file gets moved, you could get totally hosed.
Answer 2:
To answer your questions:
- I wouldn't store them in a single folder, as chances are at some point you will want to look at the actual files on the disk rather than go through some other interface. Instead, why not store them in separate directories, split into batches of 1000? Possibly using the ID as a key.
- That many files will probably flood the DirectoryWatcher, so some will be lost. I've used this in the past, and past a certain point (a few hundred) I've found it starts to miss files. Possibly use a different directory for incoming files, and then process this every so often. This can then trigger a process to update the original.
- I wouldn't store the documents in a database, but definitely store metadata in a database.
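The incoming-directory polling suggested above could be sketched like this; the directory names are assumptions, and polling sidesteps the watcher's tendency to drop events under load:

```python
import os
import shutil

def drain_incoming(incoming_dir, processed_dir):
    """Move everything out of a drop directory in one sweep.

    Each generator writes into incoming_dir; a periodic job calls this
    instead of relying on per-file change notifications. Directory
    names are illustrative assumptions.
    """
    os.makedirs(processed_dir, exist_ok=True)
    moved = []
    for name in sorted(os.listdir(incoming_dir)):
        src = os.path.join(incoming_dir, name)
        if os.path.isfile(src):
            shutil.move(src, os.path.join(processed_dir, name))
            moved.append(name)
    return moved
```

Running this on a timer (or after each generation batch) gives an at-least-once handoff: a file either stays in the incoming directory or is moved exactly once, with no events to miss.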
Answer 3:
You can easily organize files into multiple folders without having to do this by business logic, or order-per-day, which is especially nice if that kind of ordering would be 'clumpy' (many hits in one folder, few in others).
The easiest way to do this is to create a unique hash for the file name, so that maybe you get something like this:
sf394fgr90rtfofrpo98tx.pdf
Then break this up into two-character blocks, and you will get this:
sf/39/4f/gr/90/rt/fo/fr/po/98/tx.pdf
As you can see, it gives you a deep directory tree that you can easily navigate.
With a good hash function, this will be very evenly distributed, and you will never get more than 1296 entries per directory. If you ever get a collision (which should be extremely rare), just add a number to the end: tx.pdf, tx_1.pdf, tx_2.pdf. Again, collisions on such large hashes should be extremely rare, so the kind of clumping you get because of this is a non-issue.
You said that the documents are digitally signed, so you probably have the hash you need right there in form of the signature string.
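A sketch of the sharding scheme above. SHA-1 hex digests are an assumption here (with a hex alphabet, two-character blocks give 256 entries per level rather than the 1296 a lowercase-alphanumeric alphabet allows); two levels are shown instead of the answer's deeper tree:

```python
import hashlib
import os

def sharded_path(base_dir, order_number):
    """Build a nested path from a hash of the order number so files
    spread evenly across directories.

    Two levels of two-character hex blocks give 256 * 256 = 65,536
    buckets; for 4M files that's roughly 60 files per directory.
    """
    digest = hashlib.sha1(order_number.encode("utf-8")).hexdigest()
    return os.path.join(base_dir, digest[:2], digest[2:4], digest + ".pdf")

print(sharded_path("/nas/pdfs", "A12345"))
```

Since the path is a pure function of the order number, lookup needs no index and a resubmitted order overwrites its original file automatically.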
Answer 4:
1) A single folder may be acceptably fast with a separate index, but since it's trivial to put the files into subdirectories, which also lets you browse them, just do that.
So now you have to figure out your naming convention. I'd normally suggest a hash to get an even distribution of IDs, but since you're generating so many files it probably makes sense to use the values you've already got. If you've got an order number, do you have a timestamp too? If so, just prefix the order number with a timestamp.
Just be aware that if you're using order ids you may experience http://en.wikipedia.org/wiki/Benford%27s_law
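The timestamp-prefix naming could look like the following sketch; the exact format string is an assumption, chosen so names sort chronologically as plain strings:

```python
from datetime import datetime

def timestamped_name(order_number, when):
    """Prefix the order number with a sortable timestamp.

    %Y%m%d%H%M%S sorts lexicographically in time order; the format
    and the .pdf suffix are illustrative assumptions.
    """
    return when.strftime("%Y%m%d%H%M%S") + "_" + order_number + ".pdf"

print(timestamped_name("A12345", datetime(2009, 3, 14, 9, 30, 0)))
# -> 20090314093000_A12345.pdf
```

Note this scheme gives a resubmitted order a *new* name (new timestamp), so the overwrite requirement from the question would need the lookup index to point at the latest name.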
Answer 5:
Determine some logical ordering of subdirectories and store them in blocks of no more than 512 or so files in a folder.
Do not store the files in a database. Databases are for data, file servers are for files. Store them on a file server, but store the path and retrieval information in a database.
Answer 6:
You need to test it. All of these solutions depend on the underlying filesystem. Some filesystems can handle huge directories, some can't. Some file systems index their directories, some don't (these two points are not necessarily related).
Breaking things up into a tree of directories has a reasonable chance of being performant, simply because, in the end, the individual directories tend to have few overall entries. That works for most any filesystem, simply because even a "stupid" one that is doing a linear directory search for your file can search a couple hundred entries reasonably quickly.
If the filesystem is indexing the directories (with, say, a btree, or simply by sorting them internally, which is effectively the same thing in this context), then the directory sizes are less important, although some tools may complain (loading a Windows Explorer window with 4M files, who knows what will happen).
So, I would research your planned operating system and the filesystem options, and test it and see which works best for you.
Answer 7:
Why not consider storing all those files, once they've been converted into PDF, in the DB (as blobs)?
Advantages:
- I believe you won't have to deal directly with OS I/O, and can leave everything up to the DB.
- No need for hashed naming
- Easy to back up and maintain
Answer 8:
When using a database to store your files, especially with small files, the overhead should be small.
but you can also do things like:
DELETE FROM BLOBTABLE WHERE NAME LIKE '<whatever>'
or when you have an expiry date, or want to refresh a file, you remove it by:
DELETE FROM BLOBTABLE WHERE CREATIONDATE < ...
etc...
Answer 9:
Question:
Why do these documents need to be generated and stored as PDFs?
If they can be generated, why not just keep the data in the database and generate them on the fly when required? This means you can search the actual data that's required for searching anyway, and not have the files on disk. It also means you can update the PDF template when required without the need to regenerate anything.
Answer 10:
1) This goes totally contrary to what I typically preach, but you may want to store them in a SQL database since they are truly small files. SQL Server would also allow you to quickly and easily find the files you need without any crazy disk thrashing normally associated with enumerating such a large directory. Also, storing the files in SQL (while I am generally against it) would greatly ease the backup/restore process.
2) Store them all in directories and either index them with windows indexing service (shivers) or create your own index in SQL Server which would contain the file name and full path. I would suggest storing them in separate directories, with only a few tens of thousands of files each. Perhaps you could use the order year as the folder name?
Regardless of how they're stored, do not scan the directory to find the files - you will definitely need to have an index of some sort.
Hope this helps!
Answer 11:
My file database contains over 4 million folders, with many files in each folder.
I just tossed all the folders into one directory. NTFS can handle this without any issue, and advanced tools like robocopy can help when you need to move them.
Just make sure you can index the files without a scan. I did this by putting my index in a MySQL database.
So to get a file I query the MySQL database on some metadata, get back an index entry, and use it to read the file directly. This has scaled well for me so far. But do note that you will be turning everything into random access, and hence random reads/writes. That's poor performance for an HDD, but fortunately an SSD helps a lot.
Also, I wouldn't put the files themselves into the MySQL database. You won't be able to do network reads without a client that understands MySQL. Right now I can access any file over the network from any program, because I can just use its network URL.
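The metadata-index idea could be sketched like this, with SQLite standing in for the answer's MySQL; the table name, columns, and sample path are all illustrative assumptions:

```python
import sqlite3

# SQLite stands in here for the MySQL index described above; the
# schema (table name, columns) is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE pdf_index (
                    order_number TEXT PRIMARY KEY,
                    path         TEXT NOT NULL,
                    created_at   TEXT NOT NULL)""")

# Record where each generated PDF landed on the NAS.
conn.execute("INSERT INTO pdf_index VALUES (?, ?, ?)",
             ("A12345", "/nas/pdfs/2f/9c/order_A12345.pdf", "2009-03-14"))

# Look the file up by order number, then read it directly from disk.
row = conn.execute("SELECT path FROM pdf_index WHERE order_number = ?",
                   ("A12345",)).fetchone()
print(row[0])
```

The PRIMARY KEY on the order number also matches the question's resubmission requirement: an upsert on that key repoints the index at the newly generated file.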
Answer 12:
I think like so many others have said, you should do sub-folders but in a way that you can find the data through code. For Example, if datetime works, use that. From reading what you said it would appear that there is some form of hierarchical structure to the reports (daily, weekly, Daily X report, hourly Y report, etc) I would look at the structure of when and why the reports are generated and build my directories up that way.