I have a process that's going to initially generate 3-4 million PDF files, and continue at the rate of 80K/day. They'll be pretty small (50K) each, but what I'm worried about is how to manage the total mass of files I'm generating for easy lookup. Some details:
- I'll have some other steps to run once a file has been generated, and there will be a few servers participating, so I'll need to watch for files as they're generated.
- Once generated, the files will be available through a lookup process I've written. Essentially, I'll need to pull them based on an order number, which is unique per file.
- At any time, an existing order number may be resubmitted, and the generated file will need to overwrite the original copy.
Originally, I had planned to write these files all to a single directory on a NAS, but I realize this might not be a good idea, since there are millions of them and Windows might not handle a million-file-lookup very gracefully. I'm looking for some advice:
- Is a single folder okay? The files will never be listed - they'll only be retrieved using a System.IO.File with a filename I've already determined.
- If I do use a single folder, can I watch for new files with a System.IO.FileSystemWatcher, or will it become sluggish with that many files?
- Should they be stored as BLOBs in a SQL Server database instead? Since I'll need to retrieve them by a reference value, maybe this makes more sense.
Thank you for your thoughts!
My file database contains over 4 million folders, with many files in each folder.
I just tossed all the folders in one directory. NTFS can handle this without any issue, and advanced tools like robocopy can help when you need to move it.
Just make sure you can index the files without a scan. I did this by keeping my index in a MySQL database.
So to get a file, I search the MySQL database on some metadata and get an index. Then I use this index to read the file directly. This has scaled well for me so far. But do note that you will be turning everything into random access, and hence random reads and writes. That performs poorly on HDDs, but fortunately SSDs help a lot.
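As a rough illustration of that lookup, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for MySQL so it runs on its own, and the table name `pdf_index` and its columns are made up for the example, not my actual schema. The point is only that the database returns a path and the file is then read straight from disk.

```python
import sqlite3
from pathlib import Path


def open_index(db_path=":memory:"):
    """Create (or open) the index that maps an order number to a file path."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pdf_index ("
        " order_number TEXT PRIMARY KEY,"
        " file_path TEXT NOT NULL)"
    )
    return conn


def register_file(conn, order_number, file_path):
    """Insert the path for an order number, overwriting any earlier entry."""
    conn.execute(
        "INSERT OR REPLACE INTO pdf_index (order_number, file_path) VALUES (?, ?)",
        (order_number, str(file_path)),
    )
    conn.commit()


def read_file(conn, order_number):
    """Look the path up in the index, then read the file directly from disk."""
    row = conn.execute(
        "SELECT file_path FROM pdf_index WHERE order_number = ?",
        (order_number,),
    ).fetchone()
    if row is None:
        return None  # no such order number in the index
    return Path(row[0]).read_bytes()
```

Because the order number is the primary key, a resubmitted order simply replaces its index row, and no directory ever has to be listed to find a file.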
Also, I wouldn't put the files themselves into the MySQL database. You won't be able to do network reads without a client that understands MySQL. Right now I can access any file over the network using any program, because I can just use its network URL.
When using a database to store your files, especially with small files, the overhead should be small. But you can also do things like:
Or, when you have an expiry date or want to refresh a file, you remove it by:
Question:
Why do these documents need to be generated and stored as PDFs?
If they can be generated, why not just keep the data in the database and generate them on the fly when required? This means you can search the actual data that's required for searching anyway and not have the files on disk. This way you can also update the PDF template when required without the need to regenerate anything?
You can easily organize files into multiple folders without having to do this by business logic, or order-per-day, which is especially nice if that kind of ordering would be 'clumpy' (many hits in one folder, few in others).
The easiest way to do this is to create a unique hash for the file name, so that maybe you get something like this:
Then break this up into two-character blocks, and you will get this:
As you can see, it gives you a deep directory tree that you can easily navigate.
With a good hash function, this will be very evenly distributed, and you will never get more than 1296 entries per directory (36 × 36 possible two-character names, assuming a case-insensitive alphanumeric hash). If you ever get a collision (which should be extremely rare), just add a number to the end: tx.pdf, tx_1.pdf, tx_2.pdf. Again, collisions on such large hashes should be extremely rare, so the kind of clumping you get because of this is a non-issue.
You said that the documents are digitally signed, so you probably have the hash you need right there in the form of the signature string.
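A minimal sketch of the idea, under some assumptions of mine: the key is the order number, the hash is SHA-256 re-encoded in base 36 (digits plus lowercase letters, which is what gives 1296 names per level), and only the first 12 characters are kept. The function names and the 12-character length are made up for the example; the collision suffix (tx_1.pdf) is not shown.

```python
import hashlib
from pathlib import PurePosixPath

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # base 36, case-insensitive


def base36_hash(key, length=12):
    """Hash the key with SHA-256 and return `length` base-36 characters."""
    n = int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest(), "big")
    chars = []
    for _ in range(length):
        n, rem = divmod(n, 36)
        chars.append(ALPHABET[rem])
    return "".join(chars)


def sharded_path(root, key, length=12, ext=".pdf"):
    """Split the hash into two-character blocks: directories, then file name."""
    h = base36_hash(key, length)  # an even length keeps every block at 2 chars
    blocks = [h[i:i + 2] for i in range(0, len(h), 2)]
    # PurePosixPath keeps the example platform-independent; use Path on disk.
    return PurePosixPath(root, *blocks[:-1], blocks[-1] + ext)
```

The same key always maps to the same path, so a resubmitted order lands on top of its original file. A shorter hash gives a shallower tree but makes collisions more likely, so the length is a trade-off to tune.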
I'd group the files in specific subfolders, and try to organize the subfolders in some business-logic way. Perhaps all files made during a given day? During a six-hour period of each day? Or a new folder every N files, I'd say a few thousand at most. (There's probably an ideal number out there; hopefully someone will post it.)
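To make the time-based grouping concrete, here is a small sketch. The folder layout (date, then starting hour of the bucket) and the function name are just one possible choice I picked for illustration. At 80K files a day, one-hour buckets come to roughly 80,000 / 24 ≈ 3,300 files per folder, which is in the "few thousand" range; six-hour buckets would hold about 20,000.

```python
from datetime import datetime
from pathlib import PurePosixPath


def dated_path(root, order_number, created, hours_per_bucket=1):
    """Build root/YYYY-MM-DD/HH/<order_number>.pdf from the creation time."""
    # Round the hour down to the start of its bucket (e.g. 14 -> 12 for 6h).
    bucket_start = created.hour // hours_per_bucket * hours_per_bucket
    return PurePosixPath(
        root,
        created.strftime("%Y-%m-%d"),
        f"{bucket_start:02d}",
        f"{order_number}.pdf",
    )


# A file created at 14:30 with six-hour buckets goes into the "12" folder:
# dated_path("pdfs", "A1", datetime(2024, 3, 5, 14, 30), 6)
#   -> pdfs/2024-03-05/12/A1.pdf
```

One catch: the path can't be derived from the order number alone, so the lookup needs either the original creation time or the stored path.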
Do the files ever age out and get deleted? If so, sort and file by deletable chunk. If not, can I be your hardware vendor?
There are arguments on both sides of storing files in a database.
A last point to worry about is keeping the data "aligned". If the DB stores the info on the file along with the path/name to the file, and the file gets moved, you could get totally hosed.
Determine some logical ordering of subdirectories and store them in blocks of no more than 512 or so files in a folder.
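One way to get such blocks, sketched under the assumption that each file has (or can be assigned) a sequential integer; the order numbers themselves may not be numeric, in which case a counter kept alongside them would be needed. The two-level layout and the names are hypothetical.

```python
from pathlib import PurePosixPath

FILES_PER_FOLDER = 512


def bucketed_path(root, sequence_number, per_folder=FILES_PER_FOLDER):
    """Place file N in folder N // per_folder, itself nested one level up."""
    folder = sequence_number // per_folder    # which block of 512 files
    top, leaf = divmod(folder, per_folder)    # keep folders at 512 per parent too
    return PurePosixPath(
        root, f"{top:04d}", f"{leaf:04d}", f"{sequence_number}.pdf"
    )


# bucketed_path("pdfs", 511) -> pdfs/0000/0000/511.pdf
# bucketed_path("pdfs", 512) -> pdfs/0000/0001/512.pdf
```

With 512 entries at each of the three levels this covers 512 × 512 × 512 (about 134 million) files before the top level grows past 512 folders.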
Do not store the files in a database. Databases are for data, file servers are for files. Store them on a file server, but store the path and retrieval information in a database.