Best way to store/retrieve millions of files when

Posted 2019-04-06 11:45

I have a process that's going to initially generate 3-4 million PDF files, and continue at the rate of 80K/day. They'll be pretty small (50K) each, but what I'm worried about is how to manage the total mass of files I'm generating for easy lookup. Some details:

  1. I'll have some other steps to run once a file has been generated, and there will be a few servers participating, so I'll need to watch for files as they're generated.
  2. Once generated, the files will be available through a lookup process I've written. Essentially, I'll need to pull them based on an order number, which is unique per file.
  3. At any time, an existing order number may be resubmitted, and the generated file will need to overwrite the original copy.

Originally, I had planned to write these files all to a single directory on a NAS, but I realize this might not be a good idea, since there are millions of them and Windows might not handle a million-file-lookup very gracefully. I'm looking for some advice:

  1. Is a single folder okay? The files will never be listed - they'll only be retrieved via System.IO.File, with a filename I've already determined.
  2. If I do use a single folder, can I watch for new files with a System.IO.FileSystemWatcher, or will it become sluggish at that scale? (See the sketch after this list for the access pattern I mean.)
  3. Should they be stored as BLOBs in a SQL Server database instead? Since I'll need to retrieve them by a reference value, maybe this makes more sense.
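To make points 1 and 2 concrete, here's roughly the access pattern I have in mind (a minimal sketch; the share path and the OrderToPath mapping are placeholders, not decisions):

using System;
using System.IO;

class PdfStore
{
    const string Root = @"\\nas\pdfstore";  // placeholder NAS share

    // Deterministic mapping: order number -> file path, so retrieval
    // never needs a directory listing.
    static string OrderToPath(string orderNumber) =>
        Path.Combine(Root, orderNumber + ".pdf");

    static void Main()
    {
        // Lookup by order number.
        byte[] pdf = File.ReadAllBytes(OrderToPath("12345678"));
        Console.WriteLine($"Read {pdf.Length} bytes");

        // Watching for new files. FileSystemWatcher raises an event per
        // file rather than scanning the folder, but its internal buffer
        // can overflow under heavy churn, so handle the Error event too.
        using var watcher = new FileSystemWatcher(Root, "*.pdf");
        watcher.Created += (s, e) => Console.WriteLine($"New: {e.FullPath}");
        watcher.Error += (s, e) => Console.WriteLine("Events lost; rescan needed");
        watcher.EnableRaisingEvents = true;
        Console.ReadLine();  // keep the watcher alive
    }
}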

Thank you for your thoughts!

12 Answers
时光不老,我们不散
#2 · 2019-04-06 12:17

My file database contains over 4 million folders, with many files in each folder.

I just tossed all the folders in one directory. NTFS can handle this without any issue, and tools like robocopy can help when you need to move them.

Just make sure you can index the files without a scan. I did this by putting my index in a MySQL database.

So to get a file, I query the MySQL database on some metadata and get back its index entry. Then I use that to read the file directly. It has scaled well for me so far. But note that this turns everything into random access, and hence random reads/writes. That performs poorly on an HDD, but fortunately an SSD helps a lot.

Also, I wouldn't put the files themselves into the MySQL database. You won't be able to do network reads without a client that understands MySQL. Right now I can access any file over the network from any program, because I can just use its network path.
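In C#, to match the question, the lookup is roughly this (a sketch: the file_index table, its columns, and the MySql.Data client library are my assumptions, not a prescription):

using System.IO;
using MySql.Data.MySqlClient;  // assumed client library (MySql.Data NuGet package)

class FileIndex
{
    // Query the index for the stored path, then read the file directly
    // from the share. Any program that already knows the path can skip the DB.
    public static byte[] GetFile(string connectionString, string orderNumber)
    {
        using var conn = new MySqlConnection(connectionString);
        conn.Open();

        using var cmd = new MySqlCommand(
            "SELECT path FROM file_index WHERE order_number = @order", conn);
        cmd.Parameters.AddWithValue("@order", orderNumber);

        var path = (string)cmd.ExecuteScalar();  // e.g. \\nas\share\ab\cd\x.pdf
        return File.ReadAllBytes(path);          // direct (random) read
    }
}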

做个烂人
#3 · 2019-04-06 12:20

When using a database to store your files, the overhead should be small, especially with small files. You can also do things like:

DELETE FROM BLOBTABLE WHERE NAME LIKE '<whatever>'

Or, when you have an expiry date or want to refresh a file, you can remove it with:

DELETE FROM BLOBTABLE WHERE CREATIONDATE < ...
etc...
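From C#, assuming SQL Server and the stock System.Data.SqlClient provider (BLOBTABLE and its columns as above), the expiry delete might look like this:

using System;
using System.Data.SqlClient;

class BlobCleanup
{
    // One set-based statement removes a whole batch of expired files;
    // the cutoff date is passed as a parameter, not spliced into the SQL.
    public static int DeleteOlderThan(string connectionString, DateTime cutoff)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();

        using var cmd = new SqlCommand(
            "DELETE FROM BLOBTABLE WHERE CREATIONDATE < @cutoff", conn);
        cmd.Parameters.AddWithValue("@cutoff", cutoff);

        return cmd.ExecuteNonQuery();  // rows (files) removed
    }
}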
虎瘦雄心在
#4 · 2019-04-06 12:20

Question:

Why do these documents need to be generated and stored as PDFs?

If they can be generated, why not just keep the data in the database and generate them on the fly when required? That means you can search the actual data that's needed for searching anyway, without keeping the files on disk. You can also update the PDF template when required without having to regenerate anything.
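As a sketch of the idea (everything here is hypothetical: LoadOrderData stands in for your data access, GeneratePdf for whatever PDF library you'd pick):

using System;

class OnDemandPdf
{
    // No stored files: fetch the order data and render the PDF on request.
    // A resubmitted order simply regenerates from the fresh data, so there
    // is nothing on disk to overwrite.
    public static byte[] GetPdf(string orderNumber)
    {
        OrderData data = LoadOrderData(orderNumber);  // hypothetical DB query
        return GeneratePdf(data);                     // hypothetical renderer
    }

    static OrderData LoadOrderData(string orderNumber) =>
        throw new NotImplementedException("your data access layer");

    static byte[] GeneratePdf(OrderData data) =>
        throw new NotImplementedException("your PDF library of choice");
}

class OrderData { /* the fields your template needs */ }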

我想做一个坏孩纸
#5 · 2019-04-06 12:23

You can easily organize files into multiple folders without having to split them by business logic or by order-per-day, which is especially nice if that kind of ordering would be 'clumpy' (many hits in one folder, few in others).

The easiest way to do this is to create a unique hash for the file name, so that maybe you get something like this:

sf394fgr90rtfofrpo98tx.pdf

Then break this up into two-character blocks, and you will get this:

sf/39/4f/gr/90/rt/fo/fr/po/98/tx.pdf

As you can see, it gives you a deep directory tree that you can easily navigate.

With a good hash function, this will be very evenly distributed, and since each two-character block is drawn from 36 possible characters (letters and digits), you will never get more than 36² = 1296 entries per directory. If you ever get a collision (which should be extremely rare), just add a number to the end: tx.pdf, tx_1.pdf, tx_2.pdf. Again, collisions on such large hashes should be extremely rare, so any clumping caused this way is a non-issue.

You said that the documents are digitally signed, so you probably have the hash you need right there in the form of the signature string.
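In C#, the path construction could look like this (a sketch: SHA-256 over the order number and hex output are my choices here; hex pairs cap each level at 256 entries rather than the 1296 that base-36 names allow, and two levels are plenty for a few million files):

using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class HashedStore
{
    // Hash the (unique) order number and use two-character blocks of the
    // hex digest as directory levels. Two levels = 65,536 leaf folders,
    // so 4 million files average roughly 60 per folder.
    public static string PathFor(string orderNumber, string root, int levels = 2)
    {
        byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(orderNumber));
        string hex = Convert.ToHexString(digest).ToLowerInvariant();  // .NET 5+

        var parts = new string[levels + 1];
        for (int i = 0; i < levels; i++)
            parts[i] = hex.Substring(i * 2, 2);           // e.g. "ab", "cd"
        parts[levels] = hex.Substring(levels * 2) + ".pdf";

        return Path.Combine(root, Path.Combine(parts));
    }
}
// PathFor("12345678", @"\\nas\pdfstore") -> \\nas\pdfstore\xx\yy\<rest>.pdf,
// where xx and yy are the first two hex pairs of the digest.

Because the hash is deterministic, a resubmitted order maps to the same path, so the question's overwrite requirement comes for free.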

爷的心禁止访问
#6 · 2019-04-06 12:24

I'd group the files into specific subfolders, and try to organize them (the subfolders) in some business-logic way. Perhaps all files made during a given day? During a six-hour period of each day? Or every N files; I'd say a few thousand max. (There's probably an ideal number out there; hopefully someone will post it.)

Do the files ever age out and get deleted? If so, sort and file by deletable chunk. If not, can I be your hardware vendor?

There are arguments on both sides of storing files in a database.

  • On the one hand you get enhanced security, 'cause it's more awkward to pull the files from the DB; on the other hand, you get potentially poorer performance, 'cause it's more awkward to pull the files from the DB.
  • In the DB, you don't have to worry about how many files per folder, sector, NAS cluster, whatever - that's the DB's problem, and they've probably got a good implementation for this. On the flip side, it'll be harder to manage/review the data, as it'd be a bazillion blobs in a single table, and, well, yuck. (You could partition the table based on the aforementioned business logic, which would make deletion or archiving infinitely easier to perform. That, or maybe partitioned views, since table partitioning has a limit of 1000 partitions.)
  • SQL Server 2008 has FILESTREAM storage (an attribute on varbinary(max) columns that keeps the blobs on the NTFS file system); I don't know much about it, but it might be worth looking into.

A last point to worry about is keeping the data "aligned". If the DB stores the info on the file along with the path/name to the file, and the file gets moved, you could get totally hosed.
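One way to stay aligned, sketched in C# (the FileIndex table and its columns are hypothetical): write the file first, record it second, and remove the orphan file if the record fails.

using System.Data.SqlClient;
using System.IO;

class AlignedStore
{
    // Disk and database must agree on where the file lives. Write the
    // file, then upsert its path; if the DB step fails, delete the file
    // so no unrecorded orphan is left behind.
    public static void Save(string connectionString, string orderNumber,
                            string path, byte[] pdf)
    {
        File.WriteAllBytes(path, pdf);
        try
        {
            using var conn = new SqlConnection(connectionString);
            conn.Open();
            using var cmd = new SqlCommand(
                "UPDATE FileIndex SET Path = @path WHERE OrderNumber = @order; " +
                "IF @@ROWCOUNT = 0 " +
                "INSERT INTO FileIndex (OrderNumber, Path) VALUES (@order, @path);",
                conn);
            cmd.Parameters.AddWithValue("@order", orderNumber);
            cmd.Parameters.AddWithValue("@path", path);
            cmd.ExecuteNonQuery();
        }
        catch
        {
            File.Delete(path);  // roll the file write back
            throw;
        }
    }
}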

贼婆χ
#7 · 2019-04-06 12:29

Determine some logical ordering for subdirectories and store the files in blocks of no more than 512 or so per folder.

Do not store the files in a database. Databases are for data, file servers are for files. Store them on a file server, but store the path and retrieval information in a database.

查看更多